- 1. Start in minutes
- 2. JSON model without schema
- 3. Querying complex, semi-structured data in-situ
- 4. Real SQL – not “SQL-like”
- 5. Take advantage of standard BI tools
- 6. Interactive queries on Hive tables
- 7. Access multiple data sources
- 8. User Defined Functions (UDFs)
- 9. High performance
- 10. Scales up to a 1000-node cluster from a single laptop
1. Start in minutes
It takes only a few minutes to start working with Apache Drill. Just untar it on your Mac or Windows laptop and run a query on a local file. No need to set up any infrastructure. No need to define schemas. Just point at your data and query!
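As a minimal sketch of that first query: Drill ships with a sample `employee.json` file on its classpath, reachable through the built-in `cp` workspace, so after launching the embedded shell with `bin/drill-embedded` you can immediately run:

```sql
-- Query the sample data bundled on Drill's classpath (cp workspace):
SELECT full_name, position_title
FROM cp.`employee.json`
LIMIT 3;
```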
2. JSON model without schema
Drill is the world’s first and only distributed SQL engine with no schema requirements. It shares the same schema-free JSON document model as MongoDB and Elasticsearch. Instead of defining schemas, converting data (ETL), and spending weeks or months maintaining those schemas, simply point Drill at your data (a file, directory, HBase table, etc.) and run your queries. Drill automatically understands the structure of the data. This self-service approach reduces the burden on IT and improves the efficiency and agility of analysts and developers.
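For example, assuming a JSON file at a hypothetical path `/tmp/users.json` with `name`, `age`, and a nested `address` record, you can query it directly through the `dfs` storage plugin with no schema registered in advance:

```sql
-- No CREATE TABLE, no schema registration: structure is inferred at read time.
-- A table alias (t) is used to reach into the nested address record.
SELECT t.name, t.address.city
FROM dfs.`/tmp/users.json` t
WHERE t.age > 30;
```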
3. Querying complex, semi-structured data in-situ
Drill’s schema-free JSON model allows you to query complex, semi-structured data in situ. There is no need to flatten or transform the data before or during query execution. Drill also provides intuitive SQL extensions for working with nested data.
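One such extension is Drill’s `FLATTEN` function. As a sketch (the file path and field names are illustrative), it expands each element of a repeated JSON array into its own output row:

```sql
-- FLATTEN turns each element of the repeated `orders` array into a row;
-- dotted paths and array indexes reach into nested structures.
SELECT t.name,
       FLATTEN(t.orders) AS order_rec
FROM dfs.`/tmp/customers.json` t;
```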
4. Real SQL – not “SQL-like”
Drill supports the standard SQL:2003 syntax. There is no need to learn a new “SQL-like” language or struggle with a semi-functional BI tool. Drill supports data types such as DATE, INTERVAL, TIMESTAMP, VARCHAR, and DECIMAL, as well as complex query constructs such as correlated subqueries and joins in WHERE clauses.
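A sketch of what that looks like in practice, assuming a hypothetical `/tmp/orders.json` file with `order_id`, `order_date`, `amount`, and `region` fields:

```sql
-- A correlated subquery: each order is compared against the average
-- amount for its own region; CAST exercises standard DATE/DECIMAL types.
SELECT o.order_id,
       CAST(o.order_date AS DATE)        AS order_date,
       CAST(o.amount AS DECIMAL(10, 2))  AS amount
FROM dfs.`/tmp/orders.json` o
WHERE o.amount > (SELECT AVG(i.amount)
                  FROM dfs.`/tmp/orders.json` i
                  WHERE i.region = o.region);
```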
5. Take advantage of standard BI tools
Drill works with standard BI tools. You can keep using your favorite tools like Tableau, MicroStrategy, QlikView, and Excel. There is no need to introduce yet another visualization or dashboard tool. Combining a self-service BI tool with a self-service SQL engine enables true self-service data exploration.
6. Interactive queries on Hive tables
Apache Drill lets you leverage your existing investments in Hive. You can run interactive queries with Drill on your Hive tables and access all Hive input/output formats (including custom SerDes). You can join tables from different Hive metastores, and you can join a Hive table with an HBase table or a directory of log files.
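As a sketch of a Hive-to-files join (this assumes a storage plugin configured under the name `hive`, plus illustrative table and column names; a directory of delimited log files surfaces its fields through Drill’s `columns` array):

```sql
-- Join a Hive table against a directory of delimited log files.
SELECT c.customer_name,
       l.columns[1] AS event
FROM hive.customers c
JOIN dfs.`/var/log/app/` l
  ON c.customer_id = CAST(l.columns[0] AS INT);
```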
7. Access multiple data sources
Drill is designed with extensibility in mind. It provides out-of-the-box connectivity to file systems (local or distributed file systems such as S3, HDFS, and MapR-FS), HBase, and Hive. You can implement a storage plugin to make Drill work with any other data source. Drill can combine data from multiple data sources in a single query without central metadata definitions.
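A sketch of a single query spanning two sources (the plugin names `hbase` and `hive`, and all table and column names, are assumptions about how your cluster is configured; HBase stores bytes, which Drill decodes with `CONVERT_FROM`):

```sql
-- Cross-source join: an HBase table and a Hive table in one query,
-- with no central metadata layer in between.
SELECT CONVERT_FROM(h.profile.name, 'UTF8') AS name,
       t.total_spend
FROM hbase.users h
JOIN hive.transactions t
  ON CONVERT_FROM(h.row_key, 'UTF8') = t.user_id;
```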
8. User Defined Functions (UDFs)
Drill provides a simple, high-performance Java API for building custom functions (UDFs and UDAFs), so you can add your own business logic. If you have already built UDFs in Hive, you can reuse them with Drill without making any changes.
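Once a function’s jar is on Drill’s classpath, calling it is plain SQL. As a sketch, with a hypothetical Hive UDF named `mask_ssn` (the function, table, and column names are all illustrative):

```sql
-- mask_ssn is a hypothetical Hive UDF on Drill's classpath;
-- Drill resolves and executes Hive UDFs without code changes.
SELECT mask_ssn(e.ssn) AS masked_ssn
FROM hive.employees e;
```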
9. High performance
Drill is designed from the ground up for high throughput and low latency. It does not use a general-purpose execution engine such as MapReduce, Tez, or Spark. As a result, Drill can offer its unique flexibility (the schema-free JSON model) without sacrificing performance. Drill’s optimizer uses rule-based and cost-based techniques, as well as data locality and operator pushdown (the ability to push parts of a query down into the back-end data sources). Drill also provides a columnar, vectorized execution engine, resulting in higher memory and CPU efficiency.
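You can inspect what the optimizer decided, including pushed-down filters and projections, with Drill’s `EXPLAIN` statement (the file path here is illustrative):

```sql
-- EXPLAIN PLAN FOR prints the logical and physical plan for a query,
-- making operator pushdown visible.
EXPLAIN PLAN FOR
SELECT t.name FROM dfs.`/tmp/users.json` t WHERE t.age > 30;
```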
10. Scales up to a 1000-node cluster from a single laptop
Drill is available as a simple download that you can run on your laptop. When you are ready to analyze larger datasets, simply deploy Drill on your Hadoop cluster (up to 1,000 commodity servers). Drill uses the aggregate memory of the cluster to execute queries with an optimistic pipelined model, and automatically spills to disk when a working set does not fit in memory.
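Moving from laptop to cluster is mostly configuration. A minimal sketch of `conf/drill-override.conf` for distributed mode: each Drillbit gets the same cluster ID and the address of the ZooKeeper quorum (the hostnames below are placeholders):

```
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "zk1:2181,zk2:2181,zk3:2181"
}
```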