Performance of Apache Drill

Question

Are there any genuine performance benchmarks that compare Stinger, Impala, and Drill? Also, which is preferred? My use case is mainly ad-hoc interactive queries on top of Hive. Thanks.

score 5 · Accepted Answer · answered Aug 26 '15 at 18:16

There are some performance numbers on the site http://allegro.tech/fast-data-hackathon.html.

In general, we see Drill and Impala as comparable in performance for interactive queries. Drill's differentiators are its ability to query data without metadata definitions and its ease of use with JSON data.

Note that these tests ran on much older versions of Drill, such as 0.8/0.9, which were also not configured appropriately for data locality. Drill is now at 1.1, with many improvements in SQL (window functions, etc.) and performance.

Neeraja Rentachintala

Thanks for your reply. What are your views on Stinger.next? How does it compare against Drill? Are there any benchmarks to determine which is faster? – Sai Aug 27 '15 at 03:04
Also, can Drill perform when dealing with datasets of TBs? I read that Impala and Presto are not suitable for complicated queries on huge datasets. – Sai Aug 27 '15 at 03:18

score 2 · Answer 2 · answered Oct 07 '16 at 09:44

You cannot benchmark like this; it makes no sense, and you should never trust such a benchmark.

Everything depends on your own data. Do you have JSON files? Prefer Drill. Do you want to query more than 1 TB? Prefer Hive, and so on.

Also consider the file or storage format: JSON, Kudu, Parquet, or ORC.

Then comes optimization. Hive+Tez seems better for parallel queries but is very slow for a single query, whereas Impala is the opposite (MapReduce versus massively parallel processing).

Also consider the hardware resources: SSD disks or not, etc.

I recommend starting with Apache Drill on JSON files, then trying Apache Drill with Parquet or ORC.

If you want help, describe exactly what you have (data + hardware) and what you want.
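Since published benchmarks may not reflect your data, one way to act on the advice above is to time the same query against each format on your own cluster. The following is only a minimal sketch: `run_query` is a hypothetical callable you supply (anything that submits SQL to your engine and waits for the result), and `time_query` and `compare_queries` are illustrative names, not part of Drill.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[str], object], sql: str, repeats: int = 5) -> float:
    """Return the median wall-clock seconds over `repeats` runs of `sql`.

    `run_query` is a hypothetical, user-supplied function that executes
    the SQL on your engine and blocks until the result is complete.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)
        samples.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to a slow first (cold) run.
    return statistics.median(samples)


def compare_queries(
    run_query: Callable[[str], object],
    queries: Dict[str, str],
    repeats: int = 5,
) -> Dict[str, float]:
    """Time each labelled query and return {label: median seconds}."""
    return {
        label: time_query(run_query, sql, repeats)
        for label, sql in queries.items()
    }
```

For example, you could pass two labelled queries such as ``SELECT COUNT(*) FROM dfs.`/data/events.json` `` and ``SELECT COUNT(*) FROM dfs.`/data/events.parquet` `` (both paths are made up here) and compare the two medians. Results only mean something for your own data, queries, and hardware.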

Thomas Decaux

Hi Thomas, I am trying to run large Drill queries on a single node with 512 GB RAM and 48 CPUs. The query takes too long for around 30 GB of data: more than 1 hour to finish aggregating all records. Are there any tuning parameters I should check for this? – Srihari Karanth Jan 16 '17 at 09:59
1 node? You must understand what Drill is: like PrestoDB, Impala, etc., it's an MPP (massively parallel processing) engine, so it's better to have several nodes ^^ – Thomas Decaux Jan 16 '17 at 10:38
Since we have 48 CPUs, can we parallelize across them? – Srihari Karanth Jan 16 '17 at 11:51
I guess what he could have said is that the point of Drill is to distribute the work among many small, cheap workers to process huge amounts of data. If all your data fits in memory, you might be better off using something else; there are some great in-memory databases. – Dobes Vandermeer Oct 07 '18 at 15:16
