I ran a SQL query like the following:

select count(*) from test_table where columna='a' and columnb in ('test1', 'test2')

On Cloudera, the query takes around 2 minutes in Impala but about 20 minutes in Hive. Is this normal? If so, why does Impala run so much faster than Hive on Cloudera, and in what kinds of scenarios would Hive be faster than Impala?

Thanks.

tk421
  • Hive is for more complex queries and really big data. https://data-flair.training/blogs/impala-vs-hive/ – leftjoin Aug 24 '18 at 17:44
  • The version of Hive bundled by Cloudera will _never_ be faster than Impala, because Impala is sponsored by Cloudera and positioned as a market advantage (by their marketing), while the Hive extensions (Tez, LLAP...) are sponsored by HortonWorks. Also, old-school Hive generates batch jobs, so you have a 20-30s overhead from the start (when you are lucky enough to have available resources), while Impala is built for interactive queries, pre-allocates large amounts of CPU/RAM... and is developed in C++ for performance (at the cost of extensibility). – Samson Scharfrichter Aug 24 '18 at 18:32
  • Impala is a Massively Parallel Processing (MPP) engine and does in-memory processing, thereby giving instant results. Having worked on CDH 5.3.x, I have mainly worked with Hive on MapReduce jobs. Since it's MapReduce, with the disk and network I/O involved, it's comparatively much slower. – Pushkin Aug 26 '18 at 06:40
  • Possible duplicate of [How does impala provide faster query response compared to hive](https://stackoverflow.com/questions/16755599/how-does-impala-provide-faster-query-response-compared-to-hive) – tk421 Aug 27 '18 at 21:51
