What is "cold start" in Hive and why doesn't Impala suffer from this?

Question

I'm reading the literature on comparing Hive and Impala.

Several sources state some version of the following "cold start" line:

It is well known that MapReduce programs take some time before all nodes are running at full capacity. In Hive, every query suffers this “cold start” problem.

Reference

In my opinion, this is not sufficient to understand what is meant by "cold start". I'm looking for more information and clarity on this.

For context, I'm a data scientist. I write queries and have only a basic understanding of big data concepts.

I've referred to questions that explain why Impala is faster (example), but they don't explicitly address or define cold start.

score 1 · Accepted Answer · answered Dec 20 '21 at 21:27

Every Hive query is executed as one or more MapReduce jobs. Each job carries a fixed overhead: it has to be submitted and scheduled, and worker processes have to start on the cluster nodes before they can begin working on the task. This startup delay is what is known as "cold start". Impala, on the other hand, does not invoke a MapReduce job. It sits directly atop HDFS and reads the data with its own execution engine, so it avoids that overhead. Impala daemon processes are started at boot time and stay running, ready to process queries.

Takeaway: cold start refers to the overhead of launching a MapReduce job before it can do any useful work.
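
To make the effect concrete, here is a minimal toy model in Python. It is only a sketch: the `Engine` class, the startup times, and the scan rate are hypothetical values chosen for illustration, not measurements of Hive or Impala. Both engines are deliberately given the same throughput so that the only difference is the fixed per-query startup cost.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Toy model of a query engine (hypothetical, for illustration only)."""
    name: str
    startup_s: float        # fixed cost paid on every query before any data is read
    scan_rate_gb_s: float   # assumed throughput once the engine is running

    def query_time(self, data_gb: float) -> float:
        # Total latency = per-query startup overhead + time spent scanning data.
        return self.startup_s + data_gb / self.scan_rate_gb_s


def startup_share(engine: Engine, data_gb: float) -> float:
    # Fraction of the total query time that is pure startup overhead.
    return engine.startup_s / engine.query_time(data_gb)


# Made-up numbers: a job-per-query engine pays a large startup cost each time,
# while an engine with always-running daemons pays almost none.
hive_on_mr = Engine("Hive on MapReduce", startup_s=20.0, scan_rate_gb_s=1.0)
impala = Engine("Impala", startup_s=0.5, scan_rate_gb_s=1.0)

for data_gb in (1, 10, 1000):
    for engine in (hive_on_mr, impala):
        print(
            f"{engine.name:18} {data_gb:>5} GB  "
            f"total={engine.query_time(data_gb):8.1f}s  "
            f"startup share={startup_share(engine, data_gb):.0%}"
        )
```

Under these assumed numbers, the startup cost dominates small, interactive queries and becomes negligible for very large scans, which is why cold start matters most for the kind of ad hoc queries a data scientist typically runs.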
