1

Assume there are two queries:

  1. select count(distinct a) from x;

  2. select count(*) from (select distinct a from x) y;

I know they return the same results, but from the perspective of Hive (using MapReduce). Can anyone please explain which one is the better choice and why?

Any help is appreciated.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
SuperDelta
  • 253
  • 2
  • 13

2 Answers2

2

In Hive versions prior 1.2.0 the first query executes using one Map and one Reduce stages. Map sends each value to the single reducer, and reducer does all the job. count(distinct)

Single reducer processing too much data in this case.

During second query execution, mappers output distributed between many reducers, each reducer generates it's distinct list and final map-reduce job does summarize the size of each list. count(*) from (select distinct)

Since Hive 1.2.0 Hive 1.2.0+ provides auto-rewrite optimization hive.optimize.distinct.rewrite=true/false, see HIVE-10568

See also Hortonworks community

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thanks for your reply. I can understand that one reducer processing too much data may cause data skew. But can you please explain more on why the `count(distinct )` uses the multi-mapper & one-reducer stage? – SuperDelta Jul 25 '18 at 02:43
  • @SuperDelta It's data skew may cause reducer(join) processing too much data. And too much data, does not matter skew or not skew caused it, runs slow. And as you can see, final count() stage realized on single reducer. When you put distinct inside, it also executes on the same reducer, exactly as instructed, without query rewrite. As I noted, as of Hive 1.2.0 there is no such problem with count(distinct) because it works smarter. Also map-side aggregation should help a lot w queries like this. – leftjoin Jul 25 '18 at 04:58
  • Thanks for your detailed reply, but I'm still confused: In Hive versions prior 1.2.0, in the 2nd query, `distinct()` is processed via multi-reducer, while in the 1st query, distinct() is processed via only one reducer. Also, multi mappers means multi keys, which should be sent to multi reducers. But in the 1st query, it seems they are forced to be sent to the same reducer.. why? – SuperDelta Jul 26 '18 at 02:56
  • @SuperDelta No, it is count() using one final reducer at the end. and distinct being placed inside count() is being calculated on the same single reducer. Just execute count() and you will see what is happening. It will run MR with single reducer. Also run distinct alone. It runs on many reducers in parallel. – leftjoin Jul 26 '18 at 11:04
  • @SuperDelta And when distinct is calculated, rows are distributed between many reducers by a key and they are calculated independently each it's own set of keys, does not require single reducer. – leftjoin Jul 26 '18 at 11:11
  • Does it mean that when MR sees `count()`, it directly arranges it to one reducer regardless of the `distinct` inside? – SuperDelta Jul 27 '18 at 04:10
  • @SuperDelta For count without group by - yes, single final reducer is necessary. But `Count() group by key` should be able to run on many reducers as well because it is possible to distribute data between reducers by key. And in this case there is no need to have single reducer, you do not need to sum each key count into single result, it may end up with many reducers. Just check how it works. And on different Hive versions it may run differently – leftjoin Jul 27 '18 at 05:01
  • Thanks for your reply man. I want to know why `group by` will trigger multi reducers but I would like to learn it by myself... So can you please recommend some material regarding how MR assign mapper/reducer for HIVE? – SuperDelta Jul 27 '18 at 06:44
  • @SuperDelta http://shop.oreilly.com/product/0636920025122.do, https://cs.uwaterloo.ca/~jimmylin/publications/WWW2013-MapReduce-tutorial-slides.pdf Google about Map-reduce algorithms implementation – leftjoin Jul 27 '18 at 06:53
2

2nd Query: select count(*) from (select distinct a from x) y;

is upto 3.x faster than

1st Query: select count(distinct a) from x;

Please refer to https://issues.apache.org/jira/browse/HIVE-10568

Executed both the queries in Hive, first query executed in 1 stage with 1 reducer.

MapReduce Jobs Launched: Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 46.51 sec HDFS Read: 42857 HDFS Write: 4 SUCCESS Total MapReduce CPU Time Spent: 46 seconds 510 msec

Second query executed in 2 stages, with improved parallelism.

MapReduce Jobs Launched: Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 13.93 sec HDFS Read: 42857 HDFS Write: 115 SUCCESS Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 5.83 sec HDFS Read: 510 HDFS Write: 4 SUCCESS Total MapReduce CPU Time Spent: 19 seconds 760 msec

Lakshman Battini
  • 1,842
  • 11
  • 25