
In Hive I often run queries like:

select columnA, sum(columnB) from ... group by ...

I have read some MapReduce examples, and it seems that one reducer can only produce one key. So the number of reducers would depend entirely on the number of distinct keys in columnA.

If that is so, why does Hive let you set the number of reducers manually?

If there are 10 different values in columnA and I set the number of reducers to 2, what will happen? Will each reducer be reused 5 times?

If there are 10 different values in columnA and I set the number of reducers to 20, what will happen? Will Hive only generate 10 reducers?

  • The number of reducers is not necessarily the same as the number of keys, but it is guaranteed that a given key will be processed by the same reducer. See the difference? – mangusta Jun 15 '20 at 05:32
  • So if there are 10 distinct keys in "col_A" and the number of reducers is 2, then N keys will be processed by reducer_1 and the remaining (10-N) keys by reducer_2. The value of N (i.e. how the keys are distributed across reducers) is determined by Hadoop. – mangusta Jun 15 '20 at 05:35
  • Setting the number of reducers to a value larger than the number of distinct keys has no effect, since the job requires at most as many reducers as there are distinct values (not more). – mangusta Jun 15 '20 at 05:36

1 Answer


Normally you should not set the exact number of reducers manually. Use hive.exec.reducers.bytes.per.reducer instead:

--The number of reduce tasks is determined at compile time from the estimated input size
--The default is 1G per reducer, so if the estimated input size is 10G then 10 reducers will be used
set hive.exec.reducers.bytes.per.reducer=67108864; --64 MB per reducer
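
With the 64 MB value above, for example, an input estimated at 640 MB would be planned with roughly 10 reducers (640 MB / 64 MB), subject to the cap described next.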

If you want to limit the cluster resources used by a job's reducers, you can also cap the reducer count with hive.exec.reducers.max.
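
For example (the value below is only illustrative; the default in recent Hive versions is 1009):

--never start more than 100 reducers, regardless of the size-based estimate
set hive.exec.reducers.max=100;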

If you are running on Tez, Hive can adjust the number of reducers dynamically at execution time if this property is set:

set hive.tez.auto.reducer.parallelism=true;

In this case the number of reducers initially started may be higher than necessary, because it is estimated from data size; at runtime the extra reducers can be removed.

One reducer can process many keys; how many depends on the data size and on the bytes-per-reducer and max-reducer settings. For a query like yours, all rows with a particular key are routed to the same reducer: each reducer container runs in isolation, so a single reducer must receive every row for a key in order to compute the aggregate for that key.
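
For background, this key routing is plain MapReduce behavior rather than anything Hive-specific: the default HashPartitioner picks a reducer as (key.hashCode() & Integer.MAX_VALUE) % numReducers, which is why keys are spread across reducers while any given key always lands on the same one.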

Extra reducers can be forced (mapreduce.job.reduces=N) or started because of a wrong estimate (e.g. from stale stats); if they are not removed at run time they will do nothing and finish quickly, because there is nothing for them to process. But such reducers are still scheduled and containers are still allocated, so it is better not to force extra reducers and to keep stats fresh for better estimation.
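
To tie this back to the question, here is a sketch of what happens if you force 20 reducers for a group by over 10 distinct keys (the table name t is hypothetical, and forcing the count is shown only for illustration):

--force 20 reducers (for illustration only; normally let Hive estimate)
set mapred.reduce.tasks=20;
select columnA, sum(columnB) from t group by columnA;
--10 reducers receive keys; the other 10 start, find no input, and finish quickly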
