
The default value for the number of reducers is 1. The Partitioner makes sure that the same keys from multiple mappers go to the same reducer, but that does not mean the number of reducers will be equal to the number of partitions. From the driver one can specify the number of reducers using JobConf's conf.setNumReduceTasks(int num), or as mapred.reduce.tasks on the command line. If only the mappers are required, then we can set this to 0.
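
For reference, a minimal driver sketch using the newer org.apache.hadoop.mapreduce API (class and job names here are placeholders, not from the question) would set the reducer count like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MyDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "my job");
            job.setJarByClass(MyDriver.class);

            // Explicitly request 10 reduce tasks; 0 would make the job map-only.
            job.setNumReduceTasks(10);

            // ... set mapper/reducer classes and input/output paths as usual ...
        }
    }

The same can be set per run from the command line with -D mapreduce.job.reduces=10 (mapred.reduce.tasks is the older property name), provided the driver parses generic options via ToolRunner/GenericOptionsParser.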

I have read the following regarding setting the number of reducers:

  1. The number of reducers can be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node); see the worked example after this list. I also read in the link below that increasing the number of reducers can have overhead:

Number of reducers in hadoop

  2. Also, I read in the link below that the number of reducers is best set to the number of reduce slots in the cluster (minus a few to allow for failures):

What determines the number of mappers/reducers to use given a specified set of data
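
For a concrete feel of guideline 1), take a hypothetical cluster of 10 nodes with a maximum of 8 containers per node (numbers chosen purely for illustration):

    0.95 * 10 * 8 = 76 reducers    (the lower factor lets all reducers launch in a single wave)
    1.75 * 10 * 8 = 140 reducers   (the higher factor has faster nodes run a second wave, which balances load better)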

Based on the range specified in 1) and on 2), how do I decide on the optimal number for the fastest processing?

Thanks.

Saurabh Rana

1 Answer


I want to know the approach for this in general.

This question can only have an empirical answer. Quoting from an answer in this Q&A:

By default, on 1 GB of data one reducer would be used. [...] Similarly, if your data is 10 GB, then 10 reducers would be used.

The defaults already are the rule of thumb. You can further tweak the default number by running empirical tests and seeing how performance changes. That is, currently, all there is to it.
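
As an illustration of that size-based rule of thumb (a sketch, not code from the answer; the 1 GB-per-reducer target is assumed only for the example), the arithmetic behind it is just a rounded-up division:

    // Sketch of the "roughly one reducer per 1 GB of input" heuristic quoted above.
    public class ReducerCountSketch {
        static int reducersFor(long totalInputBytes, long bytesPerReducer) {
            // Round up, but never drop below one reducer.
            return (int) Math.max(1, (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer);
        }

        public static void main(String[] args) {
            long gb = 1024L * 1024 * 1024;
            System.out.println(reducersFor(1 * gb, gb));   // 1 GB of input  -> 1 reducer
            System.out.println(reducersFor(10 * gb, gb));  // 10 GB of input -> 10 reducers
        }
    }

A value computed this way could then be passed to job.setNumReduceTasks(...) in the driver and adjusted up or down while benchmarking.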

Attersson
  • Thanks. That sounds like a good approach: start with the defaults and make empirical tests by tweaking the default number while observing performance changes. But the number based on 1) and 2) may be very far from the default. 1) The number of reducers can be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node). 2) The number of reducers is best set to the number of reduce slots in the cluster (minus a few to allow for failures). – Saurabh Rana Oct 13 '20 at 11:08
  • Yes, it may vary from the default, and the significant width of the gap is a clear indication that we currently have an empirical approach. – Attersson Oct 13 '20 at 11:40
  • Okay. Apart from starting with the defaults, 1) and 2) are different. I have read about both for the number of reducers and am not clear which of the two to go forward with in the empirical approach while tweaking the default. – Saurabh Rana Oct 13 '20 at 14:14
    Try the ranges in 1). Anyway, just adjust the number of reducers and benchmark the performance. Try any number. – Attersson Oct 13 '20 at 21:22