The default number of reducers is 1. The Partitioner ensures that the same key from multiple mappers goes to the same reducer, but that does not mean the number of reducers will be equal to the number of partitions. From the driver, the number of reducers can be set with JobConf's conf.setNumReduceTasks(int num), or with mapred.reduce.tasks on the command line. If only mappers are required, this can be set to 0.
I have read the following about setting the number of reducers:
- The number of reducers should be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node); a rough worked example is given after this list. I also read in the link below that increasing the number of reducers can have overhead:
- Also, I read in the link below that the number of reducers is best set to the number of reduce slots in the cluster (minus a few to allow for failures):
What determines the number of mappers/reducers to use given a specified set of data
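To make the first rule concrete with made-up numbers: on a hypothetical cluster of 10 nodes with 8 reduce slots (maximum containers) each, the rule suggests either 0.95 * 10 * 8 = 76 reducers, so all reduces can launch immediately in a single wave, or 1.75 * 10 * 8 = 140 reducers, so the faster nodes finish their first round and pick up a second wave, which gives better load balancing.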
Based on the range given in 1 and the guidance in 2, how do I decide on the optimal number of reducers for the fastest processing?
Thanks.