
In MRv1 we had the following two configurable parameters to set the number of map and reduce slots per node:

mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

Also, it was advisable to have the number of map slots slightly higher than the number of reduce slots. The ideal number of reducers for a MapReduce job would be equal to or greater than the number of reduce slots available in the cluster.

Please correct me if my understanding above is wrong w.r.t. MRv1...

In MRv2 we don't have the concept of slots anymore; instead, containers provide the required memory and CPU for map/reduce task execution.

Here comes my question: how do we decide on the number of reducers for any MapReduce job in MRv2?

Thanks

vin15

1 Answer


mapred.tasktracker.reduce.tasks.maximum is replaced by

mapreduce.tasktracker.reduce.tasks.maximum

This property denotes the maximum number of reduce slots a given TaskTracker node can run concurrently.

mapred.tasktracker.map.tasks.maximum is replaced by

mapreduce.tasktracker.map.tasks.maximum

This property denotes the maximum number of map slots a given TaskTracker node can run concurrently.

With YARN and MapReduce 2, there are no longer pre-configured static slots for map and reduce tasks. The entire cluster is available for dynamic resource allocation of maps and reduces as needed by the job.

But if you want to set the number of reducers for your job, you can still do so by specifying the following property in your MapReduce job:

mapreduce.job.reduces

Please see this link to know more about it.
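As a sketch, assuming a hypothetical driver class and that the Hadoop client libraries are on the classpath, the reducer count can be set either through the `mapreduce.job.reduces` property or the equivalent `Job.setNumReduceTasks()` API (this fragment configures a job but won't run without a cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Option 1: set the property directly on the configuration.
        conf.setInt("mapreduce.job.reduces", 10);

        Job job = Job.getInstance(conf, "example-job");
        // Option 2: equivalent API call on the Job object.
        job.setNumReduceTasks(10);

        // ... set mapper/reducer classes and input/output paths here, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same property can also be passed on the command line with `-D mapreduce.job.reduces=10` if the driver uses `ToolRunner`.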

The number of mappers is basically determined by the number of input splits of your data. Suppose you are dealing with a 1 GB dataset, the HDFS block size is 128 MB, and you have not specified any split size in your job: then 1 GB / 128 MB = 8 splits will be considered, and 8 mappers will be allocated to the job. But if you specify a split size of 512 MB in your code, then 1 GB / 512 MB = 2 mappers will be considered and allocated to the job.
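The split arithmetic above can be sketched in plain Java (names are illustrative; the ceiling division reflects that a final partial block still gets its own split):

```java
public class SplitMath {
    // Number of input splits (and hence map tasks) for a given input size.
    static long numSplits(long inputBytes, long splitBytes) {
        // Ceiling division: a trailing partial block counts as one more split.
        return (inputBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        long mb = 1024L * 1024;
        System.out.println(numSplits(gb, 128 * mb)); // 8 mappers with the default 128 MB block size
        System.out.println(numSplits(gb, 512 * mb)); // 2 mappers with a 512 MB split size
    }
}
```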

Please see this link to understand more about it.

Sandeep Singh
  • Thanks for the above information. But my question was to understand how to decide on the number of reducers for any MapReduce job. E.g.: for a cluster with 40 datanodes, 12 cores per node, and 96 GB memory per node, what would be the optimal number of reducers? Or if you can explain how many reducers you have used for any of your use cases, it would help to get a clear picture... – vin15 Jun 24 '15 at 07:35
  • Please check my answer here - http://stackoverflow.com/questions/30368437/reducers-for-hive-data/30371178#30371178 Please let me know if you have any confusion to understand it. – Sandeep Singh Jun 24 '15 at 16:19
  • If you want to choose the reducer count yourself, you need to keep the following things in mind: 1. the total input data size (ideally one reducer should be assigned per 1 GB of data); 2. how many keys there are in the reduce phase (at most one reducer per key makes sense). Please see this article to understand more: https://github.com/paulhoule/infovore/wiki/Choosing-the-number-of-reducers – Sandeep Singh Jun 24 '15 at 16:38
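The rule of thumb from the comments above (roughly one reducer per 1 GB of input, and never more reducers than distinct keys) can be sketched as follows; the helper name and numbers are illustrative, not part of any Hadoop API:

```java
public class ReducerHeuristic {
    // Suggest a reducer count: ~1 reducer per GB of input, capped by key count.
    static long suggestReducers(long inputBytes, long distinctKeys) {
        long gb = 1024L * 1024 * 1024;
        long byInput = Math.max(1, (inputBytes + gb - 1) / gb); // ceil(input / 1 GB)
        return Math.min(byInput, distinctKeys); // no point in more reducers than keys
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        System.out.println(suggestReducers(40 * gb, 100_000)); // 40 GB input -> 40 reducers
        System.out.println(suggestReducers(gb / 2, 100_000));  // 512 MB input -> 1 reducer
    }
}
```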