
What are the factors that decide the number of mappers and reducers to use for a given set of data to achieve optimal performance? I am asking in terms of the Apache Hadoop MapReduce platform.

Abhishek Jain

2 Answers


According to the Cloudera blog:

Have you set the optimal number of mappers and reducers?
The number of mappers is by default set to one per HDFS block. This is usually a good default, but see tip 2.
The number of reducers is best set to be the number of reduce slots in the cluster (minus a few to allow for failures). This allows the reducers to complete in a single wave.
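
For illustration, here is a minimal sketch of setting the reducer count per job via the new mapreduce API (the slot figures are hypothetical and would come from your own cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class WaveSizedJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wave-sized job");

            // Hypothetical figures: 40 reduce slots in the cluster, minus a
            // few spares so a failed task can be rerun within the same wave.
            int reduceSlots = 40;
            int spares = 4;
            job.setNumReduceTasks(reduceSlots - spares);

            // ... set mapper, reducer, input/output paths as usual ...
        }
    }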

Praveen Sripati
  • I guess the number of mappers is set to one per HDFS block because HDFS guarantees that a whole block resides on a single node, so if the map task runs on that node, network I/O is saved thanks to data locality. Anyway, I can accept your answer, and many thanks for it. – Abhishek Jain Oct 18 '12 at 04:39
  • Yes - your understanding is correct. – Praveen Sripati Oct 18 '12 at 05:37
  • This suggestion is wrong for clusters with more than 10 hosts. The number of reducers should be set according to the amount of output data, not to cluster capacity; cluster capacity should only be the upper bound. This matters especially if mapred.reduce.slowstart.completed.maps is low (0.05, for example). If you set 1500 reducers for, say, 1 GB of mapper output, you get 1500 × (number of mappers) HTTP requests and data downloads, and 1500 × (number of mappers) small files across the cluster. – octo Oct 18 '12 at 12:10

Mainly, the number of mappers depends on the number of InputSplits generated by the InputFormat#getSplits method. In particular, FileInputFormat splits the input directory with respect to blocks and files. Gzipped files are not splittable, so a whole gzipped file is passed to a single mapper.

Two files:

f1 [block1, block2]
f2 [block3, block4]

become 4 mappers:

f1 (offset of block1)
f1 (offset of block2)
f2 (offset of block3)
f2 (offset of block4)
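
To see how many map tasks an input would produce, you can call the same method the framework calls; a minimal sketch against the new mapreduce API (the input path comes from the command line):

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration());
            FileInputFormat.addInputPath(job, new Path(args[0]));

            // getSplits is what the framework itself calls to size the map
            // phase; each element of this list becomes one map task.
            List<InputSplit> splits = new TextInputFormat().getSplits(job);
            System.out.println("map tasks: " + splits.size());
        }
    }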

Other InputFormats have their own splitting methods (for example, HBase splits input on region boundaries).

The number of mappers can't be controlled directly, except by using CombineFileInputFormat, which packs multiple blocks into one split. In any case, most map tasks should be executed on the host where the data resides.
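
For example, a minimal sketch using CombineTextInputFormat, the concrete CombineFileInputFormat subclass shipped with newer Hadoop releases (the 256 MB split size is an arbitrary example, tune it per cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineSmallFiles {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");

            // One split (and thus one mapper) now covers many small files,
            // up to the maximum split size set below.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

            // ... set mapper, reducer, input/output paths as usual ...
        }
    }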

The number of reducers is in most cases specified by the user. It mostly depends on the amount of work that needs to be done in the reducers, but it should not be very large, because of the algorithm each mapper uses to distribute its data among the reducers: every mapper partitions its output across all reducers, so too many reducers means many tiny transfers and files. Some frameworks, like Hive, can calculate the number of reducers using an empirical rule of about 1 GB of output per reducer.

General rule of thumb: aim for about 1 GB per reducer, but use no more than roughly 0.8-1.2× your cluster's reduce capacity.
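
Expressed as back-of-the-envelope arithmetic (all figures below are hypothetical):

    // Sketch of the rule of thumb above; plug in numbers from your own jobs.
    public class ReducerEstimate {
        public static void main(String[] args) {
            long mapOutputBytes = 120L * 1024 * 1024 * 1024; // assume ~120 GB of map output
            long bytesPerReducer = 1L * 1024 * 1024 * 1024;  // ~1 GB per reducer
            int clusterReduceSlots = 100;                    // assumed capacity

            int byData = (int) Math.ceil((double) mapOutputBytes / bytesPerReducer);
            int cap = (int) (clusterReduceSlots * 1.2);      // upper bound from the rule
            System.out.println("reducers: " + Math.min(byData, cap)); // prints 120 here
        }
    }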

octo
  • @octo - I was expecting an answer in terms of the optimal number of mappers and reducers to use for a given set of data. I guess it should be related to cluster capacity somehow, so that optimal cluster utilisation is ensured; network bandwidth may also be one of the parameters. I am looking for something along those lines. – Abhishek Jain Oct 17 '12 at 20:06
  • I gave you the answer. The number of mappers mainly depends on the layout of the input data; for reducers, as I said, use 1 GB per reducer. Mappers mostly work well with the same amount of data each, and on most clusters the block size is 64-256 MB, so each mapper receives 1 block. These numbers balance task startup cost against the probability of a slow task. The number of mappers should not be related to the capacity of your cluster; those numbers should minimize network/IO usage. Reducers should not exceed cluster capacity. – octo Oct 17 '12 at 21:18
  • @octo - Your answer seems clearer now. I guess each mapper gets one block because HDFS guarantees a whole block resides on a single node, so if the map task runs on that node, network I/O is saved thanks to data locality. Thanks for the answer, though. Reducers, in any case, should be directly proportional to the number of reduce slots in the cluster. – Abhishek Jain Oct 18 '12 at 04:38