
I have 4 nodes and I am running a MapReduce sample project to see if the job is being distributed among all 4 nodes. I ran the project multiple times and noticed that the mapper task is split among all 4 nodes, but the reducer task is only run by one node. Is this how it is supposed to be, or is the reducer task supposed to be split among all 4 nodes as well?

Thank you

user3088354
  • A very similar question to the one you have just asked can be found here http://stackoverflow.com/questions/6885441/setting-the-number-of-map-tasks-and-reduce-tasks – Sudarshan Apr 29 '14 at 08:36

2 Answers


The distribution of mappers depends on which block of data each mapper will operate on. By default, the framework tries to assign a map task to a node that has the corresponding block stored locally, which avoids transferring the data over the network.

For reducers, it again depends on the number of reducers your job requires. If your job uses only one reducer (which is the default), it may be assigned to any of the nodes.

Speculative execution also affects this. If it is turned on, multiple instances of a map or reduce task may be started on different nodes; the JobTracker then decides, based on percentage completion, which instance goes through, and the other instances are killed.
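If you want to rule speculative execution out while checking where your tasks actually run, it can be switched off per job. This is only a minimal sketch, not part of the original answer, and it assumes the Hadoop 2.x property names; Hadoop 1.x uses mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution instead:

    // Disable speculative execution so each map/reduce task is attempted only once
    // (classes: org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job).
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", false);     // Hadoop 2.x property name
    conf.setBoolean("mapreduce.reduce.speculative", false);   // Hadoop 2.x property name
    Job job = Job.getInstance(conf, "sample-job");             // "sample-job" is just a placeholder name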

Venkat

Let us say you have a 224 MB file. When you add that file to HDFS, with the default block size of 64 MB it is split into 4 blocks [blk1=64M, blk2=64M, blk3=64M, blk4=32M]. Assume blk1 is on node1, represented as blk1::node1, and likewise blk2::node2, blk3::node3, blk4::node4. When you run the MR job, the map phase needs to access the input file, so the MR framework creates 4 mappers, one executed on each node.

Now comes the reducer: as Venkat said, it depends on the number of reducers configured for your job. The reducer count can be set with the Hadoop org.apache.hadoop.mapreduce.Job setNumReduceTasks(int tasks) API, as in the sketch below.
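For completeness, here is a minimal driver sketch, not from the original answer: it uses the identity Mapper/Reducer base classes just to show the wiring, and the class and job names are made up. The one line that matters for this question is setNumReduceTasks(4).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiReducerDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "multi-reducer-sample");
            job.setJarByClass(MultiReducerDriver.class);

            // Identity mapper/reducer stand in for the classes of your own sample project.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);

            // With the default TextInputFormat the records are <LongWritable, Text>,
            // and the identity classes pass them through unchanged.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Ask for 4 reduce tasks so the reduce phase can be spread over the
            // 4 nodes; the scheduler still decides where each task actually runs.
            job.setNumReduceTasks(4);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that requesting 4 reducers does not pin one reducer to each node; it only gives the framework 4 reduce tasks to schedule, and their placement depends on where slots/containers are available.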