
I read the following wiki but am still unable to clarify one thing.

https://wiki.apache.org/hadoop/HadoopMapReduce

Say I have a large file that's broken into two HDFS blocks, and the blocks are physically saved on 2 different machines. Assume there is no node in the cluster that locally hosts both blocks. As I understand it, in the case of TextInputFormat the HDFS block size is normally the same as the split size. Now since there are 2 splits, 2 map instances will be spawned on the 2 separate machines that locally hold the blocks. Now assume that the HDFS text file was broken in the middle of a line to form the blocks. Would Hadoop now copy block 2 from the 2nd machine to the first machine, so it could provide the first line (the broken half) from the 2nd block to complete the last broken line of the first block?

Cœur
Arijit Banerjee
  • Have a look at this http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-records-split-across-block-boundaries – Magham Ravi Jun 27 '13 at 04:17
  • Thanks Magham, that was really helpful. So practically every mapper will have to copy the next block from another datanode. So it's only half local task. – Arijit Banerjee Jun 27 '13 at 04:55
  • Refer to another discussion on the same topic: http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-records-split-across-block-boundaries – Saket Mar 17 '14 at 09:47

1 Answer


Now assume that the HDFS text file had been broken in middle of a line to form the blocks. Would hadoop now copy block 2 from 2nd machine into the first machine so it could provide the first line(broken half) from 2nd block to complete the last broken line of the first block?

Hadoop doesn't copy the blocks to the node running the map task; the blocks are streamed from the data node to the task node (with a sensible transfer chunk size, such as 4 KB). So in the example you give, the map task that processes the first block will read the entire first block, and then stream-read the second block until it finds the end-of-line character. So it's probably 'mostly' local.

How much of the second block is read depends on how long the line is. It's entirely possible that a file split over 3 blocks will be processed by 3 map tasks, with the second map task emitting essentially no records (while still reading all the data in block 2 and some of block 3), if a line starts in block 1 and ends in block 3.
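The record-assignment rule above can be sketched with a small simulation. Note this is a simplified model of what TextInputFormat's line reader does, not Hadoop's actual code; the file contents, block sizes, and function name are made up for illustration:

```python
def split_records(data, start, end):
    """Simplified model of TextInputFormat's line reader:
    a split owns every line that starts inside [start, end)."""
    pos = start
    if start != 0:
        # A previous split owns any line already in progress, so scan
        # forward (possibly past 'end') to the next newline and skip it.
        nl = data.find("\n", start)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    while pos < end:
        # This read may cross the split boundary into the next block.
        nl = data.find("\n", pos)
        line_end = len(data) if nl == -1 else nl
        records.append(data[pos:line_end])
        pos = line_end + 1
    return records

# Two blocks of 10 bytes; "hello world" straddles the block boundary.
data = "hello world\nfoo bar\n"
print(split_records(data, 0, 10))    # ['hello world']  (reads into block 2)
print(split_records(data, 10, 20))   # ['foo bar']      (skips the broken half)

# Three blocks of ~5 bytes; one line runs from block 1 into block 3.
data3 = "aa\nbbbbbbbbbb\ncc\n"
print(split_records(data3, 0, 5))    # ['aa', 'bbbbbbbbbb']
print(split_records(data3, 5, 10))   # []  (scans block 2 and into 3, no records)
print(split_records(data3, 10, 17))  # ['cc']
```

The second demo mirrors the 3-block case: the middle split scans its whole block looking for a newline, crosses into block 3 to finish the skip, and ends up emitting nothing, because every line either started before its split or after it.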

Hope this makes sense.

Chris White
  • Yes, streaming transfer makes sense. Great explanation. – Arijit Banerjee Jun 28 '13 at 04:59
  • Now in your example, where a single huge line is spread across 3 blocks and ends somewhere in block 3 - I understand that the 2nd mapper will read its own input split, i.e. the second block (but just skip it). But why would the second mapper go to block 3? – Arijit Banerjee Jun 28 '13 at 05:05
  • It wouldn't go into block 3 unless it is currently processing a line from block 2 and is looking for the EOL character for that record. Map task 2 will stream over block 2, never find an EOL character, and terminate when it reaches the end of block 2. – Chris White Jun 28 '13 at 10:26