
I want to understand how a MapReduce job processes data blocks. As per my understanding, one mapper is invoked for each data block. Let me put my query with an example.

Suppose I have a long text file stored in HDFS as 4 blocks (64 MB each) on 4 nodes (let's forget about replication here).

In this case, 4 map tasks will be invoked, one on each machine (all 4 data nodes). Here is my question: this splitting may leave a partial record spread across two blocks. For example, the last record of block 1 may be stored partially at the end of block 1 and the remaining part at the beginning of block 2. In this case, how does the MapReduce framework ensure that the complete record gets processed?

I hope I have been able to put my query clearly.

  • Have a look at [this answer](http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries/14540272#14540272). It is a very thorough explanation. – PetrosP May 12 '16 at 09:53

1 Answer


Please read this article at http://www.hadoopinrealworld.com/inputsplit-vs-block/

This is an excerpt from the same article I posted above.

A block is a hard division of data at the block size. So if the block size in the cluster is 128 MB, each block of the dataset will be 128 MB, except for the last block, which could be smaller if the file size is not evenly divisible by the block size. A block is therefore a hard cut at the block size, and a block can end before a logical record ends.
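To make that arithmetic concrete, here is a minimal sketch in plain Java (this is not the Hadoop API, and the 300 MB file size is a made-up example):

```java
// Minimal sketch of the block-size arithmetic described above.
public class BlockMath {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB, as in the example
        long fileSize  = 300L * 1024 * 1024; // hypothetical 300 MB file

        long fullBlocks = fileSize / blockSize; // 2 full 128 MB blocks
        long lastBlock  = fileSize % blockSize; // 44 MB remainder becomes the last, smaller block

        System.out.printf("%d full blocks of %d bytes, plus a final block of %d bytes%n",
                fullBlocks, blockSize, lastBlock);
    }
}
```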

Consider that the block size in your cluster is 128 MB and each logical record in your file is about 100 MB (yes, huge records). The first record fits perfectly in block 1, since the record size of 100 MB is well within the 128 MB block size. However, the second record cannot fit entirely in block 1, so record 2 starts in block 1 and ends in block 2. If you assign a mapper to block 1 alone, it cannot process record 2, because block 1 does not contain the complete record. That is exactly the problem InputSplit solves. In this case, InputSplit 1 will contain both record 1 and record 2. InputSplit 2 does not start with record 2, since record 2 is already included in InputSplit 1; InputSplit 2 will contain only record 3. As you can see, record 3 is divided between blocks 2 and 3, but InputSplit 2 will still have the whole of record 3.
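The way Hadoop's text input achieves this comes down to two simple rules in its record reader. Below is a simplified, self-contained sketch of those rules over a local file; the real class is LineRecordReader, this is not its actual code, and the file path and split offsets are hypothetical:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Simplified sketch (not the real Hadoop LineRecordReader) of the two
// rules that guarantee each line is processed exactly once, even when
// it straddles a block/split boundary.
public class SplitReaderSketch {

    public static void readSplit(String path, long start, long end) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(start);

            // Rule 1: unless this split starts at byte 0 of the file, the
            // first (possibly partial) line belongs to the previous split,
            // so skip it.
            if (start != 0) {
                in.readLine();
            }

            // Rule 2: read every line that STARTS at or before 'end'. The
            // last such line may run past 'end' into the next block; HDFS
            // streams those remote bytes transparently. The next split's
            // Rule 1 then discards that same line, so nothing is read twice.
            while (in.getFilePointer() <= end) {
                String line = in.readLine();
                if (line == null) {
                    break; // reached end of file
                }
                System.out.println("record: " + line); // a real mapper receives this as its input value
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical: process the second 64 MB split of a local text file.
        long splitSize = 64L * 1024 * 1024;
        readSplit("/tmp/longfile.txt", splitSize, 2 * splitSize);
    }
}
```

Under these rules, record 2 from the example above is read in full by the mapper for InputSplit 1 and skipped by the mapper for InputSplit 2, so every record is processed exactly once.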

Blocks are physical chunks of data stored on disk, whereas an InputSplit is not a physical chunk of data. An InputSplit is a Java class with pointers to start and end locations within blocks. So when a mapper reads the data, it knows exactly where to start reading and where to stop. An InputSplit can start in one block and end in another.
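For reference, a mapper can inspect these pointers directly. Assuming the newer org.apache.hadoop.mapreduce API and a file-based input format (so the split arrives as a FileSplit), a sketch looks like this:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Logs the "pointers" of the InputSplit this mapper was assigned:
// a file path, a start offset, and a length.
public class SplitInfoMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("file:   " + split.getPath());
        System.out.println("start:  " + split.getStart());  // byte offset where the split begins
        System.out.println("length: " + split.getLength()); // may not line up with a block boundary
    }
}
```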

Happy Hadooping

– sutterhome1971