
As I understand it, splitting a file into blocks when copying it into HDFS and computing input splits for mapper input are two entirely different mechanisms.

Here is my question:

Suppose my File1 is 128 MB and was split into two blocks stored on two different data nodes (Node1, Node2) in the Hadoop cluster. I want to run an MR job on this file, and it produces two input splits of 70 MB and 58 MB respectively. The first mapper will run on Node1, taking the 70 MB input split, but Node1 holds only 64 MB of the data; the remaining 6 MB is on Node2.

To complete the map task on Node1, does Hadoop transfer 6 MB of data from Node2 to Node1? If yes, what happens if Node1 does not have enough storage for the 6 MB of data from Node2?

My apologies if my question is awkward.

YoungHobbit

1 Answer


64 MB of data will be written to Node 1 and 64 MB of data will be written to Node 2.

The MapReduce algorithm does not work on the physical blocks of the file; it works on logical input splits. Where a split ends depends on where a record ends, so a record may span two blocks.
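To make the block/split distinction concrete, here is a minimal sketch (not Hadoop's actual code) of how logical splits can be computed purely from file size and split size, with no reference to physical block placement:

```python
# Hypothetical sketch: computing logical input splits for a file,
# independent of its physical HDFS block layout.
def compute_splits(file_size, split_size):
    """Return (offset, length) pairs covering the whole file."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

MB = 1024 * 1024
# The 128 MB file from the question with a 70 MB split size
# yields two splits: 70 MB and 58 MB.
splits = compute_splits(128 * MB, 70 * MB)
print(splits)
```

Note that the split boundaries (0-70 MB, 70-128 MB) do not line up with the 64 MB block boundaries, which is exactly the situation in the question.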

In your example, assume a record starts after 63 MB of data and the record is 2 MB long. In that case, 1 MB of the record is part of Node 1's block and the other 1 MB is part of Node 2's block. That 1 MB of data will be transferred from Node 2 to Node 1 during the map operation.
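The boundary arithmetic in that example can be sketched as follows (illustrative only; Hadoop's RecordReader handles this internally):

```python
# Sketch of the record-spanning-a-block-boundary arithmetic.
MB = 1024 * 1024
block_boundary = 64 * MB   # end of Node 1's block
record_start = 63 * MB     # the record begins 63 MB into the file
record_length = 2 * MB

record_end = record_start + record_length
# Bytes of the record that lie in Node 1's block:
local_bytes = min(record_end, block_boundary) - record_start
# Bytes that must be fetched from Node 2:
remote_bytes = max(0, record_end - block_boundary)

print(local_bytes // MB, remote_bytes // MB)  # 1 1
```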

Have a look at the picture below for a better understanding of logical splits vs physical blocks.

Have a look at these SE questions:

How does Hadoop process records split across block boundaries?

About Hadoop/HDFS file splitting

(diagram: logical input splits vs physical HDFS blocks)

MapReduce data processing is driven by this concept of input splits. The number of input splits calculated for a specific application determines the number of mapper tasks.

Each of these mapper tasks is assigned, where possible, to a slave node where the input split is stored. The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splits are processed locally.
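That locality preference can be sketched as a tiny assignment rule (greatly simplified; the real YARN Resource Manager / JobTracker schedulers weigh many more factors, and the node names here are the hypothetical ones from the question):

```python
# Hedged sketch of locality-aware task assignment: prefer a free node
# that already holds the split's data, otherwise fall back to any node.
def assign_node(split_hosts, free_nodes):
    for node in free_nodes:
        if node in split_hosts:
            return node      # data-local assignment
    return free_nodes[0]     # non-local fallback: data must be transferred

# Split replicated on Node1 and Node2; only Node3 and Node2 are free:
print(assign_node({"Node1", "Node2"}, ["Node3", "Node2"]))  # Node2
```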

If data locality can't be achieved because an input split crosses data-node boundaries, some data will be transferred from one data node to another.
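Applying this to the numbers in the question, a short sketch (assuming fixed 64 MB blocks, and that each mapper runs on the node holding the block where its split starts) shows how many bytes of each split are local versus transferred:

```python
# Hypothetical sketch: per-split local vs remote bytes, given that the
# mapper runs on the node holding the block where the split begins.
MB = 1024 * 1024

def locality(split_offset, split_length, block_size):
    # End of the block containing the split's first byte:
    block_end = (split_offset // block_size + 1) * block_size
    local = min(split_length, block_end - split_offset)
    remote = split_length - local
    return local, remote

# The question's two splits (70 MB and 58 MB) against 64 MB blocks:
for off, length in [(0, 70 * MB), (70 * MB, 58 * MB)]:
    local, remote = locality(off, length, 64 * MB)
    print(local // MB, remote // MB)  # 64 6, then 58 0
```

This reproduces the question's scenario: the first mapper reads 64 MB locally on Node1 and pulls 6 MB from Node2 (streamed for processing, not stored in HDFS on Node1), while the second split is fully local.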

Ravindra babu