Do splits like FileSplit in Hadoop change the blocks?

Question

First question: I want to know whether splits change the blocks in any way (i.e. change their size, shift a block to another location, create new blocks, ...).

Second question: I think splits don't change the blocks; instead, they specify where each MapTask should run on the cluster, for data locality or rack awareness. The DataNodes are already running and already hold the blocks, so I think the splits tell Hadoop to run the MapTask on or near the node that contains the data. Note: the InputSplit contains a location/host, which I think serves this purpose. Please correct me if I am wrong.

Third question: Before the task actually executes, will the blocks move to where the MapTask is, or will the MapTask move to where the blocks are (i.e. the location of the DataNode)?

score 0 · Accepted Answer · edited May 23 '17 at 10:30

For your first and second questions:

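Blocks don't change with splits. A split is only a logical description of the input for one map task; creating it does not resize, move, or create blocks. When an input split is read, some data may be copied from a block on one DataNode to the node on which the map task is running (if a record overlaps multiple blocks).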
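To make "logical" concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The `SplitSketch` class and its `Split` type are hypothetical stand-ins, not Hadoop classes. The size rule follows the familiar `max(minSize, min(maxSize, blockSize))` shape used by `FileInputFormat`, and the sketch leaves out details such as the tolerance for a slightly larger last split. The point is that a split is just an offset, a length, and a list of host hints.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only; this is not Hadoop's FileSplit class.
class SplitSketch {
    static final class Split {
        final long start;      // byte offset of the split within the file
        final long length;     // number of bytes the split covers
        final String[] hosts;  // hint: nodes that hold the underlying block

        Split(long start, long length, String[] hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Split size rule: clamp the block size between a minimum and a maximum.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Walks the file in splitSize steps. Only metadata is produced:
    // no block is read, resized, moved, or created here.
    static List<Split> computeSplits(long fileLength, long blockSize,
                                     long splitSize, String[][] blockHosts) {
        List<Split> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            int blockIndex = (int) (offset / blockSize); // block containing the split start
            splits.add(new Split(offset, length, blockHosts[blockIndex]));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed layout: a 300 MB file stored as three blocks on three nodes.
        String[][] blockHosts = { {"node-a"}, {"node-b"}, {"node-c"} };
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        for (Split s : computeSplits(300 * mb, 128 * mb, splitSize, blockHosts)) {
            System.out.println(s.start + " +" + s.length + " @ " + String.join(",", s.hosts));
        }
    }
}
```

The host list is only a scheduling hint: it tells the framework where a map task would find its data locally, which is the role the question attributes to the location/host field of InputSplit.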

For your third question (do the blocks move to the MapTask, or the MapTask to the blocks?):

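Hadoop tries to run the MapTask on a node that holds the block, so the block itself is not moved. If a MapTask is reading Block-A on DataNode A and the last record in Block-A continues into Block-B on DataNode B, then only that part of Block-B is copied over the network to the mapper (on DataNode A).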
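The boundary case can be sketched in plain Java as well. This is a simplified model loosely based on how a line-oriented record reader behaves, not Hadoop's actual reader; `BoundaryRecordSketch` and `readSplit` are hypothetical names, and the "file" is just an in-memory byte array. A split that does not start at offset 0 skips its partial first line, and every split finishes the line it is in the middle of, even when that means reading past its own end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of reading newline-delimited records from one split.
class BoundaryRecordSketch {
    static List<String> readSplit(byte[] file, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Skip the first (possibly partial) line; the previous split reads it.
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> records = new ArrayList<>();
        // A line that starts at or before 'end' is read in full, even if it
        // runs past 'end' into bytes that belong to the next split or block.
        while (pos <= end && pos < file.length) {
            int lineStart = pos;
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            records.add(new String(file, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // Assumed boundary at byte 8, which falls in the middle of "bravo".
        System.out.println(readSplit(file, 0, 8));
        System.out.println(readSplit(file, 8, file.length - 8));
    }
}
```

In this model the first split reads the whole of "bravo" although the record crosses the boundary, and the second split starts at "charlie". On a real cluster the bytes past the boundary would be the part fetched remotely from the DataNode holding the next block.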

Refer to the question below for a better understanding of input splits and data blocks:

How does Hadoop perform input splits?
