
When executing a MapReduce job, Hadoop divides the input data into N InputSplits and then starts N corresponding map tasks to process them.
1. How is the data divided (split into different InputSplits)?
2. How is each split scheduled (how is it decided which TaskTracker machine the map task for a split should run on)?
3. How is the divided data read?
4. How are reduce tasks assigned?
In Hadoop 1.x:
(architecture diagram)
In Hadoop 2.x:
(architecture diagram)

These questions are related, so I am asking them together; feel free to answer whichever part you know best.

Thanks in advance.

HbnKing

1 Answer


Data is stored and read in HDFS blocks of a predefined size. InputSplits are computed over those blocks, and the various RecordReader implementations scan a split byte by byte, tracking how many bytes they have consumed in order to decide where each record ends and when the split is exhausted.
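To make the split computation concrete, here is a simplified sketch (not Hadoop source) of how a FileInputFormat-style split size and the resulting splits are derived. The parameter names stand in for the `mapreduce.input.fileinputformat.split.minsize`/`maxsize` settings and the HDFS block size; it ignores the small "slop" factor Hadoop uses to avoid a tiny trailing split.

```python
def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    # Hadoop clamps the block size between the configured min and max:
    #   splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def compute_splits(file_length, split_size):
    """Return (offset, length) pairs covering the whole file."""
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits
```

For example, a 300 MB file with a 128 MB block size (and default min/max) yields three splits of 128 MB, 128 MB, and 44 MB.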

A good exercise for understanding this better is to implement your own RecordReader and create small and large files containing one small record, one large record, and many records. In the many-records case, try placing a record so it straddles two blocks; that test case should behave the same as a single large record spanning two blocks.
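The key trick such a RecordReader needs is the boundary rule that line-oriented readers (like Hadoop's LineRecordReader) follow, so a record straddling two splits is read exactly once. A minimal sketch of that rule, operating on an in-memory byte buffer for illustration:

```python
def read_split_lines(data: bytes, start: int, length: int):
    """Yield the lines 'owned' by the split [start, start + length)."""
    pos = start
    end = start + length
    # Rule 1: unless the split begins at byte 0, skip the first
    # (possibly partial) line -- the previous split owns it.
    if start != 0:
        nl = data.find(b"\n", pos)
        if nl == -1:
            return
        pos = nl + 1
    # Rule 2: keep emitting whole lines; the last line is allowed to
    # run past the split boundary into the next block.
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]
            return
        yield data[pos:nl]
        pos = nl + 1
```

With `data = b"aa\nbbbb\ncc\n"` split at offset 5, the record `bbbb` crosses the boundary but is emitted only by the first split; the second split skips its partial beginning and emits only `cc`.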

The number of reduce tasks is set by the client that submits the MapReduce job.
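In the Java API this is `job.setNumReduceTasks(n)`, and each map output key is then routed to a reducer by a partitioner; the default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. A sketch of that default routing, using a replica of Java's `String.hashCode()` as the stand-in hash:

```python
def java_string_hashcode(s: str) -> int:
    """Replicate Java's String.hashCode() (32-bit signed arithmetic)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key: str, num_reduces: int) -> int:
    # Mask off the sign bit, as HashPartitioner does, then take modulo,
    # so a given key always lands on the same reducer.
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_reduces
```

Because the assignment depends only on the key and the reduce count, all values for one key arrive at one reducer, which is what makes grouping in the reduce phase work.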

As of Hadoop 2 and YARN, the Hadoop 1.x image (JobTracker/TaskTracker) is outdated.

OneCricketeer