  1. Let's say a 64 MB block is on node A and replicated to two other nodes (B, C), and the input split size for the MapReduce program is 64 MB. Will this split have a location only for node A, or for all three nodes A, B, and C?
  2. Since the data is local to all three nodes, how does the framework decide which node runs the map task?
  3. How is it handled if the input split size is larger or smaller than the block size?
Manjunath Ballur
    Possible duplicate of [Hadoop input split size vs block size](http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size) – Rijul Nov 30 '16 at 12:12

1 Answer


Hadoop knows where the blocks are located: the input split records the hosts of every replica of its block, so in your example it lists A, B, and C, not just A. If the split is exactly equal to one block, Hadoop tries to run the map task on one of those nodes, applying the "data locality" principle and avoiding network transfer.
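To make the "split carries all replica locations" point concrete, here is a minimal sketch in plain Python rather than the Hadoop API. `Block`, `Split`, and `make_splits` are hypothetical names for illustration. `compute_split_size` mirrors the formula used by `FileInputFormat`, max(minSize, min(maxSize, blockSize)). Taking the hosts from the block that contains the split's first byte is a simplifying assumption, and the sketch ignores other details of the real implementation.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class Block:
    offset: int    # byte offset of the block within the file
    length: int    # block length in bytes
    hosts: tuple   # nodes holding a replica of this block


@dataclass(frozen=True)
class Split:
    start: int
    length: int
    hosts: tuple   # preferred nodes for the map task


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Same shape as FileInputFormat's formula: max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))


def make_splits(file_length, blocks, split_size):
    # Assumption: a split inherits the hosts of the block containing its first byte.
    splits = []
    start = 0
    while start < file_length:
        length = min(split_size, file_length - start)
        block = next(b for b in blocks if b.offset <= start < b.offset + b.length)
        splits.append(Split(start, length, block.hosts))
        start += length
    return splits


# One 64 MB block replicated on A, B and C, with a 64 MB split size.
blocks = [Block(0, 64 * MB, ("A", "B", "C"))]
splits = make_splits(64 * MB, blocks, compute_split_size(64 * MB))
```

With these inputs `splits` holds a single split whose `hosts` are all three replica nodes, which is what lets the scheduler treat any of them as a local candidate.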

If A, B, and C are all available, the data is local to each of them, so the map task can run on any of the three; typically it goes to whichever one has free capacity when the scheduler hands out the task. If node A is not available, it runs on B or C.

If none of A, B, or C is available, Hadoop looks for a free node on the same rack as one of the replicas (because intra-rack transfers are faster). If the whole rack is busy, it has no choice but to pick a node on a different rack to process the split. In both cases the split is not local, so the chosen node reads its data over the network from one of the replica nodes while the task runs.
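The preference order described above (node-local, then rack-local, then off-rack) can be sketched as a small selection function. This is an illustration of the idea only, not the scheduler's actual code: `pick_node`, `rack_of`, and the node and rack names are hypothetical, and a real scheduler also weighs queue capacity, delays, and other factors.

```python
def pick_node(replica_hosts, free_nodes, rack_of):
    """Return (node, locality) for a map task, or (None, None) if nothing is free."""
    # 1. Node-local: a free node that already holds a replica of the split's block.
    for node in free_nodes:
        if node in replica_hosts:
            return node, "node-local"

    # 2. Rack-local: a free node on the same rack as any replica.
    replica_racks = {rack_of[h] for h in replica_hosts}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # 3. Off-rack: any free node; the data crosses racks over the network.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, None


# Hypothetical cluster: replicas on A, B (rack r1) and C (rack r2).
rack_of = {"A": "r1", "B": "r1", "C": "r2", "D": "r1", "E": "r3"}
replicas = ("A", "B", "C")

choice_local = pick_node(replicas, ["E", "D", "B"], rack_of)  # B holds a replica
choice_rack = pick_node(replicas, ["E", "D"], rack_of)        # D shares rack r1
choice_off = pick_node(replicas, ["E"], rack_of)              # only E is free
```

The order of the checks is the whole point: a cheaper locality level is always tried before a more expensive one.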

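If, however, the input split is larger than the block size, the same principle applies. The only difference is that the node processes 'split + a few lines from block 2': the reader continues past the block boundary to finish the record it is in the middle of, and any part that lies in a block not stored on that node is read over the network.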
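The "split + a few lines from the next block" behaviour can be illustrated with a small line reader over byte ranges. This is a sketch modelled on how line-oriented readers commonly handle split boundaries: every split except the first discards its first (possibly partial) line, and every split keeps reading lines as long as the line starts at or before the split's end. `read_split` is a hypothetical helper, not a Hadoop function, and the 10-byte splits are chosen only to force lines across boundaries.

```python
def read_split(data: bytes, start: int, length: int) -> list:
    """Return the lines that belong to the byte range [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Discard up to and including the first newline: that line belongs
        # to the previous split, which reads past its own end to finish it.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1

    records = []
    # Read whole lines while the line starts at or before the split's end,
    # even if the line's bytes extend into the next split or block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        nxt = len(data) if nl == -1 else nl + 1
        records.append(data[pos:nxt].rstrip(b"\n"))
        pos = nxt
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"   # 26 bytes
split_size = 10
per_split = [
    read_split(data, start, min(split_size, len(data) - start))
    for start in range(0, len(data), split_size)
]
all_records = [rec for recs in per_split for rec in recs]
```

Each line is read by exactly one split, even though "bravo" straddles the first boundary, so no record is lost or duplicated regardless of where the split boundaries fall.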

kashmoney