
I understand that the Resource Manager sends the MapReduce program to each Node Manager so that MapReduce gets executed on each node.

But after seeing this image, I am getting confused about where the actual Map and Reduce tasks are executed and how the shuffling happens between Data Nodes.

Isn't it a time-consuming process to sort and shuffle/send data across different Data Nodes to perform the Reduce job? Please explain.

Also, let me know what the Map Node and Reduce Node are in this diagram. Image src: http://gppd-wiki.inf.ufrgs.br/index.php/MapReduce

[Image: MapReduce data flow diagram showing Map Nodes and Reduce Nodes]

logan
  • Check out this [link](http://stackoverflow.com/questions/22141631/what-is-the-purpose-of-shuffling-and-sorting-phase-in-the-reducer-in-map-reduce) – laurentgir Apr 21 '15 at 13:26
  • @oftata: that link explains map reduce, but I asked where map reduce is actually happening. – logan Apr 21 '15 at 14:20
  • I agree, but there is a link in the answer to the Yahoo tutorial which answers your question. – laurentgir Apr 22 '15 at 05:18

1 Answer


An input split is a logical chunk of a file stored on HDFS. By default, an input split represents one block of the file, and the blocks of a file may be stored on many data nodes across the cluster.
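
For illustration, here is a minimal, self-contained sketch of how the number of splits (and hence map tasks) follows from the block size. The formula mirrors Hadoop's FileInputFormat.computeSplitSize(); the class name and the file/block sizes are made-up example values:

```java
// Sketch of FileInputFormat-style split sizing (not the actual Hadoop source).
public class SplitSizeSketch {

    // Same formula as FileInputFormat.computeSplitSize().
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // typical HDFS block size (128 MB)
        long fileSize  = 300L * 1024 * 1024; // hypothetical 300 MB input file

        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division

        // 300 MB file / 128 MB blocks => 3 splits, so 3 map tasks for this file.
        System.out.println("split size = " + splitSize + " bytes, splits = " + numSplits);
    }
}
```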

A container is a bundle of execution resources (memory and CPU) allocated by the Resource Manager on one of the data nodes in order to execute a Map or Reduce task.

First, the Map tasks get executed by containers on the data nodes; the Resource Manager allocates each container as near as possible to its input split's location, adhering to the rack awareness policy (local / rack-local / DC-local).
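
Each map task simply runs the job's Mapper class over its assigned split. A minimal word-count-style mapper (for illustration only; the class and field names are my own) shows what such a container actually executes:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One Mapper instance processes one input split, record by record.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token; the framework then partitions
        // and sorts these outputs before the reducers fetch them.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```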

The Reduce tasks will be executed by containers on any data nodes (there is no locality constraint for reducers), and each reducer copies its relevant portion of the data from every mapper through the shuffle/sort process.

The mappers prepare their results in such a way that the output is internally partitioned and, within each partition, the records are sorted by key; the partitioner determines which reducer should fetch each partition.
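
For illustration, here is a partitioner with the same logic as Hadoop's default HashPartitioner (in practice you would just use HashPartitioner itself; this sketch only makes the routing explicit):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors Hadoop's default HashPartitioner: the key's hash, masked to be
// non-negative, modulo the number of reduce tasks picks the partition,
// and therefore the reducer that will fetch this record.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```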

During shuffle and sort, the reducers copy their relevant partitions from every mapper's output over HTTP; eventually, every reducer merge-sorts the copied partitions and prepares a final single sorted stream before its reduce() method is invoked.
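
This is what the merge-sort buys the reducer: by the time reduce() is called, all values for a key arrive together and the keys arrive in sorted order. A minimal word-count reducer (again for illustration; names are my own) relies on exactly that guarantee:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// reduce() is called once per key; the shuffle has already grouped all
// of that key's values together from every mapper's output.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```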

The image below may give more clarification. [Image src: http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/]

[Image: Hadoop MapReduce shuffle/sort data flow (Hadoop v1 architecture with JobTracker and TaskTrackers)]

suresiva
  • Thanks, but you have given the Hadoop v1.0 architecture; v2.0 does not have Job and Task Trackers. – logan Apr 24 '15 at 12:00
  • Yes, you are correct... I hope it at least helps you understand the flow of shuffle/sort between mapper and reducer... – suresiva Apr 24 '15 at 13:23