
I am confused since I have found two answers for it.

1) Hadoop: The Definitive Guide, 3rd edition, Chapter 6, "The Map Side", says: "Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

2) The Yahoo developers tutorial (Yahoo tutorial) says the combiner runs prior to the partitioner.

Can anyone please clarify which runs first?

Ravindra babu
avinash

1 Answer


A MapReduce job may contain one or all of these phases:

  1. Map

  2. Combine

  3. Shuffle and Sort

  4. Reduce

The partitioner fits between the second and third phases.

You can visit this link for more details.

After going through related SE questions & articles,

What runs first: the partitioner or the combiner?

Who will get a chance to execute first, Combiner or Partitioner?

https://sreejithrpillai.wordpress.com/2014/11/24/implementing-partitioners-and-combiners-for-mapreduce/

we can see that opinion is divided.

But logically I feel that

  1. The mapper writes its output to a circular ring buffer in memory.
  2. If the number of reducers is more than 1 and a partitioner is in place, the mapper output is partitioned.
  3. Once the buffer memory is full, the output is spilled to disk.
  4. As per Hadoop: The Definitive Guide, "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

This implies that the partitioner runs first and the combiner then runs on the sorted output data within each partition.
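To make that ordering concrete, here is a minimal sketch in plain Python of what a single spill might look like. It is a simulation under my own assumptions, not Hadoop code: `partition_for`, `spill`, and `sum_combiner` are hypothetical names, and a CRC32 hash stands in for Java's `hashCode()`. The only thing it is meant to show is the order of the steps: partition first, then sort within each partition, then combine.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import zlib


def partition_for(key, num_reducers):
    # Stand-in for a hash partitioner: non-negative hash modulo reducer count.
    # (Assumption: CRC32 replaces Java's hashCode() so results are repeatable.)
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def spill(buffer, num_reducers, combiner=None):
    """Simulate one spill of the in-memory buffer on the map side."""
    # Step 1: divide the buffered records into one partition per reducer.
    partitions = defaultdict(list)
    for key, value in buffer:
        partitions[partition_for(key, num_reducers)].append((key, value))

    spilled = {}
    for pid in sorted(partitions):
        # Step 2: sort by key within the partition.
        records = sorted(partitions[pid], key=itemgetter(0))
        # Step 3: if a combiner is configured, run it on the sorted output.
        if combiner is not None:
            records = [
                (key, combiner(value for _, value in group))
                for key, group in groupby(records, key=itemgetter(0))
            ]
        spilled[pid] = records
    return spilled


def sum_combiner(values):
    # Word-count style combiner: add up the counts for one key.
    return sum(values)


map_output = [("cat", 1), ("dog", 1), ("cat", 1), ("emu", 1), ("dog", 1), ("cat", 1)]
result = spill(map_output, num_reducers=2, combiner=sum_combiner)
for pid, records in result.items():
    print(pid, records)
```

Because the combiner only ever sees records that already share a partition, every key it collapses is guaranteed to be headed to the same reducer, which is why running it after partitioning is the natural order.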

Ravindra babu
  • Thanks Ravindra. Partitions get created according to the number of reducers, right? So my question is: are partitions created on the node where the reduce task is supposed to run? – avinash Feb 05 '16 at 20:21
  • No. The partition number is assigned on the mapper node. Reducers get their values after shuffling and sorting. You may have 100 mappers and 5 reducers. The 5 reducers will get values after all 100 mappers complete execution and the framework copies the output to the reducer nodes. – Ravindra babu Feb 06 '16 at 01:10
  • Let’s take an example. An MR job is run on a cluster of 10 DataNodes. Imagine this job needs 10 mappers and 2 reducers. 1) Let’s say 2 map tasks are running concurrently on each of 5 DataNodes, so we get 10 mappers in total executing simultaneously. 2) The output from each map task (or the combiner result, if a combiner is used) is stored on the local filesystem of each DataNode. 3) This intermediate data needs to be exchanged between the nodes (shuffle phase), sorted, and given to the 2 reduce tasks. So we had 5 DataNodes running map tasks. On which node does partitioning happen, and how many partitions will be created? – avinash Feb 06 '16 at 06:48
  • I hope this example makes my question clearer. Am I missing some logic in it? Your valuable input is much appreciated. – avinash Feb 06 '16 at 06:50