14

As per definition "The Combiner may be called 0, 1, or many times on each key between the mapper and reducer."

I want to know on what basis the MapReduce framework decides how many times the combiner will be launched.

banjara

3 Answers

23

It is simply the number of spills to disk. Whenever the MapOutputBuffer fills up to its spill threshold, its contents are sorted, and the combiner runs as part of that same spill.

You can tune the number of spills to disk with the parameters io.sort.mb, io.sort.spill.percent, io.sort.record.percent - those are also explained in the documentation (books and online resources).

Example for specific numbers of combiner runs:

0 -> no combiner was defined

1 -> a combiner was defined and the MapOutputBuffer filled up once

>1 -> a combiner was defined and the MapOutputBuffer filled up more than once

Note that even if the MapOutputBuffer never fills up completely, this buffer must be flushed at the end of the map stage and thus triggers the combiner to run at least once (if defined).
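As a rough illustration of the 0 / 1 / >1 cases above, here is a small Java sketch that models a single map task under the simplifying assumption that every spill triggers exactly one combiner pass. `SpillModel`, `combinerRuns`, and `recordsPerSpill` are hypothetical names made up for this sketch; real Hadoop sizes the buffer in bytes (io.sort.mb, io.sort.spill.percent), not in records, so treat this as a mental model rather than the framework's logic.

```java
// Simplified model of the spill behaviour described above; NOT Hadoop's implementation.
final class SpillModel {

    /**
     * Counts combiner passes for one map task under the rule
     * "one combiner pass per spill, including the final flush".
     *
     * @param combinerDefined whether the job configured a combiner
     * @param records         number of records the mapper emits
     * @param recordsPerSpill hypothetical number of records buffered before a spill
     */
    static int combinerRuns(boolean combinerDefined, int records, int recordsPerSpill) {
        if (records < 0 || recordsPerSpill <= 0) {
            throw new IllegalArgumentException("records must be >= 0 and recordsPerSpill > 0");
        }
        if (!combinerDefined) {
            return 0; // case "0": no combiner was defined
        }
        // Ceiling division: every full buffer spills once, and whatever is
        // left over is flushed at the end of the map stage.
        // A mapper that emits nothing has nothing to combine, so this yields 0.
        return (records + recordsPerSpill - 1) / recordsPerSpill;
    }

    public static void main(String[] args) {
        System.out.println(combinerRuns(false, 1000, 100)); // no combiner defined
        System.out.println(combinerRuns(true, 50, 100));    // buffer never fills, final flush only
        System.out.println(combinerRuns(true, 250, 100));   // buffer fills more than once
    }
}
```

Lowering `recordsPerSpill` in this model plays the role of shrinking the sort buffer: more spills, and therefore more combiner passes.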

Thomas Jungblut
  • Thanks for the reply. I checked the description of these config parameters in the Definitive Guide, and now I understand the case where the combiner is used multiple times. But I am still missing the case where no combiner is launched. Can you please help me with it? I understand the use case of having an insufficient number of records to combine, but I cannot find the configs for it. – banjara Jun 18 '13 at 09:38
  • Combining will always be called when the buffer needs to be flushed. Thus at the end of the map stage the combiner (if defined) needs to run at least once. What kind of concrete problems are you facing? I edited and added some examples for you – Thomas Jungblut Jun 18 '13 at 09:42
  • 1
    In order to optimize my MR jobs I was thinking about introducing a combiner, so I was reading up on them. I read in many blogs that combiners are not guaranteed to run (http://dataworld.blog.com/2013/04/30/just-a-little-about-combiner-of-mapreduce-framework/). I have one more question: is the combiner launched per mapper or per data node machine? I used to believe the combiner is part of the mapper phase, but http://developer.yahoo.com/hadoop/tutorial/module4.html suggests that the combiner is launched per machine. Can you please help? – banjara Jun 18 '13 at 09:57
  • 1
    The combiner is run within the `MapTask`, so one combiner per mapper that gets called multiple times. – Thomas Jungblut Jun 18 '13 at 10:25
  • @zuxqoj I guess you misunderstood your resources. dataworld states that it is not guaranteed to run for each key. That means that the combine operation can take place for a key two or more times if the key was emitted in different spills of the buffer. Yahoo simply states that a combiner runs within the mapper, although their flow charts seem confusing. – Thomas Jungblut Jun 18 '13 at 10:33
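The point from the comments that a key can be combined more than once when it lands in different spills can be sketched in Java as follows. This is an illustration under stated assumptions, not Hadoop code: `PerKeyCombineModel`, `combineCallsPerKey`, and `recordsPerSpill` are hypothetical names, spills are cut by record count rather than by bytes, and any additional combining during the final merge of spill files is ignored.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model: counts, per key, how many spills contained that key.
// Under the "combiner runs once per spill" assumption, this is the number
// of times the combine operation sees the key within one map task.
final class PerKeyCombineModel {

    static Map<String, Integer> combineCallsPerKey(List<String> emittedKeys, int recordsPerSpill) {
        if (recordsPerSpill <= 0) {
            throw new IllegalArgumentException("recordsPerSpill must be positive");
        }
        Map<String, Integer> calls = new TreeMap<>();
        // Walk the mapper output in emission order, one spill-sized chunk at a time.
        for (int start = 0; start < emittedKeys.size(); start += recordsPerSpill) {
            int end = Math.min(start + recordsPerSpill, emittedKeys.size());
            // Within one spill, each distinct key is combined once.
            Set<String> keysInSpill = new HashSet<>(emittedKeys.subList(start, end));
            for (String key : keysInSpill) {
                calls.merge(key, 1, Integer::sum);
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "a", "c", "a", "b");
        // With two records per spill, the spills are [a, b], [a, c], [a, b].
        System.out.println(combineCallsPerKey(keys, 2));
    }
}
```

Because the same key may pass through the combiner a varying number of times, the combine function has to give the same final result regardless of how often it is applied.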
4

First of all, Thomas Jungblut's answer is great and I gave it my upvote. The only thing I want to add is that the combiner, if defined, will always run at least once per mapper unless the mapper output is empty or is a single pair. So a mapper in which the combiner is never executed is possible but highly unlikely.

Lan
1

The source code contains the condition that decides whether the combiner is invoked.

Line 1950 - Line 1955 https://github.com/apache/hadoop/blob/0b8a7c18ddbe73b356b3c9baf4460659ccaee095/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java

 if (combinerRunner == null || numSpills < minSpillsForCombine) {
     Merger.writeFile(kvIter, writer, reporter, job);
 } else {
     combineCollector.setWriter(writer);
     combinerRunner.combine(kvIter, combineCollector);
 }

So in this code path the combiner runs only if:

It is defined, and the number of spills is at least minSpillsForCombine. minSpillsForCombine is driven by the property "mapreduce.map.combine.minspills", whose default value is 3.
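To make the branch in the quoted snippet easier to read, it can be restated as a standalone Java predicate. This is only a restatement for illustration: `CombineBranchCheck` and `takesCombineBranch` are hypothetical names, and the boolean `combinerDefined` stands in for the `combinerRunner != null` check in the snippet.

```java
// Restates the if/else from the quoted MapTask snippet as a predicate.
// Hypothetical helper for illustration; not part of Hadoop.
final class CombineBranchCheck {

    // Default of "mapreduce.map.combine.minspills" as stated in the answer.
    static final int DEFAULT_MIN_SPILLS_FOR_COMBINE = 3;

    /**
     * Returns true when the snippet would take the else-branch and call
     * combinerRunner.combine(...), i.e. the negation of
     * (combinerRunner == null || numSpills < minSpillsForCombine).
     */
    static boolean takesCombineBranch(boolean combinerDefined, int numSpills, int minSpillsForCombine) {
        return combinerDefined && numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        int min = DEFAULT_MIN_SPILLS_FOR_COMBINE;
        System.out.println(takesCombineBranch(true, 3, min));  // combiner defined, enough spills
        System.out.println(takesCombineBranch(true, 2, min));  // combiner defined, too few spills
        System.out.println(takesCombineBranch(false, 5, min)); // no combiner defined
    }
}
```

Note that the boundary case matters: with the default of 3, exactly three spills already satisfies the condition, because the snippet skips the combiner only when numSpills is strictly less than minSpillsForCombine.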

Pradeep Bhadani