
There have been quite a few questions like this one already, with conflicting answers. I've also found conflicting statements in the literature and on blogs. The book Hadoop: The Definitive Guide says:

Hadoop does not provide a guarantee of how many times it will call [the combiner] for a particular map output record, if at all. In other words, calling the combiner function zero, one or many times should produce the same output from the reducer

The answers to a similar question, "On what basis mapreduce framework decides whether to launch a combiner or not", suggest that a combiner, if defined, will always be called once, as the MapOutputBuffer needs to be flushed.

There might be an edge case where the mapper emits only once, meaning the combiner, even if defined, won't run.

My question is this: Is there a definitive source for the answer to this question? I've searched the Hadoop documentation, of course, but can't find anything.

Kevin
  • Don't you already have it? The extract from the Definitive guide explains it well. – franklinsijo Apr 13 '17 at 13:36
  • It's just that it contradicts what I've read elsewhere (the linked answer, for example). Are those who say the combiner is guaranteed to run wrong? – Kevin Apr 13 '17 at 13:46

1 Answer


The Hadoop framework aims to give users and developers an easy interface for writing code that runs in a distributed environment, without having to think about or handle the complexity of distributed systems.

To answer your question, you can read the source code, which contains the logic that decides whether to invoke the combiner.

See lines 1950-1955 of MapTask.java: https://github.com/apache/hadoop/blob/0b8a7c18ddbe73b356b3c9baf4460659ccaee095/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java

 if (combinerRunner == null || numSpills < minSpillsForCombine) {
     Merger.writeFile(kvIter, writer, reporter, job);
 } else {
     combineCollector.setWriter(writer);
     combinerRunner.combine(kvIter, combineCollector);
 }

So the combiner won't run if:

  • it is not defined, or
  • the number of spills is less than minSpillsForCombine. minSpillsForCombine is driven by the property "mapreduce.map.combine.minspills", whose default value is 3.

Most Hadoop properties are configurable, so the behaviour and performance depend on how you configure them.
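To make the quoted condition easier to play with, here is a minimal sketch in plain Java. It is not Hadoop code: the class name `QuotedBranchCheck` and the method `combinerRunsInQuotedBranch` are made up for illustration, and the sketch only models the `if` statement quoted above, nothing else in MapTask.java.

```java
// Toy model of the branch quoted from MapTask.java. Not Hadoop code;
// all names here are hypothetical and exist only for illustration.
final class QuotedBranchCheck {

    // Default of "mapreduce.map.combine.minspills", as stated in the answer.
    static final int DEFAULT_MIN_SPILLS_FOR_COMBINE = 3;

    // Negation of: combinerRunner == null || numSpills < minSpillsForCombine
    static boolean combinerRunsInQuotedBranch(boolean combinerDefined,
                                              int numSpills,
                                              int minSpillsForCombine) {
        return combinerDefined && numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        // Show how the outcome changes with the number of spills.
        for (int spills = 1; spills <= 4; spills++) {
            boolean runs = combinerRunsInQuotedBranch(
                    true, spills, DEFAULT_MIN_SPILLS_FOR_COMBINE);
            System.out.println(spills + " spill(s): combiner in quoted branch = " + runs);
        }
    }
}
```

Lowering the threshold passed as `minSpillsForCombine` makes the `else` branch reachable with fewer spills, which is the effect the property would have on this particular check.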

Hope this answers your question.

Pradeep Bhadani
  • Thanks Pradeep, that is part of the answer, but when the combiner is run on the spill files, that is not the first time it is run. It is also run when data is written to the spill file from memory. Thank you for pointing me to the source code, though. That is a real help. – Kevin Apr 13 '17 at 14:29
  • @kevin your comment is not clear to me. What do you mean when you say "but when the combiner is run on the spill files, that is not the first time it is run. It is also run when data is written to the spill file from memory"? What else are you looking for? You can upvote or accept if you are happy with the answer. – Pradeep Bhadani Apr 13 '17 at 15:06
  • The combiner is run when data is written from memory to a spill file, but the minSpillsForCombine property controls whether or not the spill file contents are run through the combiner an additional time. So the combiner could still run even if the number of spills is less than minSpillsForCombine. I think, therefore, that your second bullet point is not correct. I'm taking this from page 209 in the Definitive Guide. – Kevin Apr 13 '17 at 15:15
  • "So the combiner could still run even if the number of spills is less than minSpillsForCombine." Can you explain this with an example you came across? – Pradeep Bhadani Apr 13 '17 at 15:43
  • Let's say minSpillsForCombine=3 and during the map phase, the memory buffer becomes full. This triggers a spill file to be written, and part of that process is to run the combiner. It has now run once. Then, let's say the map phase completes, so only 1 spill has occurred. The number of spills is not enough to trigger an additional run of the combiner. So, in this situation, the combiner ran even though the number of spills was less than 3. – Kevin Apr 13 '17 at 16:02
  • What's the value of the property "mapreduce.map.combine.minspills" in your cluster? – Pradeep Bhadani Apr 18 '17 at 10:50
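
The scenario Kevin describes in the comments can be written down as a small toy model. This is only a sketch under his stated assumptions, not Hadoop code: it assumes the combiner gets one pass each time a spill file is written and one further pass over the spill files only when the number of spills reaches minSpillsForCombine. The class name `CombinerRunCount` and the method `combinerPasses` are hypothetical, and counting one pass per spill is a simplification.

```java
// Toy model of the scenario from the comments. Not Hadoop code; the names
// and the "one pass per spill" simplification are assumptions.
final class CombinerRunCount {

    // Counts combiner passes under the assumptions described in the comments:
    // one pass per spill written, plus one pass at merge time if the number
    // of spills is at least minSpillsForCombine.
    static int combinerPasses(boolean combinerDefined,
                              int numSpills,
                              int minSpillsForCombine) {
        if (!combinerDefined) {
            return 0; // no combiner configured, nothing to run
        }
        int passes = numSpills; // assumed: one pass while writing each spill
        if (numSpills >= minSpillsForCombine) {
            passes += 1; // assumed: one extra pass over the spill files
        }
        return passes;
    }

    public static void main(String[] args) {
        // Kevin's example: minSpillsForCombine = 3 and a single spill.
        System.out.println("passes = " + combinerPasses(true, 1, 3));
    }
}
```

Under these assumptions, a single spill with a threshold of 3 still gives one combiner pass, which is the point of disagreement with the answer's second bullet.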