
I have a long-running MapReduce job in which some mappers take considerably more time than others.

Checking the stats on the web interface, I saw that my combiner also kicked in on the reducers (which were mostly idle, as just 2 mappers were still running).

Although it seems reasonable not to waste time and to do some pre-aggregation until all mappers have finished, I cannot find any documentation for this behaviour. Can anyone confirm whether this is indeed a feature of Hadoop, or whether it is just displayed incorrectly on the web interface?

dominik

1 Answer


The combiner starts once the mapper has emitted a reasonable amount of data. Note that a combiner (typically) aggregates a mapper's output, so it runs on the map side and not on the reduce side. More details can be found here.
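
To make the pre-aggregation idea concrete, here is a minimal sketch of what a word-count style combiner does to one mapper's output. It is plain Java and deliberately does not use the Hadoop API; the class `CombinerSketch` and the method `combine` are hypothetical names chosen for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: plain Java, not the Hadoop combiner API.
final class CombinerSketch {

    // Sums the values of all records that share a key, the way a
    // word-count style combiner would on one mapper's buffered output.
    static Map<String, Integer> combine(String[] keys, int[] values) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values must have the same length");
        }
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            // merge() inserts the value, or adds it to the existing sum for that key.
            combined.merge(keys[i], values[i], Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "c", "b", "a"};
        int[] values = {1, 1, 1, 1, 1, 1};
        // Six map output records shrink to three; prints {a=3, b=2, c=1}
        System.out.println(combine(keys, values));
    }
}
```

The point is only that fewer, already summed records have to be transferred to the reducers afterwards.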

Also, the reducers can start fetching the data emitted by the mappers (and only fetching it) before all the mappers have finished. This is called the shuffle phase of the reducer. You can control when the reducers start gathering data by setting the mapred.reduce.slowstart.completed.maps property (or mapreduce.job.reduce.slowstart.completedmaps in newer versions). More details on this SO post.
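
As a rough illustration of what that property means: its value is a fraction between 0 and 1 of the map tasks that must have completed before reducers are scheduled. The sketch below assumes the threshold is the ceiling of the fraction times the number of maps; that is my assumption about the scheduler's arithmetic, not something confirmed above. `SlowstartSketch` and `mapsBeforeReducersStart` are hypothetical names, and the code has no Hadoop dependency.

```java
// Hypothetical illustration only: plain Java, no Hadoop dependency.
final class SlowstartSketch {

    // Property name as given above for newer Hadoop versions.
    static final String SLOWSTART_PROPERTY =
            "mapreduce.job.reduce.slowstart.completedmaps";

    // ASSUMPTION: threshold = ceil(fraction * totalMaps).
    // Returns how many maps must finish before reducers may be scheduled.
    static int mapsBeforeReducersStart(double fraction, int totalMaps) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        if (totalMaps < 0) {
            throw new IllegalArgumentException("totalMaps must be non-negative");
        }
        return (int) Math.ceil(fraction * totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 10;
        for (double fraction : new double[] {0.0, 0.5, 1.0}) {
            System.out.println(SLOWSTART_PROPERTY + "=" + fraction
                    + " -> reducers may start after "
                    + mapsBeforeReducersStart(fraction, totalMaps)
                    + " of " + totalMaps + " maps");
        }
    }
}
```

Under that assumption, a value of 1.0 would keep reducers from being scheduled until every mapper has finished, while a small value lets them start shuffling early and then wait for the slow mappers.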

vefthym
    Thanks for your answer! I am aware that, according to the docs, combiners run only on mappers. However, I saw counters increasing in both columns (mapper AND reducer) of the web UI in the combiner input/output row, hence the confusion. (This job was running on EMR (2.4.7) with 70 cr1.8xlarge (32-core) instances.) – dominik May 06 '15 at 18:49