
I'm new to hadoop and mapreduce. Could someone clarify the difference between a combiner and an in-mapper combiner or are they the same thing?

Billy02

1 Answer

You are probably already aware that a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it is shuffled across the network to the various reducers in the cluster.

The in-mapper combiner takes this optimization a bit further: the aggregated data is not even written to local disk; the aggregation happens in memory, inside the Mapper itself.

The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods of

org.apache.hadoop.mapreduce.Mapper

to create an in-memory map along the following lines:

private Map<LongWritable, Text> inmemMap;

@Override
protected void setup(Mapper.Context context) throws IOException, InterruptedException {
    inmemMap = new HashMap<LongWritable, Text>();
}

Then during each map() invocation you add values to that in-memory map (instead of calling context.write() for each value). Finally, the Map/Reduce framework will automatically call:

@Override
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException {
    for (LongWritable key : inmemMap.keySet()) {
        // do some aggregation on the inmemMap
        Text myAggregatedText = doAggregation(inmemMap.get(key));
        context.write(key, myAggregatedText);
    }
}

Notice that instead of calling context.write() every time, you add entries to the in-memory map. Then in the cleanup() method you call context.write(), but with the condensed/pre-aggregated results from your in-memory map. As a result, your local map output spill files (which will be read by the reducers) will be much smaller.

In both cases - the in-mapper combiner and the external combiner - you gain the benefit of less network traffic to the reducers thanks to smaller map spill files. That also reduces the processing the reducers have to do.
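To make the pattern concrete, here is a minimal sketch in plain Java, without the Hadoop classes, assuming a hypothetical word-count job: `map()` and `cleanup()` stand in for the Mapper methods, and the `emitted` map stands in for context.write(). Instead of emitting (word, 1) for every token, counts are accumulated in memory and one (word, total) pair is emitted per key at the end.

```java
import java.util.HashMap;
import java.util.Map;

public class InMapperCombineSketch {
    // in-memory aggregation map, created once per mapper (as in setup())
    final Map<String, Integer> inmemMap = new HashMap<>();
    // stand-in for the output channel, i.e. context.write()
    final Map<String, Integer> emitted = new HashMap<>();

    // Stands in for Mapper.map(): aggregate instead of writing each pair
    void map(String line) {
        for (String word : line.split("\\s+")) {
            inmemMap.merge(word, 1, Integer::sum);
        }
    }

    // Stands in for Mapper.cleanup(): emit the condensed results once
    void cleanup() {
        for (Map.Entry<String, Integer> e : inmemMap.entrySet()) {
            emitted.put(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        InMapperCombineSketch m = new InMapperCombineSketch();
        m.map("the quick brown fox");
        m.map("the lazy dog the end");
        m.cleanup();
        // 8 input tokens condensed to 7 output pairs; "the" counted once with total 3
        System.out.println(m.emitted.get("the"));
        System.out.println(m.emitted.size());
    }
}
```

Eight (word, 1) pairs shrink to seven pre-aggregated pairs here; on real data with skewed keys the reduction is far larger. Note that in a real Mapper you must also watch the size of the in-memory map, flushing it early if it grows too large.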

WestCoastProjects
  • This is right, but I'd add something else: the Combiner might not be run by Hadoop at all. The in-mapper combiner always runs, because it is part of the map function. The combiner is only an optimization hint for Hadoop. The best discussion of this that I have read is in [this book](http://lintool.github.io/MapReduceAlgorithms/), section 3.1. – Tuxman Jan 30 '15 at 21:31
  • I don't understand you. I read [this](http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201107.mbox/%3C374D8F3F-B8B1-499F-BEDB-BFEE3219010C@hortonworks.com%3E), and my understanding is that if you set the combiner, this step may or may not be run, depending on other parameters handled by Hadoop at runtime. I'm not an expert, so I would appreciate a reference (code or article) that explains your point in more detail. – Tuxman Jan 30 '15 at 23:55
  • @Tuxman I just realized that I had commented on the wrong answer (I have another active answer about Hadoop Streaming with a combiner). So yes, you are correct in this case: running the Combiner is only optional. – WestCoastProjects Jan 31 '15 at 00:05