
I'm getting confused after reading the passage below from Hadoop: The Definitive Guide, 4th edition (page 204):

  • Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.

  • Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.

  • Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

Here are my doubts:

1) Which executes first, the combiner or the partitioner?

2) When both a custom combiner and a custom partitioner are present, what is the order of execution?

3) Can we feed compressed data (Avro, sequence files, etc.) to a custom combiner? If yes, how?

Looking for a brief but in-depth explanation.

Thanks in advance.

Prashant

5 Answers


1/ The answer is already given in this part: "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

So first the partitions are created in memory; if there is a custom combiner, it is executed in memory, and the result is spilled to disk at the end.

2/ A custom combiner and a custom partitioner take effect when they are specified in the driver class:

job.setCombinerClass(MyCombiner.class);
job.setPartitionerClass(MyPartitioner.class);

If no custom combiner is specified, no combiner is executed. If no custom partitioner is specified, the default "HashPartitioner" is used (see page 221 for that).
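For reference, the default HashPartitioner boils down to a single modulo expression. Here is a plain-Java sketch of that formula (the class and method names here are mine, not Hadoop's; the real class is `org.apache.hadoop.mapreduce.lib.partition.HashPartitioner`):

```java
// Plain-Java sketch of the formula used by Hadoop's default HashPartitioner:
// mask off the sign bit of the key's hash, then take it modulo the number
// of reduce tasks. Not the Hadoop API, just the arithmetic.
public class HashPartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Same key, same partition: every value for a given key
        // is routed to the same reducer.
        int p1 = getPartition("hadoop", 4);
        int p2 = getPartition("hadoop", 4);
        System.out.println(p1 == p2); // prints "true"
    }
}
```

The `& Integer.MAX_VALUE` guards against negative hash codes, which would otherwise produce a negative partition index.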

3/ Yes, it is possible. Don't forget that the combiner works the same way as the reducer, and the reducer can consume compressed data. If the combiner consumes compressed data, that means the input file format is compressed. For that, you can specify the input format in the driver class:

Sequence File case: job.setInputFormatClass(SequenceFileInputFormat.class);
Avro File case: job.setInputFormatClass(AvroKeyInputFormat.class); 
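As a side illustration of the decompression idea (plain `java.util.zip`, not Hadoop's codecs or the input formats above): compressed bytes are inflated in memory before any record-level logic sees them, which is conceptually what happens when combiner/reducer logic runs over compressed data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Round-trips a string through GZIP entirely in memory. Illustrative only:
// Hadoop uses its own codec classes, but the principle is the same --
// the data is decompressed before record-level logic processes it.
public class InMemoryDecompressSketch {
    static byte[] compress(String text) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static String decompress(byte[] data) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] packed = compress("the cat the dog");
        // Record-level logic (e.g. a combiner) would see this decompressed form.
        System.out.println(decompress(packed)); // prints "the cat the dog"
    }
}
```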
Saad Lamarti
  • I do agree that if we set both a custom partitioner and a custom combiner, both will execute. In that case, what about the built-in hash partitioner and combiner: will they also get a chance to execute, or will the custom components override the built-in behaviour? As per the post below, with the built-in partitioner and combiner the data is first partitioned and then reduced by the combiner, as per the book, while as per you, with a custom combiner and partitioner, the combiner comes first and then the partitioner. Please correct me if there is any gap in my understanding. – Prashant Aug 20 '15 at 10:37
  • A custom partitioner will override the default hash partitioner if it is available. But if there is no custom combiner, there is no default combiner to be executed. For the second point, the order of execution is the partitioner in the first position and the combiner in the second position. – Saad Lamarti Aug 20 '15 at 10:47
  • Thanks for the response and patience @Saad Lamarti. One more question: it means that if both a custom partitioner and a custom combiner are present, it will follow the same flow, first the custom partitioner and then the custom combiner, right? But in one of the posts people claim differently :( http://stackoverflow.com/questions/22061210/what-runs-first-the-partitioner-or-the-combiner?rq=1 – Prashant Aug 20 '15 at 11:39
  • As I said in my first comment: "... So first the partitions are created in memory; if there is a custom combiner, it will be executed in memory, and the result will be spilled to disk at the end." What is shown in the diagram on the other post concerns the partitioner output, and the phase where the partitions are created in memory doesn't appear in that diagram. That is why you see the combiner in the first position and the partitioner in the second position there. – Saad Lamarti Aug 20 '15 at 11:47
  • If I can summarize, the order of execution is: partitioner in-memory execution -> combiner execution -> partitioner outputs spilled to disk separately. – Saad Lamarti Aug 20 '15 at 11:50
  • Great, your inputs are really helpful for me @Saad :) – Prashant Aug 20 '15 at 12:31
  • So when feeding compressed data (like Avro, etc.) to the custom combiner, it will be decompressed in memory and then acted on accordingly, right? – Prashant Aug 20 '15 at 12:36
  • Right, the compressed data will be decompressed in memory. – Saad Lamarti Aug 20 '15 at 12:56
  • @SaadLamarti The job.setInputFormatClass(SomeClass.class) is used to define the input format for the mapper. The input to the combiner is the output of the mapper task. We need not define it anywhere; we just set the combiner class using job.setCombinerClass(CustomCombiner.class). – YoungHobbit Aug 20 '15 at 17:28
  • @abhishekbafna The job.setInputFormatClass defines both the input and the output format for the mapper. You can't define the output format; what you can define is the type of the key and value outputs. – Saad Lamarti Aug 20 '15 at 19:39
  • @Saad and Abhishek: It means that the mapper input should also be compressed (Avro, sequence, etc.) when feeding the mapper's compressed output data to the combiner, because there is no provision to set a specific file format for the combiner class. Correct me if I'm getting anything wrong :) – Prashant Aug 21 '15 at 04:41
  • Right, there is no way to override the file format for the combiner. Otherwise you can specify the type of the combiner's input key and value as you want, by way of the mapper's output key and value types. – Saad Lamarti Aug 21 '15 at 07:35

The direct answer to your question is => COMBINER

Details: Combiners can be viewed as mini-reducers in the map phase. They perform a local reduce on the mapper results before those results are distributed further. Once the combiner has executed, the output is passed on to the reducer for further work.

whereas

The partitioner comes into the picture when we are working with more than one reducer. The partitioner decides which reducer is responsible for a particular key. It takes the mapper result (or the combiner result, if a combiner is used) and sends it to the responsible reducer based on the key.

For a better understanding you can refer to the following image, which I have taken from the Yahoo Developer Tutorial on Hadoop: Figure 4.6: Combiner step inserted into the MapReduce data flow.

Here is the tutorial.
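The "mini-reducer" idea can be sketched in plain Java (this is not the Hadoop Reducer API, just the local-aggregation logic for a word count, with names of my choosing): the combiner applies the same summing logic map-side so fewer (key, value) pairs cross the network.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of what a word-count combiner does with one mapper's
// output: it applies the reduce logic (summing counts) locally, collapsing
// repeated keys before the shuffle. Not the Hadoop API.
public class CombinerSketch {
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>(); // sorted, like spill output
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // The mapper emitted ("the", 1) three times; the combiner collapses
        // them into a single ("the", 3) before anything is sent to a reducer.
        List<Map.Entry<String, Integer>> out = List.of(
            Map.entry("the", 1), Map.entry("cat", 1),
            Map.entry("the", 1), Map.entry("the", 1));
        System.out.println(combine(out)); // prints {cat=1, the=3}
    }
}
```

Four pairs in, two pairs out: that shrinkage is exactly the "less data to write to local disk and to transfer to the reducer" the book describes.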

Radhey
  • @Radhey: Thanks for the response. One point I want to understand, which is killing me: with the built-in partitioner and combiner, the data is first partitioned and then reduced by the combiner, as per the book, while as per you, with a custom combiner and partitioner, the combiner comes first and then the partitioner. Is that correct? Please correct me if there is any gap in my understanding. – Prashant Aug 20 '15 at 10:35

This is the complete MR job flow. Your 1) and 2) are answered here.

  1. The mapper reads the data and processes it. This output goes to an intermediate output file.
  2. Once the mapper finishes emitting all its key/value pairs, the intermediate output is partitioned into 'R' partitions using either the default 'HashPartitioner' or a custom partitioner.
  3. Each partitioned file is sorted.
  4. Any optional combiner code is executed on the sorted 'R' partitions. The combiner step runs only if it is specified.
  5. Reducers reach out to the mappers and pull their appropriate partitioned files.
  6. After all the mapper tasks have completed and all the intermediate data has been copied to the reducers, the reducers perform one more sort on the data.
  7. Then the reducers work on their individual key/value pairs one by one.
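The steps above can be sketched as a toy, single-process word count in plain Java (not the Hadoop API and with no real parallelism; all names are mine, and it only illustrates the ordering map -> partition -> sort -> combine -> shuffle -> reduce):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process simulation of the MR word-count flow described above.
public class MrFlowSketch {
    static final int R = 2; // number of reducers, i.e. number of partitions

    public static Map<String, Integer> run(List<String> lines) {
        // Steps 1-2: map each line to (word, 1) pairs, routed into R
        // partitions with the hash-partitioner formula.
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < R; i++) partitions.add(new ArrayList<>());
        for (String line : lines)
            for (String word : line.split("\\s+"))
                partitions.get((word.hashCode() & Integer.MAX_VALUE) % R)
                          .add(Map.entry(word, 1));

        // Steps 3-4: within each partition, sort by key while applying the
        // combiner logic (summing counts into a key-sorted map).
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> p : partitions) {
            Map<String, Integer> local = new TreeMap<>();
            for (Map.Entry<String, Integer> kv : p)
                local.merge(kv.getKey(), kv.getValue(), Integer::sum);
            combined.add(local);
        }

        // Steps 5-7: each "reducer" pulls its partition and produces
        // the final per-key sums (merged here for a single result map).
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, Integer> part : combined)
            part.forEach((k, v) -> result.merge(k, v, Integer::sum));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the cat", "the dog")));
        // prints {cat=1, dog=1, the=2}
    }
}
```

Note how the partition assignment (steps 1-2) happens before the combiner runs (step 4), matching the order in the list above.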

Answer 3: Yes, the combiner can process compressed data. The combiner function runs on the output of the map phase and is used as a filtering or aggregating step to lessen the number of intermediate keys being passed to the reducer. In most cases the reducer class is set as the combiner class. The difference lies in the output from these classes: the output of the combiner class is intermediate data that is passed to the reducer, whereas the output of the reducer is written to the output file on disk. The combiner for a job can be set like this:

job.setCombinerClass(CustomCombiner.class);
YoungHobbit

The partitioner runs before the combiner.

a) The mapper processes the data.
b) A partitioner (either default or custom) partitions the data as required, based on keys.
c) Sorting on keys follows, taken care of by the background threads.
d) If a combiner exists, it then runs on the sorted output.
e) The reducer sorts its input data once more, followed by the reduce process.


I would like to summarize the entire flow:

  1. The mapper reads the data and processes it; this output goes to an intermediate output file.
  2. The mapper finishes emitting all its key/value pairs.
  3. The mapper's output is first written to a memory buffer.
  4. When the buffer is about to overflow, it is spilled to a local directory, and the partitions are created in memory ["Within each partition, the background thread performs an in-memory sort by key"; the intermediate output is partitioned into 'R' partitions using either the default 'HashPartitioner' or a custom partitioner].
  5. The spilled data is divided according to the partitioner, and within each partition the result is sorted.
  6. If there is a custom combiner, it is executed in memory, and the result is spilled to disk at the end.
  7. Reducers reach out to the mappers and pull their appropriate partitioned files.
  8. After all the mapper tasks have completed and all the intermediate data has been copied to the reducers, the reducers perform one more sort on the data.
  9. Then the reducers work on their individual key/value pairs one by one.

Please suggest corrections if there is any gap in my understanding.

Pang