Could I rely on mapper's counters in reducers in Hadoop?

Question

Let's consider the case when I change the values of counters in mappers and want to use that information in reducers.

It looks like we have a guarantee that the reduce function won't be called until all mappers have finished. Does this guarantee also cover mappers that are being speculatively executed? Could a reducer see irrelevant values due to speculative execution?

score 2 · Accepted Answer · edited May 23 '17 at 12:07

When Reducers start is determined by the configuration parameter mapreduce.job.reduce.slowstart.completedmaps (in mapred-site.xml). It is set to "0.05" by default, which means the Reducers are scheduled for execution once around 5% of the Mappers have completed.

You can tweak this parameter to achieve different results. For example, setting it to "1.0" ensures that the Reducers are started only after 100% of the Mappers have completed.
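To make the threshold concrete in terms of task counts, here is a minimal Java sketch. It is a simplified model and not Hadoop's scheduler code: the class name SlowStartModel and the method mapsNeededBeforeReduce are hypothetical, and rounding the product up to a whole task is an assumption made for the example.

```java
// Simplified model of the reduce slow-start threshold.
// Assumption: this is NOT Hadoop code; names and rounding are illustrative only.
final class SlowStartModel {

    // Hypothetical helper: how many map tasks must complete before
    // reduce tasks become eligible for scheduling, given the fraction
    // configured in mapreduce.job.reduce.slowstart.completedmaps.
    static int mapsNeededBeforeReduce(int totalMaps, double slowStartFraction) {
        if (totalMaps < 0) {
            throw new IllegalArgumentException("totalMaps must be non-negative");
        }
        if (slowStartFraction < 0.0 || slowStartFraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0.0, 1.0]");
        }
        // Round up so that a partial task counts as a whole one.
        return (int) Math.ceil(slowStartFraction * totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 200;
        // Default setting versus waiting for every map task.
        System.out.println("0.05 -> " + mapsNeededBeforeReduce(totalMaps, 0.05));
        System.out.println("1.0  -> " + mapsNeededBeforeReduce(totalMaps, 1.0));
    }
}
```

In a real job the fraction would come from mapred-site.xml or the job's configuration rather than from a method argument.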

Reducer tasks start copying data from the Mappers that have completed execution. However, the reduce() method is called only after the Reducer has copied the data from all the Mappers.

The question "When do reduce tasks start in Hadoop?" (http://stackoverflow.com/questions/11672676/when-do-reduce-tasks-start-in-hadoop) explains this process clearly.

As for speculative execution, it is triggered only for Mappers/Reducers that are lagging behind the other Mappers/Reducers. Running the same Mapper instance in duplicate does not mean its counters are duplicated. Task counters are maintained for each task attempt. If a task attempt fails or is killed (due to speculative execution), the counters for that attempt are dropped. So speculative execution has no impact on the overall counter value.
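The per-attempt bookkeeping described above can be sketched as a small Java model. This is not Hadoop's implementation: the class AttemptCounterModel, its Attempt type, and the recordsSeen counter are hypothetical names chosen for illustration. The only point it shows is that counters from failed or killed attempts are left out of the job total.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of per-attempt counters.
// Assumption: NOT Hadoop code; all names here are hypothetical.
final class AttemptCounterModel {

    enum State { SUCCEEDED, FAILED, KILLED }

    // One execution attempt of a map task, with the counter value it reported.
    static final class Attempt {
        final String taskId;
        final int attemptNumber;
        final State state;
        final long recordsSeen;

        Attempt(String taskId, int attemptNumber, State state, long recordsSeen) {
            this.taskId = taskId;
            this.attemptNumber = attemptNumber;
            this.state = state;
            this.recordsSeen = recordsSeen;
        }
    }

    // Sum the counter over successful attempts only; counters from
    // failed or killed attempts are dropped.
    static long jobTotal(List<Attempt> attempts) {
        long total = 0;
        for (Attempt a : attempts) {
            if (a.state == State.SUCCEEDED) {
                total += a.recordsSeen;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        // Task m0 ran twice: the slow original attempt was killed
        // after the speculative attempt finished first.
        attempts.add(new Attempt("m0", 0, State.KILLED, 70));
        attempts.add(new Attempt("m0", 1, State.SUCCEEDED, 100));
        attempts.add(new Attempt("m1", 0, State.SUCCEEDED, 50));
        System.out.println("job total = " + jobTotal(attempts));
    }
}
```

The killed attempt of m0 had already counted some records, but because only the successful attempt contributes, the duplicated work does not inflate the total.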

One thing you must remember is that counter values are definitive only once a job has successfully completed.

answered Dec 18 '15 at 16:45

Manjunath Ballur

Hm, I believe that the reduce phase could start before all mappers have finished; the partitioning part, for sure, could start. But I thought that my Java reduce function won't be called before all mappers are done and the partitioning part is fully completed. – serg Dec 21 '15 at 09:20
Reducers can start even before all the mappers have completed. As I explained in the answer, when the reducers start is determined by the parameter mapreduce.job.reduce.slowstart.completedmaps – Manjunath Ballur Dec 21 '15 at 09:41
[This](https://www.cs.rutgers.edu/~pxk/417/notes/content/mapreduce.html) paper says: "Step 5: Reduce: Sort (Shuffle) When all the map workers have completed their work, the master notifies the reduce workers to start working." The parameter you mention controls when Step 4: Map worker: Partition can start, which in Hadoop is part of the reduce phase. – serg Dec 21 '15 at 09:44
It's wrong. Please check the answer here: http://stackoverflow.com/questions/11672676/when-do-reduce-tasks-start-in-hadoop . Also, to quote from the book "Hadoop: The Definitive Guide": "By default, schedulers wait until 5% of the map tasks in a job have completed before scheduling reduce tasks for the same job. For large jobs, this can cause problems with cluster utilization, since they take up reduce containers while waiting for the map tasks to complete. Setting mapreduce.job.reduce.slowstart.completedmaps to a higher value, such as 0.80 (80%), can help improve throughput." – Manjunath Ballur Dec 21 '15 at 09:47
What I mean is that the reduce tasks can start even before all the mappers have completed. But yes, the actual reduction starts only when all the mappers are done processing, because for the reduce phase to give proper output, the output of all the mappers must be ready. Until then, the reducers spend their time copying data from the completed mappers. Only when the data from all the mappers has been copied does the actual reduce start. – Manjunath Ballur Dec 21 '15 at 09:49
One thing you must remember is that counter values are definitive only once a job has successfully completed. – Manjunath Ballur Dec 21 '15 at 11:26
