0

I have question of when shuffling starts.

Let u say i have 2 mappers and 1 reducers. Each mappers will generate output map1 and map2. This map1 and map2 is stored in temporary disk of respective datanode.

Now reducer should wait for both the output of map1 and map2 ? In other-words when does shuffling start? as soon as map1 finishes or it has to wait for map2 to finish as well ?

I am listening to shuffling traffic at reducer and i couldnt find any traffic but console output shows already 70% (approximately) of reducing is finished.

14/12/18 17:45:55 INFO mapred.JobClient:  map 97% reduce 22%
14/12/18 17:45:58 INFO mapred.JobClient:  map 98% reduce 22%
14/12/18 17:45:59 INFO mapred.JobClient:  map 99% reduce 22%
14/12/18 17:46:07 INFO mapred.JobClient:  map 100% reduce 22%
14/12/18 17:46:12 INFO mapred.JobClient:  map 100% reduce 67%
14/12/18 17:46:15 INFO mapred.JobClient:  map 100% reduce 71%

I am seeing shuffling traffic traffic comes in after this point.

I am getting little confused here. What is this approximately 70% of reducer work ? !

Thanks

navaz
  • 125
  • 1
  • 2
  • 15
  • Take a look at this SF Question: http://stackoverflow.com/questions/11672676/when-do-reduce-tasks-start-in-hadoop – Ashrith Dec 19 '14 at 06:03

3 Answers3

1

In your reducer... First 33% is copy phase, then next 33% is shuffle and sort phase and then final 33% is your actual reduce operation.

I will try to explain a simple flow: After the map task is completed, the output of map task is to be copied where reduce tasks are suppose to happen. Map and Reduce happens doesnt happen in the same machine.. When some mappers are completed you will notice some increment in reduce phase, even before the full map phase has happened.. It is the data outputted by those completed map tasks which are being copied. The map tasks which are completed can be now copied where reduce tasks are bound to happen.. Shuffling only starts after full map phase is over.. This is because, the output keys are to be sorted.. and you cannot sort until you have the full keyspace.. right..??

frazman
  • 32,081
  • 75
  • 184
  • 269
  • Thank You. What is 33% copy phase? copy to where ? Shuffling starts only after all mapping finished ? – navaz Dec 19 '14 at 01:22
  • "Copy phase" refers to the moving of output data from your mappers to the proper reducer. Depending on your configuration, this starts when the mappers are nearing completion. shuffle-and-sort then starts after the mappers are done. – Nonnib Dec 19 '14 at 08:55
  • I am little confused here between copy and shuffle phase. @Fraz I think shuffle phase is nothing but moving data from mappers to reducers. If so what does that initial 33% ? – navaz Dec 19 '14 at 15:14
  • No... copy phase is the movement of data from map nodes to where reduce nodes are.. in your mapper.. you specify the input key, value and output key, value.. This output key value is copied to reduce nodes.. After that, the data is shuffled and sorted based on the map output key.. so that all the same keys are grouped together.. which is the input to your reduce phase – frazman Dec 19 '14 at 20:32
1

Actually sort happen in both map and reduce sides. It is clearly explained in the Definitive guide

Tom Sebastian
  • 3,373
  • 5
  • 29
  • 54
1

The Shuffle and Sort phases together is called "Copy" phase. The sort is done in RAM. If the external sort is needed due to memory lack, the merge sort will happen. So we write sort/merge.

Actually, each Map task has 3 phases: Map, Partitioning, Sort/Merge. Each Reduce task has 3 phases: Shuffle, Sort/Merge, Reduce.

In Hadoop, the shuffle phase starts when 5% of all map task outputs are generated. In this strategy, although the shuffle phase starts earlier to mitigate the job execution time, but it causes the repetitive merges and more disk accesses in reduce side which again results in prolonging job execution time.

Kalle Richter
  • 8,008
  • 26
  • 77
  • 177