I have a question about when shuffling starts.
Let us say I have 2 mappers and 1 reducer. Each mapper generates an output, map1 and map2 respectively. These outputs are stored on the temporary local disk of the respective datanode.
Does the reducer have to wait for the output of both map1 and map2? In other words, when does shuffling start: as soon as map1 finishes, or does it have to wait for map2 to finish as well?
I am listening for shuffle traffic at the reducer and could not find any, but the console output shows that approximately 70% of the reduce work is already finished:
14/12/18 17:45:55 INFO mapred.JobClient: map 97% reduce 22%
14/12/18 17:45:58 INFO mapred.JobClient: map 98% reduce 22%
14/12/18 17:45:59 INFO mapred.JobClient: map 99% reduce 22%
14/12/18 17:46:07 INFO mapred.JobClient: map 100% reduce 22%
14/12/18 17:46:12 INFO mapred.JobClient: map 100% reduce 67%
14/12/18 17:46:15 INFO mapred.JobClient: map 100% reduce 71%
I only see shuffle traffic coming in after this point.
I am a little confused here. What is this approximately 70% of reducer work?
Thanks