Does the shuffle and sort phase come before the end of the map task or does it come after the output is generated from the map task so that there is no look back to the map task anymore. This is a 'Map only task' case where I get confusion. If there is no Shuffle and sort in Map only task, can someone explain how is the data written into the final output files.
-
There shouldn't be shuffling in a map-only task. Why do you think there is? – OneCricketeer Mar 06 '17 at 09:09
-
1@cricket_007 I am not saying there is or there will be. Im a bit confused in understanding the concept of shuffle and sort if it comes in Map only task. I added some more info to the question. Hoping it is clear to understand. – Metadata Mar 06 '17 at 09:18
-
The shuffle happens into reducers & combiners, so why would it happen during the map? – OneCricketeer Mar 06 '17 at 09:20
-
And you can run code yourself to see how data is written into output files. Mostly, it'll be numbered `part-m-00xyz` files, for **part**ition of the **m**apper data – OneCricketeer Mar 06 '17 at 09:21
-
@cricket_007 Okay. I will run a program. Appreciate your hints. I can relate what you are saying now. – Metadata Mar 06 '17 at 09:23
1 Answers
When you have a map-only task, there is not shuffling at all, which means that mappers will write the final output directly to the HDFS.
On the other hand, when you have a whole Map-Reduce program, with mappers and reducers, yes, shuffling can start before reduce-phase start.
Quoting this very nice answer in SO:
First of all shuffling is the process of transfering data from the mappers to the reducers, so I think it is obvious that it is necessary for the reducers, since otherwise, they wouldn't be able to have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.
Hope this answer had clarified your confusion.
-
1Thanks for your kind words :) I just wanted to add that in the case of a map-only job, the output of each mapper will not be sorted, as opposed to the case when there is a reduce phase. In the latter case, mappers will sort their output locally, and the corresponding map outputs required by a reduce task will be merge-sorted in the sort phase. – vefthym Mar 06 '17 at 10:37
-
1@vefthym That is a good point, that is how it works by default. I would like to mention that you can also set some flags in order to disable the sorting process even though there is a reduce phase, this will improve the performance in case you need sorting. – dbustosp Mar 06 '17 at 10:47
-
@dbustosp Thanks for the clarification given. I understand the mechanism clearly now. – Metadata Mar 06 '17 at 12:11
-
So there is internal mechanism which understands that the task at hand is map-only or map-reduce one. – Devi Jan 16 '18 at 11:59
-
What will happen if map-only task with multiple mappers? Is order of input still be kept in output? E.g., if an input is `[1, 2, 3, a, b, c]` and 2 mappers take `[1, 2, 3]` and `[a, b, c]` for each, then is there no possibility for output to be `[1, a, 2, b, 3, c]`? Is the output always `[1, 2, 3, a, b, c]`? – ghchoi Mar 29 '18 at 02:53
-
1Hi @GyuHyeonChoi, if you have a map-only job then the number of output files will be the same as the number of mappers you have. Based on your example, if you have a set of data [1, 2, 3, a, b, c]. The first mapper will process [1, 2, 3] and the second will process [a, b, c]. The output will be 2 files one for each mapper. – dbustosp Mar 29 '18 at 17:11