
I am completely new to MapReduce and just can't get my mind around the need to sort the mapper output by key within each partition. Eventually, all we want is for each reducer to be fed a partition consisting of several <key, list-of-values> pairs, where each key is unique not just within its own partition but across all the partitions fed to the different reducers.

Why is a sort needed at any stage to accomplish that? Can't we use a hash table to group the values corresponding to the same key?

To break it down stage by stage: at the mapper stage, for each output pair we simply hash the key to find the partition number, then append the pair to a linked list of all pairs belonging to that partition. At the end, the output of a single mapper would be a hash table in which each partition number maps to a linked list of <key, value> pairs, with no key-based order whatsoever, i.e. no locality for similar keys.
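The mapper-side scheme described above could be sketched like this (a minimal illustration with a hypothetical helper name, not Hadoop's actual API):

```python
from collections import defaultdict

def partition_mapper_output(pairs, num_reducers):
    """Hash each key to a partition and append the pair, unsorted."""
    partitions = defaultdict(list)          # partition number -> [(key, value), ...]
    for key, value in pairs:
        p = hash(key) % num_reducers        # partitioner: hash of key picks a reducer
        partitions[p].append((key, value))  # arrival order is kept; nothing is sorted
    return partitions

# All pairs with the same key land in the same partition, but within a
# partition's list there is no key-based locality.
out = partition_mapper_output([("b", 1), ("a", 2), ("b", 3)], num_reducers=2)
```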

Then the partitions from the different mapper tasks are shuffled to the reducers. We now need to make sure that we first group all the values corresponding to the same key (a kind of merge) and then feed each merged <key, list-of-values> pair to a separate invocation of the reduce function. Here again we could use a hash table: we simply iterate through the partition, map each key to a slot in the hash table, and append the corresponding value to the linked list there. Wouldn't this method save more time than sorting the output of each mapper?
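The reducer-side grouping proposed above amounts to something like the following sketch (illustrative only; `group_by_key` is a made-up name):

```python
from collections import defaultdict

def group_by_key(shuffled_pairs):
    """Group values by key with a hash table instead of a sort."""
    groups = defaultdict(list)       # key -> list of values
    for key, value in shuffled_pairs:
        groups[key].append(value)    # O(1) expected per pair, but every distinct
                                     # key must stay resident in memory at once
    return groups

groups = group_by_key([("a", 1), ("b", 2), ("a", 3)])
# groups == {"a": [1, 3], "b": [2]}
```

The memory requirement in the comment is exactly what the accepted answer below pushes back on.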

I have already gone through the link (I currently can't comment on the thread, so I wrote a separate question). The top answer mentions that:

Sorting saves time for the reducer, helping it easily distinguish when a new reduce task should start. It simply starts a new reduce task, when the next key in the sorted input data is different than the previous, to put it simply. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method which takes a key-list(value) input, so it has to group values by key. It's easy to do so, if input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers)

But again, can't we do the same thing with a hash table?

hesk

1 Answer


Well, yeah, you could use a hash table as long as everything fits in memory. But once the amount of data you're working with exceeds your computer's memory capacity, you have a problem.

The solution is to output data to a disk file and do an external sort.
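A minimal sketch of the external-sort idea (pure Python, with in-memory lists standing in for on-disk spill files): each run is sorted independently, then the runs are merge-sorted with `heapq.merge`, which streams the merge so the full data set never has to fit in memory. Because identical keys now arrive consecutively, grouping falls out for free, as the quoted answer describes.

```python
import heapq
import itertools

def merge_sorted_runs(runs):
    """Sort each run (spill file), then stream a merge-sorted grouping."""
    sorted_runs = [sorted(run) for run in runs]   # one bounded-size sort per run
    merged = heapq.merge(*sorted_runs)            # lazy k-way merge of the runs
    for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
        # identical keys are adjacent in sorted order, so a new reduce group
        # starts exactly when the key changes -- no hash table needed
        yield key, [v for _, v in group]

runs = [[("b", 1), ("a", 2)], [("a", 3), ("c", 4)]]
result = list(merge_sorted_runs(runs))
# result == [("a", [2, 3]), ("b", [1]), ("c", [4])]
```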

Jim Mischel