The 'group' operation of the shuffle is supposed to turn the data into <key, List<value>>
form, but my reducer.py never sees that list: it just keeps reading <key, value> pairs,
one line at a time, from standard input.
Look at the code below:

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
So why does it behave this way? Is Hadoop Streaming converting the <key, List<value>>
data back into <key, value> pairs before writing it to standard input? If so, why is the
'group' operation needed at all? Doesn't the 'sort' operation already bring identical keys
together, so that feeding the input line by line into reducer.py would work just the same?
reducer.py:

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

if current_word == word:
    print('%s\t%s' % (current_word, current_count))

sys.exit(0)
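To show what the script actually receives, here is a minimal local simulation of the same aggregation logic, fed with hand-written sorted mapper output (the sample data is hypothetical, and `reduce_stream` is just a test helper, not part of Hadoop):

```python
import io

# Hypothetical sorted mapper output, exactly as Hadoop Streaming feeds it
# to the reducer: one tab-separated <key, value> pair per line, sorted by key.
mapper_output = "a\t2\na\t4\nb\t2\nb\t3\n"

def reduce_stream(stream):
    """Same line-by-line aggregation as reducer.py, collected into a list."""
    results = []
    current_word, current_count = None, 0
    for line in stream:
        word, count = line.strip().split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # skip malformed counts, as reducer.py does
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results

print(reduce_stream(io.StringIO(mapper_output)))
# [('a', 6), ('b', 5)]
```

Note that nothing here ever sees a List<value>; the logic only relies on equal keys arriving on consecutive lines.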
Suppose we have a word-frequency example that counts the occurrences of a, b, c and d.
1. With the 'group' operation, the data becomes something like:
(b,[2,3])
(c,[1,5])
(d,[3,6])
(a,[2,4])
2. With the 'sort' operation, the data becomes:
(a,[2,4])
(b,[2,3])
(c,[1,5])
(d,[3,6])
3. When reducer.py receives the data, it looks like:
(a,2)
(a,4)
(b,2)
(b,3)
(c,1)
(c,5)
(d,3)
(d,6)
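The grouping described in stage 1 can be mimicked in plain Python with itertools.groupby, which (like Hadoop) only merges adjacent items and therefore needs the sort to happen first (the pair list below is just the example data from above):

```python
from itertools import groupby

# The sorted stage-2 pairs from the example above.
pairs = [('a', 2), ('a', 4), ('b', 2), ('b', 3),
         ('c', 1), ('c', 5), ('d', 3), ('d', 6)]

# groupby only merges adjacent equal keys, which is why sorting must
# come before grouping -- the same ordering Hadoop's shuffle guarantees.
grouped = [(key, [v for _, v in group])
           for key, group in groupby(pairs, key=lambda kv: kv[0])]

print(grouped)
# [('a', [2, 4]), ('b', [2, 3]), ('c', [1, 5]), ('d', [3, 6])]
```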
So I want to know what turns the stage-2 data into the stage-3 form. And if there were no 'group' step:
1. Without the 'group' operation but with the 'sort' operation, the data would still look like:
(a,2)
(a,4)
(b,2)
(b,3)
(c,1)
(c,5)
(d,3)
(d,6)
2. If reducer.py receives the above data, isn't that just as good? I do not understand. :-)
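To check that intuition with the example data: aggregating the sorted pairs line by line (no explicit group) gives the same totals as grouping first and then summing each value list. This is only a sketch over the toy data above, not a claim about Hadoop's internals:

```python
from itertools import groupby

# Sorted pairs, as in the example (already ordered by key).
pairs = [('a', 2), ('a', 4), ('b', 2), ('b', 3),
         ('c', 1), ('c', 5), ('d', 3), ('d', 6)]

# Approach 1: group first, then sum each key's value list.
grouped_totals = {key: sum(v for _, v in group)
                  for key, group in groupby(pairs, key=lambda kv: kv[0])}

# Approach 2: no explicit grouping -- walk the sorted pairs one at a
# time, the way reducer.py consumes its standard input.
line_by_line = {}
for word, count in pairs:
    line_by_line[word] = line_by_line.get(word, 0) + count

print(grouped_totals == line_by_line)
# True
print(grouped_totals)
# {'a': 6, 'b': 5, 'c': 6, 'd': 9}
```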