The 'group' operation of the shuffle is supposed to turn the data into <key, List<value>>
form, but my reducer.py never sees that list: it just keeps reading <key, value> pairs,
one line at a time, from standard input.
Look at the code below:

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
So why does it behave this way? Is Hadoop Streaming converting the <key, List<value>>
data back into <key, value> pairs before writing it to standard input? If so, why is the
'group' operation needed at all? Doesn't the 'sort' operation already bring identical keys
together, so that feeding the input line by line into reducer.py would work just the same?
reducer.py:

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

if current_word == word:
    print('%s\t%s' % (current_word, current_count))

sys.exit(0)
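To show what the script actually receives, here is a minimal local simulation of the same aggregation logic, fed with hand-written sorted mapper output (the sample data is hypothetical, and `reduce_stream` is just a test helper, not part of Hadoop):

```python
import io

# Hypothetical sorted mapper output, exactly as Hadoop Streaming feeds it
# to the reducer: one tab-separated <key, value> pair per line, sorted by key.
mapper_output = "a\t2\na\t4\nb\t2\nb\t3\n"

def reduce_stream(stream):
    """Same line-by-line aggregation as reducer.py, collected into a list."""
    results = []
    current_word, current_count = None, 0
    for line in stream:
        word, count = line.strip().split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # skip malformed counts, as reducer.py does
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results

print(reduce_stream(io.StringIO(mapper_output)))
# [('a', 6), ('b', 5)]
```

Note that nothing here ever sees a List<value>; the logic only relies on equal keys arriving on consecutive lines.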
Suppose we have a word-frequency example that counts the occurrences of a, b, c and d.
1. With the 'group' operation, the data becomes something like:
(b,[2,3])
(c,[1,5])
(d,[3,6])
(a,[2,4])
2. With the 'sort' operation, the data becomes:
(a,[2,4])
(b,[2,3])
(c,[1,5])
(d,[3,6])
3. When reducer.py receives the data, it looks like:
(a,2)
(a,4)
(b,2)
(b,3)
(c,1)
(c,5)
(d,3)
(d,6)
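The grouping described in stage 1 can be mimicked in plain Python with itertools.groupby, which (like Hadoop) only merges adjacent items and therefore needs the sort to happen first (the pair list below is just the example data from above):

```python
from itertools import groupby

# The sorted stage-2 pairs from the example above.
pairs = [('a', 2), ('a', 4), ('b', 2), ('b', 3),
         ('c', 1), ('c', 5), ('d', 3), ('d', 6)]

# groupby only merges adjacent equal keys, which is why sorting must
# come before grouping -- the same ordering Hadoop's shuffle guarantees.
grouped = [(key, [v for _, v in group])
           for key, group in groupby(pairs, key=lambda kv: kv[0])]

print(grouped)
# [('a', [2, 4]), ('b', [2, 3]), ('c', [1, 5]), ('d', [3, 6])]
```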
So I want to know what turns the stage-2 data into the stage-3 form. And if there were no 'group' step:
1. Without the 'group' operation but with the 'sort' operation, the data would still look like:
(a,2)
(a,4)
(b,2)
(b,3)
(c,1)
(c,5)
(d,3)
(d,6)
2. If reducer.py receives the above data, isn't that just as good? I do not understand. :-)
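To check that intuition with the example data: aggregating the sorted pairs line by line (no explicit group) gives the same totals as grouping first and then summing each value list. This is only a sketch over the toy data above, not a claim about Hadoop's internals:

```python
from itertools import groupby

# Sorted pairs, as in the example (already ordered by key).
pairs = [('a', 2), ('a', 4), ('b', 2), ('b', 3),
         ('c', 1), ('c', 5), ('d', 3), ('d', 6)]

# Approach 1: group first, then sum each key's value list.
grouped_totals = {key: sum(v for _, v in group)
                  for key, group in groupby(pairs, key=lambda kv: kv[0])}

# Approach 2: no explicit grouping -- walk the sorted pairs one at a
# time, the way reducer.py consumes its standard input.
line_by_line = {}
for word, count in pairs:
    line_by_line[word] = line_by_line.get(word, 0) + count

print(grouped_totals == line_by_line)
# True
print(grouped_totals)
# {'a': 6, 'b': 5, 'c': 6, 'd': 9}
```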