Calling MapReduce Twice

Question

I'm following the word count tutorial here: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0

and I can produce how often a word appears in this format:

word frequency
1    1
2    2
3    3
4    1
5    2
6    1

However, now I need to group the frequency like this:

frequency   count
1           3
2           2
3           1

Basically, for each frequency, find out how often that appeared. How would I modify the code to show this? I feel like I have to modify IntSumReducer but I've never really worked with Hadoop.

score 1 · Accepted Answer · answered Apr 03 '17 at 17:21


Instead of modifying IntSumReducer from the example, you should create a new job altogether that works off the output of the word count program.

Your Mapper will need to output the frequency as the key and the integer 1 as the value. You can write your own reducer or just reuse the reducer from the example.
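A minimal sketch of such a mapper, not code from the original answer. It assumes the first job wrote its results with the default text output format (one `word<TAB>frequency` pair per line) and that the second job reads them with the default text input format. The class name `FrequencyMapper` and the helper `extractFrequency` are made up for illustration. The key is emitted as `Text` so that the tutorial's `IntSumReducer`, which is declared on `Text` keys, can be reused unchanged.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for the second job: reads word count output lines
// and emits (frequency, 1) for each of them.
public class FrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text frequency = new Text();

    // Returns the second tab-separated column, or null if the line has none.
    // Assumes the "word<TAB>frequency" layout described above.
    static String extractFrequency(String line) {
        String[] columns = line.split("\t");
        return columns.length < 2 ? null : columns[1].trim();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String value = extractFrequency(line.toString());
        if (value == null) {
            return; // skip blank or malformed lines
        }
        frequency.set(value);
        context.write(frequency, ONE);
    }
}
```

In the driver, the second job would be submitted only after the first job's `waitForCompletion(true)` returns true, with the first job's output directory as its input path. One caveat of keeping `Text` keys: they sort as strings, so a frequency of 10 is listed before 2 in the final output.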


alpeshpandya


Do I need a Mapper and Reducer? – user1883614 Apr 03 '17 at 17:24
Yes. But as I mentioned in the answer, you can use the example reducer and just need a custom mapper. – alpeshpandya Apr 03 '17 at 17:56

score 0 · Answer 2 · edited Oct 18 '21 at 03:07

We have to write a mapper function that works on the output of the word count program. Each input line holds one word and its frequency, so the mapper emits one pair per line:

map(line):
    frequency = 2nd column of the word count output line
    emit <frequency, 1>

Now reduce. All values emitted for the same frequency arrive together as a list; for the example above that gives <1,[1,1,1]>, <2,[1,1]>, <3,[1]>. The reducer sums each list:

reduce(key, list):
    sum = 0
    for each value in list:
        sum += value
    emit <key, sum>
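To check the logic without a cluster, the pseudocode can be traced in plain Java. This is only a sketch: the class and method names are made up, the input is assumed to be whitespace-separated `word frequency` lines, and the grouping loop stands in for the shuffle that Hadoop performs between map and reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory trace of the second pass; no Hadoop involved.
class FrequencyCountSketch {

    // map: return the frequency (2nd column) that becomes the key.
    static int map(String line) {
        String[] columns = line.split("\\s+");
        return Integer.parseInt(columns[1]);
    }

    // reduce: sum the 1s collected for a single frequency.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<Integer, Integer> run(List<String> wordCountOutput) {
        // Stand-in for the shuffle: group the emitted 1s by key.
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (String line : wordCountOutput) {
            grouped.computeIfAbsent(map(line), k -> new ArrayList<>()).add(1);
        }
        // Apply reduce once per key.
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // The word count output from the question.
        List<String> lines = Arrays.asList("1\t1", "2\t2", "3\t3", "4\t1", "5\t2", "6\t1");
        System.out.println(run(lines)); // prints {1=3, 2=2, 3=1}
    }
}
```

The result matches the table asked for in the question: frequency 1 occurs three times, 2 twice, and 3 once.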
