
Let's say I want to find how many times each word appears inside some text.

My understanding was that the text is separated into sections, and each section is passed to map. map would then get the word occurrences for each section and pass the result to reduce, like this:

for each word w in document:
    occurrences[w] += 1

return occurrences

However, according to the MapReduce paper and wikipedia, map would simply emit 1 for each word, like this:

for each word w in document:
    emit(w, 1)

Isn't this basically just the same thing as passing the text section to reduce directly, since it's going to have to iterate through each word anyway?

Also, just to make sure: if I want to sort a large array with MapReduce, would map sort its part of the array, and then reduce would merge the sorted arrays, like in mergesort?

mzee99

2 Answers


Just to review how map-reduce works:

In the word count example that you cited, the map reads the split/section as you mentioned.

While scanning through the section of words, the map doesn't perform the occurrence count; instead, it creates a key-value pair of <"word",1> for every word. This simplifies the downstream aggregation of the words by the reducer.

The map does that so that the reducer which handles that particular "word" can collect all the <"word",1> tuples sent its way and then generate the count by adding all the 1s together.

In short, let's say you have a list of words as follows:

cat
rat
mat
bat
cat
sat
bat

Let's say we have 3 mappers that handle the file splits as follows:

Split1 for mapper1:

cat
rat
mat

Split2 for mapper2:

bat
cat

Split3 for mapper3:

sat
bat

Mapper1 will emit:

<cat,1>
<rat,1>
<mat,1>

Mapper2 will emit:

<bat,1>
<cat,1>

Mapper3 will emit:

<sat,1>
<bat,1>

Although the reality is a little more complex, ideally you have one reducer for each word, and each reducer receives the tuples for that word from every mapper.

So the reducer for cat receives: <cat,1>, <cat,1>
The reducer for rat receives: <rat,1>
The reducer for mat receives: <mat,1>
The reducer for bat receives: <bat,1>, <bat,1>
The reducer for sat receives: <sat,1>

Each of the reducers adds up all the tuples it has received and gets an aggregate value as follows:

<cat,2>
<rat,1>
<mat,1>
<bat,2>
<sat,1>

That's how map-reduce implements the word-count. The idea is to parallelize the count operation.
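
For reference, here is a minimal sketch of that logic using the Hadoop MapReduce API; the class names TokenizerMapper and IntSumReducer are just illustrative, not something mandated by the framework:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The map emits <word, 1> for every word it sees; it keeps no per-word state.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(line.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE); // emit <"word", 1>
            }
        }
    }

    // The reduce receives <word, [1, 1, ...]> and adds the 1s together.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            result.set(sum);
            context.write(key, result); // emit <"word", total>
        }
    }
}

Note that the mapper holds no per-word state; all of the aggregation happens in the reducer, which is exactly the split described above.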

As far as your question about sorting goes, it is more of a "bucketing" trick than a "merge". The map-reduce framework would internally sort the data and stream it to the reducer in sorted order.
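
As a rough sketch of that idea, assuming the numbers to sort arrive one per line of text input, the mapper can simply re-emit each number as the key and let the framework's shuffle/sort phase do the sorting (class names here are illustrative):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SortSketch {

    // Map: emit the value itself as the key, so the shuffle sorts it for us.
    public static class SortMapper
            extends Mapper<LongWritable, Text, LongWritable, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            long number = Long.parseLong(line.toString().trim());
            context.write(new LongWritable(number), NullWritable.get());
        }
    }

    // Reduce: keys arrive already sorted, so just write them back out,
    // once per occurrence to preserve duplicates.
    public static class SortReducer
            extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {
        @Override
        protected void reduce(LongWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            for (NullWritable ignored : values) {
                context.write(key, NullWritable.get());
            }
        }
    }
}

With a single reducer the output is completely sorted; with several reducers each reducer's output file is sorted, and a range partitioner such as Hadoop's TotalOrderPartitioner keeps the buckets globally ordered, which is the "bucketing" mentioned above.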

Please check this post for more details.

sc_ray
  • Hey, let's say split1 contained `cat`, `cat`, `mat`. Why wouldn't I configure map to emit `<cat,2>`, `<mat,1>` rather than `<cat,1>`, `<cat,1>`, `<mat,1>`? Thanks – mzee99 Nov 02 '15 at 04:56
  • 1
    @mzee99 - You can. But that would require your map to hold on to state for each of your words. Combiners(https://hadooptutorial.wikispaces.com/Custom+combiner) are typically used to perform the step that you mentioned. – sc_ray Nov 02 '15 at 05:01
  • Is there any difference in terms of efficiency one way or the other? – mzee99 Nov 02 '15 at 05:04
  • Use the same class for Combiner & Reducer – Ravindra babu Nov 02 '15 at 05:59

If you want your Mapper to do part of the Reducer's job by emitting aggregated pairs such as <word,count>, use a Combiner, which is a semi-reducer. The Combiner works on the output of the Mapper and does a reducer job on it.

If you implement a custom Partitioner, Shuffle, and Reducer, it will be more effective.

The Partitioner will make sure that the reducers are load balanced (a sketch of a custom Partitioner follows after these points).

The Shuffle will make sure that a particular key from a Mapper is sent to one particular reducer.

The Combiner will do a mini reducer job on each Mapper's output, combining it before it is shuffled to the reducers.

Sorting will sort the Mapper output by key before it reaches the Reducer.

When a Combiner is used, most of the time the Combiner and Reducer are set to the same class.
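
For illustration only, a custom Partitioner is just a class with a getPartition method that decides which reducer a key goes to; the sketch below (WordPartitioner is a made-up name) mirrors what Hadoop's default HashPartitioner already does:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class).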

Even with a combiner, the reducer may still receive w,[1,1] instead of w,[2], because each combiner only sees the output of a single mapper.

// Set the Combiner class to the WordcountReducer class.
job.setCombinerClass(WordcountReducer.class);
job.setReducerClass(WordcountReducer.class);
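
For context, a minimal, illustrative driver showing where those two lines fit could look like this (WordcountMapper is an assumed name for the word-count mapper class):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordcountDriver.class);

        job.setMapperClass(WordcountMapper.class);
        // Reuse the reducer as the combiner: it pre-aggregates each mapper's
        // <word,1> pairs before they are shuffled across the network.
        job.setCombinerClass(WordcountReducer.class);
        job.setReducerClass(WordcountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}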

Have a look at this detailed example, this SE question, and SE Question 2.

Ravindra babu