Just to review how MapReduce works:
In the word-count example you cited, each map task reads its split/section, as you mentioned. While scanning through the words in its section, the map does not perform the occurrence count itself; instead, it emits a key-value pair of <"word",1> for every word it sees. This simplifies the downstream aggregation by the reducer: the reducer that handles that particular "word" collects all the <"word",1> tuples sent its way and generates the count by adding all the 1s together.
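As a rough sketch (hypothetical, not tied to any particular framework), the map step can be written as a generator that emits one <word,1> pair per word:

```python
def mapper(section):
    """Emit a <word, 1> pair for every word in the split.
    No counting happens here; that is the reducer's job."""
    for word in section.split():
        yield (word, 1)

# A split containing "cat rat cat" produces three pairs, not a count of 2:
print(list(mapper("cat rat cat")))  # [('cat', 1), ('rat', 1), ('cat', 1)]
```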
In short, let's say you have a list of words as follows:
cat
rat
mat
bat
cat
sat
bat
Let's say we have 3 mappers that handle the file splits as follows:
Split1 for mapper1:
cat
rat
mat
Split2 for mapper2:
bat
cat
Split3 for mapper3:
sat
bat
Mapper1 will emit:
<cat,1>
<rat,1>
<mat,1>
Mapper2 will emit:
<bat,1>
<cat,1>
Mapper3 will emit:
<sat,1>
<bat,1>
The reality is a little more complex, but ideally you have one reducer for each word, and it receives the tuples from each of the mappers.
So the reducer for cat receives: <cat,1>, <cat,1>
The reducer for rat receives: <rat,1>
The reducer for mat receives: <mat,1>
The reducer for bat receives: <bat,1>,<bat,1>
The reducer for sat receives: <sat,1>
Each reducer adds up all the tuples it has received and produces an aggregate value as follows:
<cat,2>
<rat,1>
<mat,1>
<bat,2>
<sat,1>
That's how MapReduce implements word count. The idea is to parallelize the count operation.
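The whole walk-through above can be simulated in a few lines of plain Python (a toy single-process sketch, not a real distributed run; the split strings and function names are made up for illustration):

```python
from itertools import groupby
from operator import itemgetter

def mapper(split):
    # Map phase: emit <word,1> for every word in the split.
    for word in split.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: add up the 1s for one word.
    return (word, sum(counts))

# The three splits from the example above.
splits = ["cat rat mat", "bat cat", "sat bat"]

# Each split is processed independently (in parallel on a real cluster).
pairs = [pair for split in splits for pair in mapper(split)]

# Shuffle/sort: group the tuples by word so each "reducer" sees
# all the pairs for its key together.
pairs.sort(key=itemgetter(0))
counts = dict(reducer(word, (c for _, c in group))
              for word, group in groupby(pairs, key=itemgetter(0)))

print(counts)  # {'bat': 2, 'cat': 2, 'mat': 1, 'rat': 1, 'sat': 1}
```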
As far as your question about sorting goes, it is more of a "bucketing" trick than a "merge": the MapReduce framework internally sorts the intermediate data and streams it to each reducer in sorted key order.
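The "bucketing" works roughly like this: a partition function maps each key to a reducer, so every <word,1> pair for the same word lands in the same bucket. Below is a toy stand-in (the `partition` function and character-sum hash are hypothetical; Hadoop's default is a hash partitioner over the key):

```python
def partition(key, num_reducers):
    """Toy deterministic hash: the same word always maps to the same
    reducer, which is the property the real partitioner guarantees."""
    return sum(ord(c) for c in key) % num_reducers

# Route the example words into 3 reducer buckets.
buckets = {r: [] for r in range(3)}
for word in ["cat", "rat", "mat", "bat", "cat", "sat", "bat"]:
    buckets[partition(word, 3)].append((word, 1))
```

Note that both occurrences of "bat" necessarily end up in the same bucket, so one reducer can total them without ever seeing the other words.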
Please check this post for more details.