Context
Python 3.6.3 :: Anaconda custom (64-bit)
mrjob==0.6.2 with no custom configuration
Running locally
I am implementing the basic word count example as a local MapReduce job. My mapper emits a 1 for each word in each line of a book stored in a .txt file, using a simple regex. The reducer counts the number of occurrences of each word, i.e. it sums the 1's grouped under each word.
from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[\w']+")


class WordCounter(MRJob):
    def mapper(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, word, times_seen):
        yield word, sum(times_seen)


if __name__ == '__main__':
    WordCounter.run()
Problem
The counts in the output file are correct, but the key-value pairs are not globally sorted. The result appears to be sorted alphabetically only within chunks of data:
"customers'" 1
"customizing" 1
"cut" 2
"cycle" 1
"cycles" 1
"d" 10
"dad" 1
"dada" 1
"daily" 3
"damage" 1
"deductible" 6
...
"exchange" 10
"excited" 4
"excitement" 1
"exciting" 4
"executive" 2
"executives" 2
"theft" 1
"their" 122
"them" 166
"theme" 2
"themselves" 16
"then" 59
"there" 144
"they've" 2
...
"anecdotes" 1
"angel" 1
"angie's" 1
"angry" 1
"announce" 2
"announced" 1
"announcement" 3
"announcements" 3
"announcing" 2
...
"patents" 3
"path" 19
"paths" 1
"patterns" 1
"pay" 45
"exercise" 1
"exercises" 1
"exist" 6
"expansion" 1
"expect" 11
"expectation" 3
"expectations" 5
"expected" 4
...
"customer" 41
"customers" 122
"yours" 15
"yourself" 78
"youth" 1
"zealand" 1
"zero" 7
"zoho" 1
"zone" 2
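To make the chunking visible, here is a minimal sketch that reports the positions where the alphabetical order breaks. It assumes each output line is a JSON-encoded key, then whitespace, then the count, as in the excerpt above. The helper name `find_order_breaks` is hypothetical and not part of mrjob.

```python
import json


def find_order_breaks(lines):
    """Return the indices where a key sorts before the key on the previous line."""
    # Assumption: each line looks like '"word"<whitespace>count', and the
    # key itself contains no whitespace (true for keys matched by [\w']+).
    keys = [json.loads(line.rsplit(None, 1)[0]) for line in lines if line.strip()]
    return [i for i in range(1, len(keys)) if keys[i] < keys[i - 1]]


# Four consecutive lines taken from the output above.
sample = [
    '"patterns"\t1',
    '"pay"\t45',
    '"exercise"\t1',
    '"exercises"\t1',
]

print(find_order_breaks(sample))  # [2]: "exercise" follows "pay"
```

Each reported index marks the start of a new sorted chunk, so a globally sorted file would give an empty list.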
Question
Is there some initial configuration to be done in order to obtain globally sorted output from an MRJob?
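For reference, the fallback I would like to avoid is sorting the output after the job has finished. A minimal sketch, assuming the same line format as above (JSON-encoded key, whitespace, count); `sort_output_lines` is a hypothetical helper, not an mrjob function, and it does not use any mrjob configuration:

```python
import json


def sort_output_lines(lines):
    """Return the non-empty output lines ordered by their decoded key."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    # Assumption: the key is the JSON string before the last whitespace run.
    return sorted(rows, key=lambda row: json.loads(row.rsplit(None, 1)[0]))


# Lines in the order a chunked output might list them.
unsorted_lines = ['"pay"\t45\n', '"exercise"\t1\n', '"angel"\t1\n']

for row in sort_output_lines(unsorted_lines):
    print(row)
```

This works for a small file that fits in memory, but it is a post-processing step rather than something the job itself guarantees.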