
Context

Python 3.6.3 :: Anaconda custom (64-bit)
mrjob==0.6.2 with no custom configuration
Running locally

I am implementing the basic word-count example as a local MapReduce job. My mapper emits a 1 for each word in each line of a book read from a .txt file, using a simple regex to extract the words. The reducer counts the number of occurrences of each word, i.e. it sums the 1's grouped under each word.

from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[\w']+")

class WordCounter(MRJob):
  def mapper(self, _, line):
    words = WORD_REGEXP.findall(line)
    for word in words:
      yield word.lower(), 1

  def reducer(self, word, times_seen):
    yield word, sum(times_seen)

if __name__ == '__main__':
  WordCounter.run()

Problem

The output file is correct, but the key-value pairs are not globally sorted. The result appears to be sorted alphabetically only within chunks of the data.

"customers'"    1
"customizing"   1
"cut"   2
"cycle" 1
"cycles"    1
"d" 10
"dad"   1
"dada"  1
"daily" 3
"damage"    1
"deductible"    6
...
"exchange"  10
"excited"   4
"excitement"    1
"exciting"  4
"executive" 2
"executives"    2
"theft" 1
"their" 122
"them"  166
"theme" 2
"themselves"    16
"then"  59
"there" 144
"they've"   2
...
"anecdotes" 1
"angel" 1
"angie's"   1
"angry" 1
"announce"  2
"announced" 1
"announcement"  3
"announcements" 3
"announcing"    2
...
"patents"   3
"path"  19
"paths" 1
"patterns"  1
"pay"   45
"exercise"  1
"exercises" 1
"exist" 6
"expansion" 1
"expect"    11
"expectation"   3
"expectations"  5
"expected"  4
...
"customer"  41
"customers" 122
"yours" 15
"yourself"  78
"youth" 1
"zealand"   1
"zero"  7
"zoho"  1
"zone"  2

Question

Is there some initial configuration to be done in order to obtain globally sorted output from an MRJob?

ekauffmann

1 Answer


You are missing the combiner step. In this guide it's the first example of a single-step job: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html

I'll copy the code here for the sake of completeness:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()
Emilio Aburto
  • The combiner is an optional phase of a MapReduce job that is used to reduce network congestion. I believe it is not the source of my problem. In any case, I tried adding a combiner, but the result remains unsorted. – ekauffmann May 12 '18 at 17:09
  • Oh right, I'm sorry; it seems the algorithm keeps the output sorted only until a certain size is reached and then stops bothering. Have you tried postprocessing the outputs? What I mean is something like this: https://stackoverflow.com/a/5737935 (see the sketch below) – Emilio Aburto May 22 '18 at 22:15
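
A minimal sketch of the postprocessing idea from the last comment, assuming the job's output was redirected to a file (the name counts.txt is a placeholder) and uses mrjob's default tab-separated, JSON-encoded output lines:

import json

# Read the job's output, where each line looks like: "word"<TAB>count.
with open('counts.txt') as f:
    pairs = [line.rstrip('\n').split('\t', 1) for line in f]

# Sort by the decoded key (the word) and print the lines back out.
for key, value in sorted(pairs, key=lambda kv: json.loads(kv[0])):
    print('{}\t{}'.format(key, value))

On the command line, a plain sort counts.txt gives much the same result, since the JSON-encoded key comes first on every line.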