
I'm sure I'm doing something dumb here, but here goes. I'm working on a class assignment for my Udacity class "Intro to Map Reduce and Hadoop". Our assignment is to make a mapper/reducer that will count occurrences of a word across our data set (the body of forum posts). I've got an idea of how to do this, but I can't get Python to read in stdin data to the reducer as a dictionary.

Here's my approach thus far. The mapper reads through the data (here, a test string embedded in the code) and prints a dictionary of word:count for each forum post:

#!/usr/bin/python
import sys
import csv
import re
from collections import Counter


def mapper():
    # Each input record is a tab-separated forum post; the body is the fifth field.
    reader = csv.reader(sys.stdin, delimiter='\t')

    for line in reader:
        body = line[4]
        # Lowercase the body and split it into words.
        words = re.findall(r'\w+', body.lower())
        c = Counter(words)
        # Emit one dict of word counts per post.
        print dict(c)





test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"This is one sentence sentence\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Also one sentence!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Hey!\nTwo sentences!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One. Two! Three?\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One Period. Two Sentences\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Three\nlines, one sentence\n\"\t\"\"
"""

# This function allows you to test the mapper with the provided test string
def main():
    import StringIO
    sys.stdin = StringIO.StringIO(test_text)
    mapper()
    sys.stdin = sys.__stdin__

if __name__ == "__main__":
    main()

The word count for each forum post then goes to stdout, like this: {'this': 1, 'is': 1, 'one': 1, 'sentence': 2}

The reducer should then read each line of stdin as a dictionary:

#!/usr/bin/python
import sys
from collections import Counter, defaultdict
for line in sys.stdin.readlines():
    print dict(line)

But that fails with this error message: ValueError: dictionary update sequence element #0 has length 1; 2 is required

Which means (if I understand correctly) that it's reading each line not as a dict, but as a text string. How can I get Python to understand that the input line is a dict? I've tried using Counter and defaultdict, but I either hit the same problem or had each character read in as an element of a list, which is also not what I want.

Ideally, I want the reducer to read in the dict from each line, then add the values from the next line, so that after the second line the values are {'this':1,'is':1,'one':2,'sentence':3,'also':1}, and so on.
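To make the error concrete, here is a minimal sketch of what `dict()` sees when it is handed one line of the mapper's output. It is written with the Python 3 `print()` function so it can be run on its own, whereas the code in the question is Python 2; the variable names are only illustrative.

```python
# One line as the reducer receives it from the mapper's stdout.
line = "{'this': 1, 'is': 1, 'one': 1, 'sentence': 2}\n"

# Iterating over a string yields single characters, not (key, value) pairs.
first_items = list(line)[:3]

try:
    dict(line)
    error_message = None
except ValueError as exc:
    # dict() wants 2-item pairs, but each element here is a 1-character string.
    error_message = str(exc)

# dict() works once it is given real two-item pairs.
pairs_ok = dict([("this", 1), ("is", 1)])

print(first_items)
print(error_message)
print(pairs_ok)
```

So the line really is just text: it has to be parsed back into a dictionary before it can be used as one.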

Thanks, JR

jrubins
  • You should consider reading the string, *then* parsing it. If you do that you just need to search for "parse string as dict python". Maybe [this](http://stackoverflow.com/a/988251/645270)? – keyser Aug 12 '14 at 18:14
  • A string is not a valid argument for a `dict()` constructor. You'd need to use something like [`ast.literal_eval()`](https://docs.python.org/2/library/ast.html#ast.literal_eval) to parse a Python dictionary from a string. Or possibly serialize and deserialize your data structure using the `json` module. – Lukas Graf Aug 12 '14 at 18:14
  • Your `line` is a single string value. – Tritium21 Aug 12 '14 at 18:15
  • Why don't you just call `mapper`? That's what the original code is doing. – nneonneo Aug 12 '14 at 18:16
  • I don't really see why this question was downvoted. This might be an odd contraption, but the question itself is just fine, was clearly written with some care and contains everything needed to reproduce the problem. – Lukas Graf Aug 12 '14 at 18:24
  • Thanks Lukas, keyser. I ended up using the ast.literal_eval method. That solved my problem. I'll post the solution below. It's definitely an odd contraption, Lukas, nneonneo, but the reason is that this is a map/reduce program, so it's designed to be distributed, and I can't just have one program that counts all the values (which would be easy). Instead, I could have multiple mappers counting the values of different sections of the input file, which then get output to multiple reducers (or in this case, one reducer) that sum up all the values the mappers give them. – jrubins Aug 12 '14 at 18:43
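The `json` alternative mentioned in the comments might look roughly like the sketch below. It is not what the asker ended up using; the helper names `encode_post` and `decode_and_sum` are hypothetical, and the code is written to run on Python 3.

```python
import json
import re
from collections import Counter


def encode_post(body):
    """Mapper side: count the words in one post and serialize the counts as JSON."""
    words = re.findall(r'\w+', body.lower())
    # Counter is a dict subclass, so json.dumps can serialize it directly.
    return json.dumps(Counter(words))


def decode_and_sum(lines):
    """Reducer side: parse each JSON line and add it to a running total."""
    totals = Counter()
    for line in lines:
        totals.update(json.loads(line))
    return totals


lines = [
    encode_post("This is one sentence sentence"),
    encode_post("Also one sentence!"),
]
totals = decode_and_sum(lines)
print(dict(totals))
```

JSON has the side benefit that the intermediate lines are not tied to Python's `repr` format, so a mapper and reducer written in different languages could still exchange them.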

1 Answer

Thanks to @keyser, the ast.literal_eval() method worked for me. Here's what I have now:

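#!/usr/bin/python
import sys
import ast
from collections import Counter

c = Counter()
for line in sys.stdin:
    # Each line is one dict printed by the mapper, e.g. {'one': 1, 'sentence': 2}
    line_dict = ast.literal_eval(line)
    # Counter.update adds to the existing counts instead of replacing them
    c.update(line_dict)
print c.most_common()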
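For checking the reducer logic without piping data through stdin, the same idea can be wrapped in a function. This is only a sketch: `sum_counts` is a hypothetical helper, the blank-line handling is an addition that the reducer above does not have, and the code targets Python 3.

```python
import ast
from collections import Counter


def sum_counts(lines):
    """Add up per-post word counts, one printed dict per line (hypothetical helper)."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, which literal_eval would reject
        # literal_eval only evaluates Python literals, so it is safer than eval().
        totals.update(ast.literal_eval(line))  # update() adds to existing counts
    return totals


# The first two posts from the question's test data, as the mapper prints them.
sample = [
    "{'this': 1, 'is': 1, 'one': 1, 'sentence': 2}\n",
    "{'also': 1, 'one': 1, 'sentence': 1}\n",
]
totals = sum_counts(sample)
print(dict(totals))
```

After these two lines the totals match the running result described in the question: 'one' is 2 and 'sentence' is 3.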
jrubins