I'm sure I'm doing something dumb here, but here goes. I'm working on a class assignment for my Udacity class "Intro to Map Reduce and Hadoop". Our assignment is to make a mapper/reducer that will count occurrences of a word across our data set (the body of forum posts). I've got an idea of how to do this, but I can't get Python to read in stdin data to the reducer as a dictionary.
Here's my approach so far. The mapper reads through the data (for testing, a string hard-coded in the script) and prints a dictionary of word:count for each forum post:
#!/usr/bin/python
import sys
import csv
import re
from collections import Counter
def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
    for line in reader:
        body = line[4]
        #Counter(body)
        words = re.findall(r'\w+', body.lower())
        c = Counter(words)
        #print c.items()
        print dict(c)
test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"This is one sentence sentence\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Also one sentence!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Hey!\nTwo sentences!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One. Two! Three?\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One Period. Two Sentences\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Three\nlines, one sentence\n\"\t\"\"
"""
# This function allows you to test the mapper with the provided test string
def main():
    import StringIO
    sys.stdin = StringIO.StringIO(test_text)
    mapper()
    sys.stdin = sys.__stdin__

if __name__ == "__main__":
    main()
The word count for each forum post then goes to stdout, one dictionary per line, like this:
{'this': 1, 'is': 1, 'one': 1, 'sentence': 2}
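For reference, that output line corresponds to the first test post. A minimal sketch of the counting step in isolation, written so it runs on Python 3 (the scripts in this post use Python 2 syntax); the variable names `body` and `counts` are just for illustration:

```python
import re
from collections import Counter

# Body field of the first row in the test data above.
body = "This is one sentence sentence"

# Lowercase the text and split it into runs of word characters.
words = re.findall(r"\w+", body.lower())

# Counter tallies each word; dict() turns the tally into a plain dictionary.
counts = dict(Counter(words))
print(counts)
```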
Then the reducer should read each line from stdin as a dictionary:
#!/usr/bin/python
import sys
from collections import Counter, defaultdict
for line in sys.stdin.readlines():
    print dict(line)
But that fails with this error message:
ValueError: dictionary update sequence element #0 has length 1; 2 is required
Which means (if I understand correctly) that it's reading in each line not as a dict, but as a text string. How can I get Python to understand that each input line is a dict? I've tried using Counter and defaultdict, but I either hit the same error or had each character read in as an element of a list, which is also not what I want.
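A minimal sketch that reproduces the error outside Hadoop, assuming each line arrives as the printed form of a Python dict. It also shows one possible way to turn such a line back into a dict, using `ast.literal_eval` from the standard library. This is an illustration written for Python 3, not necessarily the approach the course intends:

```python
import ast

# One line of mapper output, as the reducer would receive it from stdin.
line = "{'this': 1, 'is': 1, 'one': 1, 'sentence': 2}\n"

# dict() iterates over the string one character at a time and expects
# (key, value) pairs, so the first one-character element raises ValueError.
try:
    dict(line)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# ast.literal_eval evaluates a string that contains only a Python literal
# (here a dict of strings and ints) without running arbitrary code.
parsed = ast.literal_eval(line.strip())

print(error_message)
print(parsed["sentence"])
```

The text on stdin is always a string, so some explicit parsing step like this is needed before the reducer can treat it as a dict.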
Ideally, I want the reducer to read in the dict from each line, then add in the values from the next line, so after the second line the running totals are {'this':1,'is':1,'one':2,'sentence':3,'also':1}
and so on.
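The running total described above could be sketched roughly as follows. This assumes every input line is a dict literal as printed by the mapper; `reduce_counts` and `mapper_output` are hypothetical names used only for this example, and the code is written for Python 3:

```python
import ast
from collections import Counter


def reduce_counts(lines):
    """Sum word counts over lines that each hold one dict literal."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Counter.update() adds the incoming counts to the existing ones.
        totals.update(ast.literal_eval(line))
    return totals


# Stand-in for the first two lines the mapper would print.
mapper_output = [
    "{'this': 1, 'is': 1, 'one': 1, 'sentence': 2}\n",
    "{'also': 1, 'one': 1, 'sentence': 1}\n",
]

totals = reduce_counts(mapper_output)
print(dict(totals))
```

In an actual reducer script, `sys.stdin` could be passed in place of `mapper_output`, since the function only needs an iterable of lines.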
Thanks, JR