I'm practicing working with large data sets on AWS using MapReduce and Python.
I have the following code:
import sys
import re
import csv
import glob
import string
#class MyDialect(csv.Dialect):
#strict = True
#skipinitialspace = False
#quoting = QUOTE_MINIMAL
#delimiter = ','
#quotechar = '"'
for line in sys.stdin:
    csv.reader(line, dialect='excel')
    #reader = csv.reader(line, delimiter=',', quoting=csv.QUOTE_ALL, quotechar='"')
    #line = line.strip()
    #unpacked = line.split(",")
    try:
        #regular expression
        num,title,year,length,budget,rating,votes,r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,mpaa,Action,Animation,Comedy,Drama,Documentary,Romance,Short = line.split(",")
        if float(rating) <= 1:
            results = [votes, rating, title, year]
            print("\t".join(results))
    except ValueError:
        pass
Now, I know this isn't perfect. It outputs the values from the line, but whenever I try to use the csv module on the line I get
<_csv.reader object at 0x7fc2c184e280>
for all my lines.
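For context, that output is what appears when the reader object itself is printed instead of being iterated. Here is a minimal sketch of the difference, assuming a made-up, shortened sample line (three fields rather than the real column layout) in which the title is wrapped in double quotes:

```python
import csv

# Hypothetical, shortened sample line (three fields only, for illustration).
line = '42,"Blair Witch, The",1999\n'

# csv.reader expects an iterable of lines, so a single string is wrapped in a list.
# Printing `reader` directly would only show the object, not the parsed data.
reader = csv.reader([line])

# Pulling an item from the reader is what produces the list of parsed fields.
row = next(reader)

# A plain split ignores the quotes and cuts the title at its comma.
naive = line.strip().split(",")

print(row)    # three fields, title kept whole
print(naive)  # four fields, title split in two
```

Passing the bare string (without the list) would make the reader treat each character as its own line, which is probably not what is wanted here.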
I need to read the input one line at a time and write the output to stdout, since this is one node processing the data and passing it to the reducer. I have most of the bugs worked out, but it doesn't accept titles with a comma in them. So "Blair witch, the" would be skipped and not shown in the list, as I believe the budget becomes the rating and the rating becomes the votes.
Any idea on how to do this?
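One possible direction, offered as a rough sketch rather than a tested solution for this data set: hand the whole input stream to csv.reader and pick fields by position. The sketch assumes that titles containing commas are quoted in the file, that no quoted field spans more than one line, and that the 25-column layout matches the unpacking above. The names map_rows and EXPECTED_FIELDS are made up for illustration.

```python
import csv
import sys

# Column count taken from the unpacking in the mapper above (num ... Short).
EXPECTED_FIELDS = 25


def map_rows(lines):
    """Yield tab-separated 'votes, rating, title, year' for rows rated <= 1."""
    # csv.reader accepts any iterable of lines and keeps quoted commas intact.
    for row in csv.reader(lines):
        if len(row) != EXPECTED_FIELDS:
            continue  # skip rows that do not match the assumed layout
        title, year, rating, votes = row[1], row[2], row[5], row[6]
        try:
            if float(rating) <= 1:
                yield "\t".join([votes, rating, title, year])
        except ValueError:
            continue  # header line or non-numeric rating


if __name__ == "__main__":
    # Hadoop streaming style: read lines from stdin, write results to stdout.
    for out in map_rows(sys.stdin):
        print(out)
```

Keeping the parsing in a function that takes any iterable of lines makes it possible to try the mapper locally on a small list of strings before running it on the cluster.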