-1

I'm practicing working with large data sets on AWS using MapReduce and Python.

I have the following code:

    import sys
    import re
    import csv
    import glob
    import string

    #class MyDialect(csv.Dialect):
        #strict = True
        #skipinitialspace = False
        #quoting = QUOTE_MINIMAL
        #delimiter = ','
        #quotechar = '"'

    for line in sys.stdin:
        csv.reader(line, dialect='excel')
        #reader = csv.reader(line, delimiter=',', quoting=csv.QUOTE_ALL,  quotechar='"')
        #line = line.strip()
        #unpacked = line.split(",")
        try:
        #regular expresion 
          num,title,year,length,budget,rating,votes,r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,mpaa,Action,Animation,Comedy,Drama,Documentary,Romance,Short = line.split(",")
          if float(rating) <= 1:
            results = [votes, rating, title, year]
            print("\t".join(results))
        except ValueError:
          pass

I know this isn't perfect: it currently outputs the raw line values. However, whenever I try to use the csv module on the line, I get:

<_csv.reader object at 0x7fc2c184e280>

for all my lines.

I need to read the input one line at a time and write the output to stdout, since this is a single node processing the data and passing it to the reducer. I have most of the bugs worked out, but the script doesn't handle titles that contain a comma. For example, "Blair witch, the" is skipped and never appears in the output. I believe the extra comma shifts the fields, so the budget is read as the rating and the rating as the votes.

Any ideas on how to handle this?

Sean Sullivan

2 Answers

0

csv.reader takes an iterable of lines, typically an open file, and returns a reader object that yields one parsed row (a list of fields) per iteration. Your code calls csv.reader(line) but discards the return value, so the call has no effect on line, and printing the reader shows only its repr, which is the <_csv.reader object at ...> text you saw. Either store the reader in a variable and iterate over it if you want to use this module, or delete that line if you want to parse the file manually. See the documentation for details: https://docs.python.org/2/library/csv.html
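As a rough sketch of the "store the reader and iterate over it" option, the mapper could look something like the following. The function name `low_rated`, the `max_rating` parameter, and the sample rows are made up for illustration; the column positions simply follow the unpacking order in the question (title second, year third, rating sixth, votes seventh), and only the first seven columns are shown in the sample.

```python
import csv


def low_rated(lines, max_rating=1.0):
    """Yield tab-separated 'votes, rating, title, year' for low-rated rows."""
    # csv.reader accepts any iterable of strings (a file, sys.stdin, a list),
    # and it keeps quoted fields such as "Blair Witch, The" in one piece.
    for row in csv.reader(lines):
        try:
            # Positions assumed from the question's unpacking order.
            title, year, rating, votes = row[1], row[2], row[5], row[6]
            if float(rating) <= max_rating:
                yield "\t".join([votes, rating, title, year])
        except (IndexError, ValueError):
            # Skip the header row and any malformed records.
            continue


# Made-up sample input, for illustration only.
sample = [
    'num,title,year,length,budget,rating,votes\n',
    '1,"Blair Witch, The",1999,86,35000,0.9,120\n',
    '2,Some Other Film,2001,90,NA,7.5,300\n',
]
results = list(low_rated(sample))
```

In the actual mapper, the same function would presumably be fed `sys.stdin` instead of the sample list, printing each yielded record. If only a single line is at hand, wrapping it in a list also works: `next(csv.reader([line]))` returns that line's fields.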

Haochen Wu
  • I did try "readline = csv.reader(line)" and then printed readline, but I got an object back. Is there a way to get its contents into a variable? – Sean Sullivan Apr 27 '15 at 17:44
  • csv.reader is meant to process a set of CSV lines, not a single string: given one string, it iterates over it character by character. The object it returns can be iterated, and each iteration gives you a list containing the fields of one line of the original CSV file. Here is an example of how to use it over stdin: http://stackoverflow.com/questions/6556078/how-to-read-a-csv-file-from-a-stream-and-process-each-line-as-it-is-written – Haochen Wu Apr 27 '15 at 18:10
-1

OK, I found a simpler way of doing all this. If you are the admin and have control of the data, use tabs instead of "," as the delimiter; then commas inside fields no longer cause problems. Most database fields don't contain tabs, unless they hold a lot of free text.
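A minimal sketch of that approach, assuming the data has been re-exported with tab delimiters and that no field contains a tab or a newline. The helper name `split_tsv_line` and the sample record are hypothetical.

```python
def split_tsv_line(line):
    """Split one tab-delimited record into its fields."""
    # Strip only the trailing newline so empty trailing fields survive.
    return line.rstrip("\n").split("\t")


# Made-up record for illustration only.
sample = "1\tBlair Witch, The\t1999\t0.9\t120\n"
fields = split_tsv_line(sample)
```

Because the comma in the title is no longer a delimiter, a plain `split` is enough here. If fields might ever contain tabs or quotes, `csv.reader(lines, delimiter="\t")` is the safer choice.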

Know your data, and shape the program around it.

Sean Sullivan