
I found most of the solution to my problem in this thread: "Use Python to select rows with a particular range of values in one column".

But when implementing the code, I get an error that I cannot figure out. I'm trying to extract only the rows for subscribers from the Citi Bike data (info here: http://www.citibikenyc.com/system-data).

So here is the code:

import csv

with open("E:/Dropbox/PPS/CitiBikeData/2014_Data.csv") as input, open("E:/Dropbox/PPS/CitiBikeData/subscribers.csv", "w") as output:
   reader = csv.DictReader(input, dialect="excel-tab")
   fieldnames = reader.fieldnames
   writer_output = csv.DictWriter(output, fieldnames, dialect="excel-tab")
   writer_output.writeheader()
   for row in reader:
       if int(row['gender']) > 0:
          writer_output.writerow(row)

And here is the error I'm getting:

C:\Python34\python.exe E:/Dropbox/PPS/CitiBikeData/csvfilter_2.py
Traceback (most recent call last):
  File "E:/Dropbox/PPS/CitiBikeData/csvfilter_2.py", line 9, in <module>
    if int(row['gender']) > 0:
KeyError: 'gender'

Process finished with exit code 1

I understand what a KeyError is (from reading this https://wiki.python.org/moin/KeyError), but I can't figure out why I'm getting the error, or how to fix it.

Chris
  • Clarification: I used gender in the code because non-subscribers are coded as 0, and I thought using an integer would be better than a string. – Chris Jul 11 '14 at 17:58
  • what does `print row.keys()` say? – timgeb Jul 11 '14 at 18:00
  • dict_keys(['tripduration,"starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"']) – Chris Jul 11 '14 at 18:03
  • The file I downloaded is **not tab delimited**. You have **one key**. Note the single quotes around the key. – Martijn Pieters Jul 11 '14 at 18:03

1 Answer


The data you downloaded is not tab delimited, so you are using the wrong CSV dialect to open it.

Remove the dialect parameter; the default (comma-separated) is fine for this format:
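To see why the wrong dialect produces the KeyError, here is a minimal sketch that parses the same text with both dialects. The three-column sample is made up for illustration (the real file has more columns), and the variable names are not from the question:

```python
import csv
import io

# Hypothetical, shortened sample in the same comma-separated style as the real file.
sample = 'tripduration,"usertype","gender"\n634,"Customer","0"\n'

# Tab dialect: no tabs are found, so the whole header line becomes one field name.
wrong = csv.DictReader(io.StringIO(sample), dialect="excel-tab")

# Default dialect ("excel", comma-separated): the header splits into three names.
right = csv.DictReader(io.StringIO(sample))

print(wrong.fieldnames)
print(right.fieldnames)
```

With the tab dialect there is no 'gender' key at all, only one long key containing the entire header line, which matches the dict_keys output posted in the comments.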

>>> import csv
>>> f = open("/tmp/2013-07 - Citi Bike trip data.csv")
>>> reader = csv.DictReader(f)
>>> next(reader)
{'bikeid': '16950', 'tripduration': '634', 'end station longitude': '-73.98165557', 'stoptime': '2013-07-01 00:10:34', 'end station name': '1 Ave & E 15 St', 'gender': '0', 'start station name': 'E 47 St & 2 Ave', 'start station longitude': '-73.97032517', 'start station id': '164', 'start station latitude': '40.75323098', 'end station id': '504', 'starttime': '2013-07-01 00:00:00', 'end station latitude': '40.73221853', 'birth year': '\\N', 'usertype': 'Customer'}
>>> _['gender']
'0'

Since the gender column is always '0', '1', or '2', in this case you can simply test for not equal to '0' and save yourself an int() call:

writer_output.writerows(row for row in reader if row['gender'] != '0')

This uses a generator expression to pass all filtered rows to DictWriter.writerows() (plural).
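Putting the pieces together, a sketch of the whole filter could look like the following. The helper name filter_rows and the in-memory sample data are assumptions for illustration; only the DictReader/DictWriter calls and the gender test come from the code above:

```python
import csv
import io

def filter_rows(infile, outfile):
    """Copy the header plus every row whose 'gender' field is not '0'."""
    reader = csv.DictReader(infile)  # default dialect: comma-separated
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(row for row in reader if row['gender'] != '0')

# Hypothetical in-memory data standing in for the real CSV files.
sample = io.StringIO(
    'tripduration,"usertype","gender"\n'
    '634,"Customer","0"\n'
    '1547,"Subscriber","1"\n'
)
result = io.StringIO()
filter_rows(sample, result)
print(result.getvalue())
```

For the real files, pass the two objects from the with open(...) statement in the question instead of the StringIO objects. The csv module documentation recommends opening files with newline="" so that line endings are handled by the csv module itself.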

alko
Martijn Pieters