0

I am parsing a CSV file which is quite large. I am only interested in 2 of the columns (the ones with header Ccy1 and Ccy2).

So far my approach is to parse the whole file, and any fields that aren't in the list of "approved" fields get removed from the list.

I tried this on a small sample file with only 3 columns and it worked fine. When I parsed the real file, which has 107 columns, more fields are left than just the "approved" ones.

Why is it not removing all values that are not in the list?

This is my current script:

import csv
data = csv.reader(open('real_sample.csv'))
fields = data.next()
ccy_fields = ['Ccy1', 'Ccy2']

print 'fields: ' + str(fields)
print 'fields to keep: ' + str(ccy_fields)

for item in fields:
    if str(item) not in ccy_fields:
         fields.remove(item)

print "fields: " + str(fields)
mhawke
Joe Smart
  • To start, your indentation in the for loop is off. Not sure if that was just a copy-paste error or not! – cowsock Aug 07 '15 at 13:58
  • 4
    You're removing items from a list you are currently iterating over which is a bad practice. See my answer [here](http://stackoverflow.com/questions/31704066/floats-not-evaluating-as-negative-python/31704332#31704332) for an explanation why. – kylieCatt Aug 07 '15 at 14:00

3 Answers

2

You are modifying the list that is being iterated over by removing items from the same list in the body of the loop. That's the cause of your problem.

I suggest that a list comprehension is a better way to do it:

fields = [item for item in fields if item in ccy_fields]

Also, the csv module returns data of type string for each field, so there is no need to convert with str().

When removing items from a list that is being iterated over, you will typically see that the item immediately following the removed item is skipped. When you tested with only 3 columns, the result would probably look correct if 2 of the columns were in ccy_fields and one was not. When scaling up to 100+ items, some fields eligible for removal would be skipped.

To solve your problem, determine the indices of the columns to be retained, and then use them to filter out the other columns:
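The skipping can be reproduced with a small list. This is a minimal sketch in Python 3 syntax; the header names other than Ccy1 and Ccy2 are made up for illustration:

```python
ccy_fields = ['Ccy1', 'Ccy2']
fields = ['A', 'B', 'Ccy1', 'C', 'D', 'Ccy2']  # made-up header row

for item in fields:
    if item not in ccy_fields:
        # remove() shifts the later items one slot to the left, so the
        # loop's internal index now points past the item that moved up
        fields.remove(item)

print(fields)  # ['B', 'Ccy1', 'D', 'Ccy2']
```

'B' and 'D' survive because each one directly follows a removed item and is never visited by the loop.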

import csv
ccy_fields = ['Ccy1', 'Ccy2']

with open('real_sample.csv') as f:
    reader = csv.reader(f)
    headers = next(reader)
    indices = [i for i,field in enumerate(headers) if field in ccy_fields]
    data = [[row[i] for i in indices] for row in reader]

Following this, data will contain all of the rows with only the desired columns.
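To try the index-based filter without the real file, the same logic can be wrapped in a function and fed an in-memory sample. This is only a sketch in Python 3: the function name select_columns and the sample columns Id and Amount are assumptions for illustration, not part of the original data:

```python
import csv
import io

def select_columns(lines, wanted):
    """Return all data rows, keeping only the columns whose header is in wanted."""
    reader = csv.reader(lines)
    headers = next(reader)  # first row holds the column names
    indices = [i for i, field in enumerate(headers) if field in wanted]
    return [[row[i] for i in indices] for row in reader]

# Hypothetical sample standing in for real_sample.csv
sample = io.StringIO("Id,Ccy1,Amount,Ccy2\n1,EUR,100,USD\n2,GBP,250,JPY\n")
data = select_columns(sample, ['Ccy1', 'Ccy2'])
print(data)  # [['EUR', 'USD'], ['GBP', 'JPY']]
```

The columns come back in the order they appear in the file, not in the order given in wanted.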

mhawke
1

You need to take a copy of the list and iterate over it first, or the iteration will fail with what may be unexpected results.

for item in fields:
    if str(item) not in ccy_fields:
        fields.remove(item)
# replace with
fields = [item for item in fields if str(item) in ccy_fields]

Related question: Remove items from a list while iterating in Python
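For completeness, iterating over a copy while removing from the original could look like the sketch below (Python 3 syntax; the header names other than Ccy1 and Ccy2 are made up). The slice fields[:] creates a shallow copy, so removals from fields do not disturb the loop:

```python
ccy_fields = ['Ccy1', 'Ccy2']
fields = ['A', 'B', 'Ccy1', 'C', 'D', 'Ccy2']  # made-up header row

for item in fields[:]:  # loop over a copy of the list
    if item not in ccy_fields:
        fields.remove(item)  # safe: the copy being iterated is unchanged

print(fields)  # ['Ccy1', 'Ccy2']
```

The list comprehension is usually still preferable, since each remove() call rescans the list from the start.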

luoluo
0

You might want to consider just taking the fields you want directly as you read the file instead of taking all the data and then trimming it. For example:

import csv
wanted = []

with open('real_sample.csv') as f:  # the file is closed when the block ends
    for line in csv.reader(f):  # loop over the data without reading all of it into memory
        if 'Ccy1' in line or 'Ccy2' in line:
            wanted.append(line)  # just keep the data when it matches your criteria
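If the goal is to keep only the Ccy1 and Ccy2 columns of every row while reading, one possible variation uses csv.DictReader inside a generator. This is a sketch in Python 3, not the approach above: the name iter_wanted and the sample data are assumptions for illustration:

```python
import csv
import io

def iter_wanted(lines, wanted):
    """Yield one dict per data row, holding only the wanted columns."""
    for row in csv.DictReader(lines):  # uses the first row as field names
        yield {name: row[name] for name in wanted}

# Hypothetical sample standing in for real_sample.csv
sample = io.StringIO("Id,Ccy1,Amount,Ccy2\n1,EUR,100,USD\n2,GBP,250,JPY\n")
rows = list(iter_wanted(sample, ['Ccy1', 'Ccy2']))
print(rows)  # [{'Ccy1': 'EUR', 'Ccy2': 'USD'}, {'Ccy1': 'GBP', 'Ccy2': 'JPY'}]
```

Because it is a generator, rows are produced one at a time; a name in wanted that is missing from the header raises KeyError.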
isosceleswheel