
I am working with a large (3.5GB) .csv file that I can't process entirely in memory and want to:

  • Read in one row of data
  • Check if a field is equal to a given criterion
  • If true, append that row to a new .csv file. If false, skip to the next row
  • Loop until the end of the original file is reached

The end result would be a separate .csv containing all the rows that matched the criteria.

Steve
  • please provide the code you have already written so we can help you get it right. – TehSphinX Jun 21 '16 at 12:58
  • [Read large text files in Python, line by line without loading it in to memory](http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory) –  Jun 21 '16 at 12:59
  • Welcome friend. To make your question understandable and "attractive" for people that want to help, please consider posting any code you have tried so far, a few representative lines of the *.csv you are trying to parse, your criteria for getting the line, etc. Put some effort into it.. – Ma0 Jun 21 '16 at 13:00
  • The standard way of accessing CSV files in Python will probably work just fine - why don't you give that a go and see if you get stuck anywhere? – jonrsharpe Jun 21 '16 at 13:00
  • `pd.read_csv` has a `chunksize` parameter that allows you to read a file part by part. You can increase the chunk size based on your memory. Reading it into a dataframe and then filtering it, and combining in the end, might be more efficient than reading line by line. If you provide a small sample it would be easier to illustrate. – ayhan Jun 21 '16 at 13:05

3 Answers


I would do it this way:

import pandas as pd

chunksize = 10**6 // 2  # you may want to adjust it ...
for chunk in pd.read_csv(filename, chunksize=chunksize):
    (chunk.query('condition1 & condition2 & ... ')
          .to_csv('output.csv', mode='a', index=False))

The trick is the mode='a' parameter, which instructs pandas to append rows instead of overwriting the CSV file...
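One caveat worth noting: because to_csv writes the column header by default, appending with mode='a' would repeat the header once per chunk. A minimal sketch of one way around that, reusing filename and chunksize from above and passing a boolean to the header parameter:

import pandas as pd

# Sketch: write the header only with the first chunk so repeated appends
# with mode='a' do not duplicate it in output.csv.
first_chunk = True
for chunk in pd.read_csv(filename, chunksize=chunksize):
    (chunk.query('condition1 & condition2 & ... ')
          .to_csv('output.csv', mode='a', index=False, header=first_chunk))
    first_chunk = False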

MaxU - stand with Ukraine

Try the code below. Set an appropriate number of records (whatever your machine allows) as the chunk size.

import pandas as pd

df = pd.DataFrame()  # will collect the matching rows
chunksize = 10 ** 5
for chunk in pd.read_csv(filename, chunksize=chunksize):
    df = df.append(chunk[chunk.field == criterion])
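Note that this keeps every matching row in memory in df; once the loop finishes you would presumably still write the result out, for example (a sketch, with the output file name being an arbitrary choice):

df.to_csv('filtered.csv', index=False)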
user3404344

I found that the following worked for me. Note that this also produced a separate output file for each criteria value in the list.

import csv

# criteria is the column number from the source data to check against the values list
criteria = 4

valueslist = ['criteria1', 'criteria2', 'criteria3']

for item in valueslist:

    with open('sourcedata.csv', 'rb') as inf, open(item + '.csv', 'wb') as outf:
        reader = csv.reader(inf)
        writer = csv.writer(outf)

        # -- Write Headers as first row
        writer.writerow(["column1", "column2", "column3"])
        for row in reader:
            if row[criteria] == item:
                writer.writerow(row)
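A side note on file modes: opening the files as 'rb'/'wb' matches the csv module's expectations on Python 2, which this answer appears to target. On Python 3 the same approach would need text mode with newline='' instead. A minimal sketch of that variant, with the same hypothetical column names and criteria values as above:

import csv

criteria = 4  # column index to check against the values list
valueslist = ['criteria1', 'criteria2', 'criteria3']

for item in valueslist:
    # Python 3: open CSV files in text mode with newline='' instead of 'rb'/'wb'
    with open('sourcedata.csv', newline='') as inf, \
         open(item + '.csv', 'w', newline='') as outf:
        reader = csv.reader(inf)
        writer = csv.writer(outf)
        writer.writerow(["column1", "column2", "column3"])
        for row in reader:
            if row[criteria] == item:
                writer.writerow(row)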
Steve