
I am working with a large (3.5GB) .csv file that I can't process entirely in memory and want to:

  • Read in one row of data
  • Check if a field is equal to a given criterion
  • If true, append that row to a new .csv file. If false, skip to the next row
  • Loop until the end of the original file is reached

The end result would be a separate .csv containing all the rows that matched the criteria.

Steve
  • please provide the code you have already written so we can help you get it right. – TehSphinX Jun 21 '16 at 12:58
  • [Read large text files in Python, line by line without loading it in to memory](http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory) –  Jun 21 '16 at 12:59
  • Welcome friend. To make your question understandable and "attractive" for people that want to help, please consider posting any code you have tried so far, a few representative lines of the *.csv you are trying to parse, your criteria for getting the line, etc. Put some effort into it.. – Ma0 Jun 21 '16 at 13:00
  • The standard way of accessing CSV files in Python will probably work just fine - why don't you give that a go and see if you get stuck anywhere? – jonrsharpe Jun 21 '16 at 13:00
  • `pd.read_csv` has a `chunksize` parameter that allows you to read a file part by part. You can increase the chunk size based on your memory. Reading it into a dataframe and then filtering it, and combining in the end, might be more efficient than reading line by line. If you provide a small sample it would be easier to illustrate. – ayhan Jun 21 '16 at 13:05

3 Answers


I would do it this way:

import pandas as pd

chunksize = 10**6 // 2  # you may want to adjust it ...
for chunk in pd.read_csv(filename, chunksize=chunksize):
    (chunk.query('condition1 & condition2 & ... ')
          .to_csv('output.csv', mode='a', index=False))

The trick is the mode='a' parameter, which instructs pandas to append rows instead of overwriting the CSV file...
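One caveat worth noting: because to_csv writes the column header by default, appending with mode='a' would repeat the header once per chunk. A minimal sketch of one way around that, reusing filename and chunksize from above and passing a boolean to the header parameter:

import pandas as pd

# Sketch: write the header only with the first chunk so repeated appends
# with mode='a' do not duplicate it in output.csv.
first_chunk = True
for chunk in pd.read_csv(filename, chunksize=chunksize):
    (chunk.query('condition1 & condition2 & ... ')
          .to_csv('output.csv', mode='a', index=False, header=first_chunk))
    first_chunk = False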

MaxU - stand with Ukraine

Try the code below. Set an appropriate number of records (whatever your machine allows) as the chunk size.

import pandas as pd

df = pd.DataFrame()  # will collect the matching rows
chunksize = 10 ** 5
for chunk in pd.read_csv(filename, chunksize=chunksize):
    df = df.append(chunk[chunk.field == criterion])
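Note that this keeps every matching row in memory in df; once the loop finishes you would presumably still write the result out, for example (a sketch, with the output file name being an arbitrary choice):

df.to_csv('filtered.csv', index=False)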
user3404344

I found that the following worked for me. Note that this also produced a separate output file for each criteria value in the list.

import csv

# criteria is the column number from the source data to check against the values list
criteria = 4

valueslist = ['criteria1', 'criteria2', 'criteria3']

for item in valueslist:

    with open('sourcedata.csv', 'rb') as inf, open(item + '.csv', 'wb') as outf:
        reader = csv.reader(inf)
        writer = csv.writer(outf)

        # -- Write Headers as first row
        writer.writerow(["column1", "column2", "column3"])
        for row in reader:
            if row[criteria] == item:
                writer.writerow(row)
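A side note on file modes: opening the files as 'rb'/'wb' matches the csv module's expectations on Python 2, which this answer appears to target. On Python 3 the same approach would need text mode with newline='' instead. A minimal sketch of that variant, with the same hypothetical column names and criteria values as above:

import csv

criteria = 4  # column index to check against the values list
valueslist = ['criteria1', 'criteria2', 'criteria3']

for item in valueslist:
    # Python 3: open CSV files in text mode with newline='' instead of 'rb'/'wb'
    with open('sourcedata.csv', newline='') as inf, \
         open(item + '.csv', 'w', newline='') as outf:
        reader = csv.reader(inf)
        writer = csv.writer(outf)
        writer.writerow(["column1", "column2", "column3"])
        for row in reader:
            if row[criteria] == item:
                writer.writerow(row)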
Steve