
I have 100 CSV files that all contain similar information from different time periods. I only need to extract certain information from each time period and don't need to keep all of the data in memory.

Right now I'm using something that looks like:

import pandas as pd
import numpy as np
import glob

average_distance = []
for files in glob.glob("*2013-Jan*"): # Here I'm only looking at one file
    data = pd.read_csv(files)

    average_distance.append(np.mean(data['DISTANCE']))
    rows = data[np.logical_or(data['CANCELLED'] == 1, data['DEP_DEL15'] == 1)]

    del data

My question is: is there some way to use a generator to do this, and if so, would it speed up the process and let me breeze through the 100 CSV files?

I think that this may be on the right track:

def extract_info():
    average_distance = []
    for files in glob.glob("*20*"):
        data = pd.read_csv(files)

        average_distance.append(np.mean(data['DISTANCE']))
        rows = data[np.logical_or(data['CANCELLED'] == 1, data['DEP_DEL15'] == 1)]

        yield rows

cancelled_or_delayed = [month for month in extract_info()]

Thanks!

marc

1 Answer


Pandas' read_csv method is highly optimized: you can tell it which columns to read (usecols) and how many rows to read at a time (chunksize), as described in the docs. So your call could look something like this:

data = pd.read_csv(files, usecols=['Column_X', 'Column_Y'], chunksize=10000)
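With chunksize, read_csv returns an iterator of DataFrames instead of one big frame, so each piece can be processed and discarded. Here is a minimal sketch of accumulating the mean distance that way; the DISTANCE column and the files variable are assumptions carried over from your loop:

total_distance = 0.0
row_count = 0
for chunk in pd.read_csv(files, usecols=['DISTANCE'], chunksize=10000):
    total_distance += chunk['DISTANCE'].sum()   # sum over this chunk only
    row_count += chunk['DISTANCE'].count()      # count() skips missing values
average_distance = total_distance / row_count   # overall mean across all chunks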
Felix Zumstein
  • That looks really convenient, but can I then filter according to a condition in a certain column like I did in my code? – marc Nov 08 '13 at 15:24
  • Filtering rows is not built in, but you can use a generator as you were suggesting; see this answer: http://stackoverflow.com/a/13653490/918626 (a rough sketch of that combination follows below). It would be interesting to see, though, whether you still run into memory problems just by using chunksize and usecols – Felix Zumstein Nov 08 '13 at 15:58
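A minimal sketch of the combination discussed in the comments, using chunksize and usecols together with a generator that filters each chunk. The column names and the file pattern are assumptions taken from the question, not code from the linked answer:

import glob

import pandas as pd

def filtered_chunks(pattern, chunk_size=10000):
    # Yield only the cancelled-or-delayed rows of each chunk of every matching file,
    # so at most one chunk is held in memory at a time.
    for path in glob.glob(pattern):
        reader = pd.read_csv(path,
                             usecols=['DISTANCE', 'CANCELLED', 'DEP_DEL15'],
                             chunksize=chunk_size)
        for chunk in reader:
            yield chunk[(chunk['CANCELLED'] == 1) | (chunk['DEP_DEL15'] == 1)]

# Collect the filtered pieces; only the rows of interest are ever kept around.
cancelled_or_delayed = pd.concat(list(filtered_chunks("*20*")), ignore_index=True)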