I have 100 CSV files that all contain similar information from different time periods. I only need to extract certain information from each time period and don't need to hold all of the data in memory.
Right now I'm using something that looks like:
import pandas as pd
import numpy as np
import glob
average_distance = []
for files in glob.glob("*2013-Jan*"):  # Here I'm only looking at one file
    data = pd.read_csv(files)
    average_distance.append(np.mean(data['DISTANCE']))
    rows = data[np.logical_or(data['CANCELLED'] == 1, data['DEP_DEL15'] == 1)]
    del data
My question is: is there some way to use a generator to do this, and if so, would it speed up the process enough to let me breeze through all 100 CSV files?
I think that this may be on the right track:
def extract_info():
    average_distance = []
    for files in glob.glob("*20*"):
        data = pd.read_csv(files)
        average_distance.append(np.mean(data['DISTANCE']))
        rows = data[np.logical_or(data['CANCELLED'] == 1, data['DEP_DEL15'] == 1)]
        yield rows
cancelled_or_delayed = [month for month in extract_info()]
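For reference, here is a minimal, self-contained sketch of the same idea, assuming the three column names used above are the only ones needed. The function name `iter_monthly_summaries` and the `COLUMNS` constant are hypothetical names made up for illustration. It yields the mean distance together with the filtered rows, so the averages are not lost inside the generator, and it passes `usecols` to `pd.read_csv` so that only the listed columns are parsed. As far as I understand, a generator on its own would not make the reading any faster; it mainly keeps just one file's data in memory at a time.

```python
import glob

import pandas as pd

# Assumption: these are the only columns needed from each file.
COLUMNS = ["DISTANCE", "CANCELLED", "DEP_DEL15"]


def iter_monthly_summaries(pattern):
    """Yield (path, mean distance, cancelled-or-delayed rows), one file at a time."""
    for path in sorted(glob.glob(pattern)):
        # usecols tells read_csv to parse only the listed columns.
        data = pd.read_csv(path, usecols=COLUMNS)
        # Keep rows where the flight was cancelled or departed 15+ minutes late.
        mask = (data["CANCELLED"] == 1) | (data["DEP_DEL15"] == 1)
        yield path, data["DISTANCE"].mean(), data[mask]


def collect(pattern):
    """Consume the generator, keeping only the per-file results."""
    average_distance = []
    cancelled_or_delayed = []
    for _path, mean_distance, rows in iter_monthly_summaries(pattern):
        average_distance.append(mean_distance)
        cancelled_or_delayed.append(rows)
    return average_distance, cancelled_or_delayed
```

Calling `collect("*20*")` would then return the two lists built in the snippets above. Each full DataFrame is dropped as soon as the loop moves on to the next file, so an explicit `del data` should not be needed.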
Thanks!