
I'm dealing with some large CSV files. Basically I have two, one for 2009 and one for 2010. I read both of these in separately using pandas, then append the 2010 file to the end of the 2009 dataframe.

To do this I use the function:

import pandas as pd

def import_data():
    # pd.read_csv accepts a path directly, so explicit open() calls are
    # unnecessary; parse_dates merges column 0 into a parsed Date_Time column
    reader = pd.read_csv(file_A, sep=',', parse_dates={'Date_Time': [0]})
    reader2 = pd.read_csv(file_B, sep=',', parse_dates={'Date_Time': [0]})
    # stack the 2010 rows underneath the 2009 rows
    reader = pd.concat([reader, reader2])

    return reader

I then do some processing, resampling the data. However, all of this takes a long time because of the length of the files.
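The resampling step looks roughly like this (daily means are just a placeholder for whatever frequency and aggregation I actually use):

df = import_data().set_index('Date_Time')  # index by the parsed timestamps
daily = df.resample('D').mean()            # aggregate to daily means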

Is there a way to select only certain rows based on defined inputs, e.g. just the dates 01/10/2009 - 01/02/2010? The dates are all in the first column of the CSV.

I know that this is possible for the columns using usecols within pandas.read_csv
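For columns that looks something like the following (the column names here are just placeholders):

# keep only the named columns while parsing; everything else is skipped
subset = pd.read_csv(file_A, usecols=['Date', 'Value'])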

user2761786

1 Answer


Have you tried making them into iterators?

import pandas as pd
from itertools import chain

def import_data():
    # chunksize makes read_csv return a TextFileReader that yields
    # DataFrame chunks lazily instead of loading the whole file at once.
    # Pass the paths directly: with open(...) would close the files
    # before the chunks are consumed.
    reader = pd.read_csv(file_A, sep=',', parse_dates={'Date_Time': [0]}, chunksize=10000)
    reader2 = pd.read_csv(file_B, sep=',', parse_dates={'Date_Time': [0]}, chunksize=10000)
    return chain(reader, reader2)

# each item from the chain is a DataFrame chunk, not a single row, so
# filter each chunk on its Date_Time column and stitch the results together
desired_range = pd.concat(
    chunk[(chunk['Date_Time'] >= start_date) & (chunk['Date_Time'] <= end_date)]
    for chunk in import_data()
)
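This keeps only one chunk in memory at a time while filtering, and pd.concat stitches the surviving rows back together at the end. One caveat: start_date and end_date should be real timestamps (e.g. pd.Timestamp('2009-10-01')) rather than day-first strings, so the comparison against the parsed Date_Time column behaves predictably.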
CasualDemon