
I'm dealing with some large CSV files. Basically I have two, one for 2009 and one for 2010. I read both of these in separately using pandas, then append the 2010 file to the end of the 2009 dataframe.

To do this I use the function:

import pandas as pd

def import_data():
    # pd.read_csv accepts a path directly, so explicit open() calls are
    # unnecessary; parse_dates merges column 0 into a parsed Date_Time column
    reader = pd.read_csv(file_A, sep=',', parse_dates={'Date_Time': [0]})
    reader2 = pd.read_csv(file_B, sep=',', parse_dates={'Date_Time': [0]})
    # stack the 2010 rows underneath the 2009 rows
    reader = pd.concat([reader, reader2])

    return reader

I then do some processing, resampling the data. However, all of this takes a long time because of the length of the files.
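The resampling step looks roughly like this (daily means are just a placeholder for whatever frequency and aggregation I actually use):

df = import_data().set_index('Date_Time')  # index by the parsed timestamps
daily = df.resample('D').mean()            # aggregate to daily means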

Is there a way to select only certain rows based on defined inputs, e.g. just the dates 01/10/2009 - 01/02/2010? The dates are all in the first column of the CSV.

I know that this is possible for the columns using usecols within pandas.read_csv
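For columns that looks something like the following (the column names here are just placeholders):

# keep only the named columns while parsing; everything else is skipped
subset = pd.read_csv(file_A, usecols=['Date', 'Value'])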

user2761786

1 Answer


Have you tried making them into iterators?

import pandas as pd
from itertools import chain

def import_data():
    # chunksize makes read_csv return a TextFileReader that yields
    # DataFrame chunks lazily instead of loading the whole file at once.
    # Pass the paths directly: with open(...) would close the files
    # before the chunks are consumed.
    reader = pd.read_csv(file_A, sep=',', parse_dates={'Date_Time': [0]}, chunksize=10000)
    reader2 = pd.read_csv(file_B, sep=',', parse_dates={'Date_Time': [0]}, chunksize=10000)
    return chain(reader, reader2)

# each item from the chain is a DataFrame chunk, not a single row, so
# filter each chunk on its Date_Time column and stitch the results together
desired_range = pd.concat(
    chunk[(chunk['Date_Time'] >= start_date) & (chunk['Date_Time'] <= end_date)]
    for chunk in import_data()
)
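This keeps only one chunk in memory at a time while filtering, and pd.concat stitches the surviving rows back together at the end. One caveat: start_date and end_date should be real timestamps (e.g. pd.Timestamp('2009-10-01')) rather than day-first strings, so the comparison against the parsed Date_Time column behaves predictably.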
CasualDemon