I have a large CSV file (>1M lines), with two columns of dates in European format (DD/MM/YY), that I am processing with pandas. Parsing the dates turns out to be slow. Here are the experiments:
- `read_csv(..., parse_dates=None)` (the default) treats the date columns as strings, as expected, and takes < 1 second
- `read_csv(..., parse_dates=my_columns, dayfirst=True)` and `read_csv(..., parse_dates=my_columns, dayfirst=True, infer_datetime_format=True)` (the docs suggest the latter may be faster) give the same result: both take ~5 minutes
- using a custom date parser, as suggested here for unusual formats:

      read_csv(..., parse_dates=my_columns,
               date_parser=lambda x: pd.datetime.strptime(x, '%d/%m/%y'))

  takes 25 seconds
- not parsing the dates with `read_csv` but using `to_datetime` afterwards:

      my_frame = read_csv(..., parse_dates=None)
      for column in my_columns:
          my_frame[column] = pd.to_datetime(my_frame[column], format='%d/%m/%y')

  takes 7 seconds
- as a reference: opening a CSV as saved by pandas (with `DataFrame.to_csv`, once all the dates have been properly read) takes 2 seconds
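The fastest variant above (read as strings, convert afterwards with an explicit format) can be sketched as a minimal, self-contained example; the sample data and column names here are made up:

```python
import io
import pandas as pd

# Toy CSV standing in for the real file; "start"/"end" are placeholder columns.
csv_data = io.StringIO(
    "event,start,end\n"
    "a,01/02/15,03/04/15\n"
    "b,25/12/15,31/12/15\n"
)

my_columns = ["start", "end"]

# Read the date columns as plain strings first (fast C parser, no guessing)...
my_frame = pd.read_csv(csv_data)

# ...then convert with an explicit format, which skips per-value inference.
for column in my_columns:
    my_frame[column] = pd.to_datetime(my_frame[column], format="%d/%m/%y")
```

Passing `format=` is what makes this fast: pandas can vectorize the conversion instead of guessing the layout of each value.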
As you can see, there are large discrepancies, which I find fairly surprising given that this is a standard date format. So what is the correct way to read European dates with pandas?
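For completeness, the gap can be reproduced on synthetic data with a small, self-contained benchmark. The row count is scaled down from the real file, and the timings are illustrative only: they depend on the pandas version (newer releases infer the format up front and may close the gap).

```python
import io
import time
import pandas as pd

# Build a CSV of unambiguous European-style dates (~100k rows).
rows = "\n".join(f"{d:02d}/{m:02d}/15" for m in range(1, 13) for d in range(1, 29))
csv_text = "date\n" + "\n".join([rows] * 300)

# Variant 1: let read_csv parse the dates with dayfirst=True.
t0 = time.perf_counter()
pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], dayfirst=True)
t1 = time.perf_counter()

# Variant 2: read as strings, then convert with an explicit format.
frame = pd.read_csv(io.StringIO(csv_text))
frame["date"] = pd.to_datetime(frame["date"], format="%d/%m/%y")
t2 = time.perf_counter()

print(f"dayfirst=True:   {t1 - t0:.2f}s")
print(f"explicit format: {t2 - t1:.2f}s")
```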