pandas read_csv slow with European date format

Question

I have a large csv file (>1M lines), with two columns with dates in European format (DD/MM/YY), that I am processing with pandas. Pandas is slow to read such files. Here are the experiments:

read_csv(...,parse_dates=None) (default) treats the date columns as string (as expected) and takes < 1 seconds
read_csv(..., parse_dates=my_columns, dayfirst=True) and read_csv(..., parse_dates=my_columns, dayfirst=True, infer_datetime_format=True) (docs suggested the latter may be faster) both give the same results: it takes ~5 minutes
using a custom date parser read_csv(..., parse_dates=my_columns, date_parser=lambda x: pd.datetime.strptime(x, '%d/%m/%y')) as suggested here for weird formatting takes 25 seconds

not parsing the dates with read_csv but using to_datetime after

read_csv(..., parse_dates=None)
for column in my_columns:
  my_frame[column]=pd.to_datetime(my_frame[column], format='%d/%m/%y')

takes 7 seconds

as reference: opening a csv as saved by pandas (with DataFrame.to_csv once all the dates have been properly read) takes 2 seconds

As you can see there are a lot of discrepancies, which, I think, is fairly surprising given that it is a standard date format. So what is the correct way to read European dates?

Sure but there has to be a better way for such a standard date format — Phik, Jul 24 '17 at 17:57
The advantage of `to_datetime` is that you specify the format. It's also very slow if you don't specify the format. There is an open issue about passing the date format to read_csv [here](https://github.com/pandas-dev/pandas/issues/2586). — ayhan, Jul 24 '17 at 19:39

pandas read_csv slow with European date format

0 Answers0