2

I have a large csv file (>1M lines), with two columns with dates in European format (DD/MM/YY), that I am processing with pandas. Pandas is slow to read such files. Here are the experiments:

  • read_csv(...,parse_dates=None) (default) treats the date columns as string (as expected) and takes < 1 seconds
  • read_csv(..., parse_dates=my_columns, dayfirst=True) and read_csv(..., parse_dates=my_columns, dayfirst=True, infer_datetime_format=True) (docs suggested the latter may be faster) both give the same results: it takes ~5 minutes
  • using a custom date parser read_csv(..., parse_dates=my_columns, date_parser=lambda x: pd.datetime.strptime(x, '%d/%m/%y')) as suggested here for weird formatting takes 25 seconds
  • not parsing the dates with read_csv but using to_datetime after

    read_csv(..., parse_dates=None)
    for column in my_columns:
      my_frame[column]=pd.to_datetime(my_frame[column], format='%d/%m/%y')
    

    takes 7 seconds

  • as reference: opening a csv as saved by pandas (with DataFrame.to_csv once all the dates have been properly read) takes 2 seconds

As you can see there are a lot of discrepancies, which, I think, is fairly surprising given that it is a standard date format. So what is the correct way to read European dates?

Phik
  • 414
  • 3
  • 15
  • 1
    Does it work? Yes? You have your answer :) – cs95 Jul 24 '17 at 17:43
  • Sure but there has to be a better way for such a standard date format – Phik Jul 24 '17 at 17:57
  • 1
    The advantage of `to_datetime` is that you specify the format. It's also very slow if you don't specify the format. There is an open issue about passing the date format to read_csv [here](https://github.com/pandas-dev/pandas/issues/2586). – ayhan Jul 24 '17 at 19:39
  • Thank you @ayhan! – Phik Jul 25 '17 at 15:00

0 Answers0