
I have a .csv file in the following format:

timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
...

When I read it via pd.read_csv and convert the time strings to datetime using pd.to_datetime, performance drops dramatically. Here is a minimal example:

import re
import pandas as pd

d = '2014-12-12 01:02:03.0030'
c = re.sub('-', '/', d)

%timeit pd.to_datetime(d)
%timeit pd.to_datetime(c)
%timeit pd.to_datetime(c, format="%Y/%m/%d %H:%M:%S.%f")

and the timings are:

10000 loops, best of 3: 62.4 µs per loop
10000 loops, best of 3: 181 µs per loop
10000 loops, best of 3: 82.9 µs per loop

So, how can I improve the performance of pd.to_datetime when reading dates from a CSV file?

liubenyuan

3 Answers


This is because pandas falls back to dateutil.parser.parse for parsing the strings when they are not in a standard ISO format and no format string is supplied (dateutil is much more flexible, but also slower).

As you showed above, you can improve the performance by supplying a format string to to_datetime. Another option is to pass infer_datetime_format=True.


Apparently, infer_datetime_format cannot infer the format when there are microseconds. With an example without them, you can see a large speed-up:

In [28]: d = '2014-12-24 01:02:03'

In [29]: c = re.sub('-', '/', d)

In [30]: s_c = pd.Series([c]*10000)

In [31]: %timeit pd.to_datetime(s_c)
1 loops, best of 3: 1.14 s per loop

In [32]: %timeit pd.to_datetime(s_c, infer_datetime_format=True)
10 loops, best of 3: 105 ms per loop

In [33]: %timeit pd.to_datetime(s_c, format="%Y/%m/%d %H:%M:%S")
10 loops, best of 3: 99.5 ms per loop
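
For completeness, here is how the format string could be wired into the actual CSV read from the question. A minimal sketch, assuming the file layout shown there (the filename is hypothetical; skipinitialspace absorbs the space after the comma in the header, and the trailing :%f matches the colon before the fractional seconds):

import pandas as pd

df = pd.read_csv('data.csv', skipinitialspace=True)  # hypothetical filename
df['timestmp'] = pd.to_datetime(df['timestmp'], format='%Y/%m/%d %H:%M:%S:%f')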
joris
  • `infer_datetime_format=True` is slower; the result is `1000 loops, best of 3: 908 µs per loop`. Is there any faster way of converting non-standard time strings to datetime? – liubenyuan Aug 16 '15 at 12:53
  • 1
    That is because you are timing on a single string. If you do it for a larger series, you will see the speedup. – joris Aug 16 '15 at 13:24
  • 2
    And as I said, there *is* a faster way, the one you used yourself in the question: provide a `format` string. – joris Aug 16 '15 at 13:30
  • Thanks! One more question: if I manually do `e = re.sub('/', '-', c)` and then apply `to_datetime(e)`, will that improve the performance? – liubenyuan Aug 16 '15 at 15:24
  • I don't think so. If you can do that, it means you know for sure what the format is, and then it is better to supply a `format` string. Supplying an ISO-formatted datestring is a bit faster to parse, but that will not outweigh the time to replace the `/` with `-`. – joris Aug 16 '15 at 16:17
  • On my machine, using the format string yielded a 20x speed-up in date parsing for a 50k row dataframe (4000 ms => 180ms). Definitely worth using the format string if you know it ahead of time. – Swaraj Feb 14 '16 at 02:07

This question has already been sufficiently answered, but I wanted to add the results of some tests I ran while optimizing my own code.

I was getting this format from an API: "Wed Feb 08 17:58:56 +0000 2017".

Using the default pd.to_datetime(SERIES) with an implicit conversion, it was taking over an hour to process roughly 20 million rows (depending on how much free memory I had).

That said, I tested three different conversions:

# explicit conversion of the essential information only -- parse dt str: concat
def format_datetime_1(dt_series):

    def get_split_date(strdt):
        # 'Wed Feb 08 17:58:56 +0000 2017' -> 'Feb 08 2017 17:58:56'
        split_date = strdt.split()
        return split_date[1] + ' ' + split_date[2] + ' ' + split_date[5] + ' ' + split_date[3]

    return pd.to_datetime(dt_series.apply(get_split_date), format='%b %d %Y %H:%M:%S')

# explicit conversion to the locale's "appropriate date and time representation" (%c) -- parse dt str: del then join
def format_datetime_2(dt_series):

    def get_split_date(strdt):
        # drop the '+0000' offset: 'Wed Feb 08 17:58:56 +0000 2017' -> 'Wed Feb 08 17:58:56 2017'
        split_date = strdt.split()
        del split_date[4]
        return ' '.join(split_date)

    return pd.to_datetime(dt_series.apply(get_split_date), format='%c')

# explicit conversion to the locale's "appropriate date and time representation" (%c) -- parse dt str: concat
def format_datetime_3(dt_series):

    def get_split_date(strdt):
        # same result as format_datetime_2, built by concatenation
        split_date = strdt.split()
        return split_date[0] + ' ' + split_date[1] + ' ' + split_date[2] + ' ' + split_date[3] + ' ' + split_date[5]

    return pd.to_datetime(dt_series.apply(get_split_date), format='%c')

# implicit conversion (the baseline)
def format_datetime_baseline(dt_series):
    return pd.to_datetime(dt_series)

These were the results:

# sample of 250k rows
dt_series_sample = df['created_at'][:250000]

%timeit format_datetime_1(dt_series_sample)        # best of 3: 1.56 s per loop
%timeit format_datetime_2(dt_series_sample)        # best of 3: 2.09 s per loop
%timeit format_datetime_3(dt_series_sample)        # best of 3: 1.72 s per loop
%timeit format_datetime_baseline(dt_series_sample) # best of 3: 1min 9s per loop

The first test results in an impressive 97.7% runtime reduction!

Somewhat surprisingly, even the locale's "appropriate representation" (%c) takes longer, probably because it is still semi-implicit.

Conclusion: the more explicit you are, the faster it will run.
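
As a side note, the same string surgery can also be expressed with pandas' vectorized .str accessor instead of .apply. A sketch under the same assumptions as format_datetime_1 above (the function name is hypothetical, and it was not part of the timings):

def format_datetime_vectorized(dt_series):
    # 'Wed Feb 08 17:58:56 +0000 2017' -> 'Feb 08 2017 17:58:56'
    parts = dt_series.str.split()
    rebuilt = parts.str[1] + ' ' + parts.str[2] + ' ' + parts.str[5] + ' ' + parts.str[3]
    return pd.to_datetime(rebuilt, format='%b %d %Y %H:%M:%S')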

Zach
  • This benchmark is pretty useless without considering `pd.to_datetime(dt_series, infer_datetime_format=True)`. If your strings are all in the same format, this can be a huge speedup over the default `infer_datetime_format=False`. – C8H10N4O2 Jan 22 '18 at 18:07
  • 1
    Well, I wouldn't say it's useless, but you're right that would've been a good addition (I actually thought that was the default, which would've been covered in the baseline function). Maybe you can add that as an answer? – Zach Jan 23 '18 at 19:35

Often I am unable to specify a standard date format ahead of time because I simply do not know how each client will choose to submit it. The dates are unpredictably formatted and often missing.

In these cases, instead of using pd.to_datetime, I have found it more efficient to write my own wrapper around dateutil.parser.parse:

import pandas as pd
from dateutil.parser import parse
import numpy as np

def parseDateStr(s):
    # Return NaT for empty strings and for anything dateutil cannot parse.
    if s == '':
        return np.datetime64('NaT')
    try:
        return np.datetime64(parse(s))
    except ValueError:
        return np.datetime64('NaT')

# Example data:
someSeries=pd.Series(  ['NotADate','','1-APR-16']*10000 )

# Compare times:
%timeit pd.to_datetime(someSeries, errors='coerce') #1 loop, best of 3: 1.78 s per loop
%timeit someSeries.apply(parseDateStr)              #1 loop, best of 3: 904 ms per loop

# The approaches return identical results:
someSeries.apply(parseDateStr).equals(pd.to_datetime(someSeries, errors='coerce')) # True

In this case the runtime is cut in half, but YMMV.
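
If the same strings recur many times (the synthetic series above repeats three values ten thousand times), memoizing the wrapper may help further. A sketch using functools.lru_cache (the cached name is hypothetical, and this was not benchmarked here):

from functools import lru_cache

@lru_cache(maxsize=None)
def parseDateStrCached(s):
    # parse each distinct string once; repeats are served from the cache
    return parseDateStr(s)

someSeries.apply(parseDateStrCached)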

C8H10N4O2