
I have a dataframe with two columns, each one holding a set of dates. I want to compute the difference between the dates and return the number of days. However, the process (shown below) is very slow. Does anyone know how to speed it up? This code runs over a big file, so speed matters.

import pandas as pd
from datetime import datetime

dfx = pd.DataFrame([[datetime(2014, 1, 1), datetime(2014, 1, 10)],
                    [datetime(2014, 1, 1), datetime(2015, 1, 10)],
                    [datetime(2013, 1, 1), datetime(2014, 1, 12)]],
                   columns=['x', 'y'])


dfx['diffx'] = dfx['y']-dfx['x']
dfx['diff'] = dfx['diffx'].apply(lambda x: x.days)
dfx

Final goal: the same dataframe with an integer `diff` column giving the number of days between `y` and `x`.
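The linked screenshot is no longer available; as a rough reconstruction (day counts computed from the example dates above), the result should look like this:

           x          y    diffx  diff
0 2014-01-01 2014-01-10   9 days     9
1 2014-01-01 2015-01-10 374 days   374
2 2013-01-01 2014-01-12 376 days   376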

– nunodsousa
  • Surely there's a better way than apply for this – Mad Physicist Sep 11 '18 at 11:01
  • Have you tried cast to timestamp, do the difference and then convert to days? – Jorge Lavín Sep 11 '18 at 11:03
  • What do you mean it is _too slow_? How much time does it take and how much time do you expect it to take? –  Sep 11 '18 at 11:03
  • @JorgeLavín, Pandas already does this.. `df.dtypes` will show you `np.datetime64`. – jpp Sep 11 '18 at 11:03
  • @jpp not sure about the dupe since the OP already has working code and is looking purely for a speed improvement. Nothing is benchmarked in that code. – roganjosh Sep 11 '18 at 11:04
  • 1
    @roganjosh, Sure, I've reopened. Just hope now answers here are (a) actually faster than that dupe, (b) explain clearly why they're faster. I suspect one or the other won't be fulfilled! – jpp Sep 11 '18 at 11:05
  • 1
    @roganjosh you have a point. However, the OP should say how exactly it is too slow for them. Maybe the speed they're getting is already very good performance, there is not much we can do about it... Speed is always relative to a measure. –  Sep 11 '18 at 11:06
  • @jpp I just put some numbers under the answer. `dt.days` is losing :/ – roganjosh Sep 11 '18 at 11:06
  • @jpp there could be value in a standalone question specifically answered based on benchmarks if all else fails? – roganjosh Sep 11 '18 at 11:11
  • Which line is slow, `dfx['diffx'] = dfx['y']-dfx['x']` or `dfx['diff'] = dfx['diffx'].apply(lambda x: x.days)`? Could it be that pandas is loading some timedelta libraries the first time you call ` dfx['y']-dfx['x']`? – Dan Sep 11 '18 at 11:12

1 Answer


You may find a massive speed-up by dropping down to NumPy, bypassing the overhead associated with pd.Series objects.

See also pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?

# Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3
import numpy as np
import pandas as pd

def days_lambda(dfx):
    # Original approach: Python-level apply over each Timedelta.
    return (dfx['y'] - dfx['x']).apply(lambda x: x.days)

def days_pd(dfx):
    # Vectorised pandas accessor on the timedelta column.
    return (dfx['y'] - dfx['x']).dt.days

def days_np(dfx):
    # Drop to the underlying NumPy arrays and convert the timedelta64[ns]
    # result to (float) days in a single vectorised division.
    return (dfx['y'].values - dfx['x'].values) / np.timedelta64(1, 'D')

# check results are identical
assert (days_lambda(dfx).values == days_pd(dfx).values).all()
assert (days_lambda(dfx).values == days_np(dfx)).all()

dfx = pd.concat([dfx] * 100000)  # scale up the example frame for timing

%timeit days_lambda(dfx)  # 5.02 s per loop
%timeit days_pd(dfx)      # 5.6 s per loop
%timeit days_np(dfx)      # 4.72 ms per loop
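As a side note (not part of the original benchmark): days_np returns floating-point days. If integer days are needed, a minimal sketch using plain NumPy casts is:

def days_np_int(dfx):
    # Cast the timedelta64[ns] differences down to whole days, then to plain ints.
    return (dfx['y'].values - dfx['x'].values).astype('timedelta64[D]').astype(int)

This stays entirely in NumPy, so its timing should be in the same ballpark as days_np.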
– jpp