
I have a dataframe with two columns, each one holding a set of dates. I want to compute the difference between the dates and return the number of days. However, the process (shown below) is very slow. Does anyone know how to speed it up? This code runs over a big file, so speed matters.

import pandas as pd
from datetime import datetime

dfx = pd.DataFrame([[datetime(2014, 1, 1), datetime(2014, 1, 10)],
                    [datetime(2014, 1, 1), datetime(2015, 1, 10)],
                    [datetime(2013, 1, 1), datetime(2014, 1, 12)]],
                   columns=['x', 'y'])


dfx['diffx'] = dfx['y']-dfx['x']
dfx['diff'] = dfx['diffx'].apply(lambda x: x.days)
dfx

Final goal: the same dataframe with an integer `diff` column giving the number of days between `y` and `x`.
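The linked screenshot is no longer available; as a rough reconstruction (day counts computed from the example dates above), the result should look like this:

           x          y    diffx  diff
0 2014-01-01 2014-01-10   9 days     9
1 2014-01-01 2015-01-10 374 days   374
2 2013-01-01 2014-01-12 376 days   376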

– nunodsousa
  • Surely there's a better way than apply for this – Mad Physicist Sep 11 '18 at 11:01
  • Have you tried cast to timestamp, do the difference and then convert to days? – Jorge Lavín Sep 11 '18 at 11:03
  • What do you mean it is _too slow_? How much time does it take and how much time do you expect it to take? –  Sep 11 '18 at 11:03
  • @JorgeLavín, Pandas already does this.. `df.dtypes` will show you `np.datetime64`. – jpp Sep 11 '18 at 11:03
  • @jpp not sure about the dupe since the OP already has working code and is looking purely for a speed improvement. Nothing is benchmarked in that code. – roganjosh Sep 11 '18 at 11:04
  • 1
    @roganjosh, Sure, I've reopened. Just hope now answers here are (a) actually faster than that dupe, (b) explain clearly why they're faster. I suspect one or the other won't be fulfilled! – jpp Sep 11 '18 at 11:05
  • 1
    @roganjosh you have a point. However, the OP should say how exactly it is too slow for them. Maybe the speed they're getting is already very good performance, there is not much we can do about it... Speed is always relative to a measure. –  Sep 11 '18 at 11:06
  • @jpp I just put some numbers under the answer. `dt.days` is losing :/ – roganjosh Sep 11 '18 at 11:06
  • @jpp there could be value in a standalone question specifically answered based on benchmarks if all else fails? – roganjosh Sep 11 '18 at 11:11
  • Which line is slow, `dfx['diffx'] = dfx['y']-dfx['x']` or `dfx['diff'] = dfx['diffx'].apply(lambda x: x.days)`? Could it be that pandas is loading some timedelta libraries the first time you call ` dfx['y']-dfx['x']`? – Dan Sep 11 '18 at 11:12

1 Answer


You may find a massive speed-up by dropping down to NumPy, bypassing the overhead associated with pd.Series objects.

See also pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?

# Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3
import numpy as np
import pandas as pd

def days_lambda(dfx):
    # Original approach: Python-level apply over each Timedelta.
    return (dfx['y'] - dfx['x']).apply(lambda x: x.days)

def days_pd(dfx):
    # Vectorised pandas accessor on the timedelta column.
    return (dfx['y'] - dfx['x']).dt.days

def days_np(dfx):
    # Drop to the underlying NumPy arrays and convert the timedelta64[ns]
    # result to (float) days in a single vectorised division.
    return (dfx['y'].values - dfx['x'].values) / np.timedelta64(1, 'D')

# check results are identical
assert (days_lambda(dfx).values == days_pd(dfx).values).all()
assert (days_lambda(dfx).values == days_np(dfx)).all()

dfx = pd.concat([dfx] * 100000)  # scale up the example frame for timing

%timeit days_lambda(dfx)  # 5.02 s per loop
%timeit days_pd(dfx)      # 5.6 s per loop
%timeit days_np(dfx)      # 4.72 ms per loop
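As a side note (not part of the original benchmark): days_np returns floating-point days. If integer days are needed, a minimal sketch using plain NumPy casts is:

def days_np_int(dfx):
    # Cast the timedelta64[ns] differences down to whole days, then to plain ints.
    return (dfx['y'].values - dfx['x'].values).astype('timedelta64[D]').astype(int)

This stays entirely in NumPy, so its timing should be in the same ballpark as days_np.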
– jpp