
I am trying to do a df.apply on date objects but it's far too slow!

My %prun output gives:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1999   14.563    0.007   14.563    0.007 {pandas.tslib.array_to_timedelta64}
 13998    0.103    0.000   15.221    0.001 series.py:126(__init__)
  9999    0.093    0.000    0.093    0.000 {method 'reduce' of 'numpy.ufunc' objects}
272012    0.093    0.000    0.125    0.000 {isinstance}
  5997    0.089    0.000    0.196    0.000 common.py:199(_isnull_ndarraylike)

So basically it's 14 seconds for a 2,000-element array. My actual array has more than 100,000 elements, which translates to a run time of more than 15 minutes, or maybe worse.

It's stupid of pandas to call this function "pandas.tslib.array_to_timedelta64", which is the bottleneck! I really don't understand why this function call is necessary: both operands of the subtraction have the same data type. I explicitly converted them beforehand using the pd.to_datetime() method, and no, that conversion time is not included in this measurement.

So all in all, you can understand my frustration at this pathetic code!

The actual code looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(bet_endtimes)

def testing():
    close_indices = df.apply(lambda x: np.argmin(np.abs(currentdata['date'] - x[0])), axis=1)
    print(close_indices)

%prun testing()
  • That output looks like yet another dreary pointless knockoff of [`gprof`](http://archive.today/9r927). If you want to know why python takes time, use [*this method.*](http://stackoverflow.com/a/4299378/23771) – Mike Dunlavey Jul 09 '14 at 12:55
  • I don't understand. Can you explain a bit? – coffeequant Jul 09 '14 at 13:03
  • The only remotely useful column of data is the second one, `tottime`, and that just looks like "self time". What you need to know is not self time, but inclusive time, and not as an absolute time, but as a percent, and not just of functions, but of the sites where they are called. Also, the number of samples does not need to be large. If your program takes 15 seconds when it should take less than one second, then the odds are 14:1 that a *single stack sample* will show you why it's taking that time. – Mike Dunlavey Jul 09 '14 at 13:10
  • Strangely, this works better: `np.searchsorted(pd.to_datetime(currentdata['date']),bet_endtimes,side='right')-1` takes less than 10 seconds for the whole 100,000-element array (a sketch of this follows these comments)! – coffeequant Jul 09 '14 at 13:24
  • Can't argue with success! Good luck. – Mike Dunlavey Jul 09 '14 at 13:34
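
A minimal sketch of the searchsorted approach from the last comment, using made-up example data; only the currentdata['date'] column and the bet_endtimes name come from the question. Note that side='right' minus 1 returns the last date at or before each end time, which is not always the same as the absolute nearest date:

import numpy as np
import pandas as pd

# hypothetical stand-ins for the question's variables
currentdata = pd.DataFrame({'date': pd.date_range('2013-01-01', periods=100000, freq='s')})
bet_endtimes = pd.to_datetime(['2013-01-01 00:10:00', '2013-01-01 05:00:00'])

# 'date' must be sorted; for each end time, take the position of the last
# row whose date is <= that end time
close_indices = np.searchsorted(pd.to_datetime(currentdata['date']), bet_endtimes, side='right') - 1
print(close_indices)  # [  600 18000]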

1 Answer


I'd recommend consulting the documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-deltas. It's also very helpful to include sample data so I don't have to guess what you are doing.

Using apply is always the last operation to try. Vectorized methods are much faster.

In [55]: pd.set_option('max_rows',10)

In [56]: df = DataFrame(dict(A = pd.date_range('20130101',periods=100000, freq='s')))

In [57]: df
Out[57]: 
                        A
0     2013-01-01 00:00:00
1     2013-01-01 00:00:01
2     2013-01-01 00:00:02
3     2013-01-01 00:00:03
4     2013-01-01 00:00:04
...                   ...
99995 2013-01-02 03:46:35
99996 2013-01-02 03:46:36
99997 2013-01-02 03:46:37
99998 2013-01-02 03:46:38
99999 2013-01-02 03:46:39

[100000 rows x 1 columns]

In [58]:  (df['A']-df.loc[10,'A']).abs()
Out[58]: 
0   00:00:10
1   00:00:09
2   00:00:08
...
99997   1 days, 03:46:27
99998   1 days, 03:46:28
99999   1 days, 03:46:29
Name: A, Length: 100000, dtype: timedelta64[ns]

In [59]: %timeit  (df['A']-df.loc[10,'A']).abs()
1000 loops, best of 3: 1.47 ms per loop
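
A minimal sketch, reusing the df built above, of how the vectorized difference can answer the original "index of the nearest date" question without apply; the target timestamp df.loc[10, 'A'] is an arbitrary choice here:

target = df.loc[10, 'A']                          # any single Timestamp
nearest_idx = (df['A'] - target).abs().idxmin()   # vectorized subtraction, then index of the smallest gap
print(nearest_idx)  # 10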

> It's stupid of pandas to call this function "pandas.tslib.array_to_timedelta64" which is the bottleneck?

When you contribute to pandas, you can name methods.
