9

This is quite similar to the question here but I'm wondering if there is a clean way in pandas to make a business day aware TimedeltaIndex? Ultimately I am trying to get the number of business days (no holiday calendar) between a DatetimeIndex and a Timestamp. As per the referenced question, something like this works

import pandas as pd
import numpy as np
drg = pd.date_range('2015-07-31', '2015-08-05', freq='B')
A = [d.date() for d in drg]
B = pd.Timestamp('2015-08-05', 'B').date()
np.busday_count(A, B)

which gives

array([3, 2, 1, 0], dtype=int64)

but this seems a bit kludgy. If I try something like

drg - pd.Timestamp('2015-08-05', 'B')

I get a TimedeltaIndex but the business day frequency is dropped

TimedeltaIndex(['-5 days', '-2 days', '-1 days', '0 days'], dtype='timedelta64[ns]', freq=None)

Just wondering if there is a more elegant way to go about this.

Community
  • 1
  • 1
mgilbert
  • 3,495
  • 4
  • 22
  • 39

1 Answers1

12

TimedeltaIndexes represent fixed spans of time. They can be added to Pandas Timestamps to increment them by fixed amounts. Their behavior is never dependent on whether or not the Timestamp is a business day. The TimedeltaIndex itself is never business-day aware.

Since the ultimate goal is to count the number of days between a DatetimeIndex and a Timestamp, I would look in another direction than conversion to TimedeltaIndex.


Unfortunately, date calculations are rather complicated, and a number of data structures have sprung up to deal with them -- Python datetime.dates, datetime.datetimes, Pandas Timestamps, NumPy datetime64s.

They each have their strengths, but no one of them is good for all purposes. To take advantage of their strengths, it is sometime necessary to convert between these types.

To use np.busday_count you need to convert the DatetimeIndex and Timestamp to some type np.busday_count understands. What you call kludginess is the code required to convert types. There is no way around that assuming we want to use np.busday_count -- and I know of no better tool for this job than np.busday_count.

So, although I don't think there is a more succinct way to count business days than than the method you propose, there is a far more performant way: Convert to datetime64[D]'s instead of Python datetime.date objects:

import pandas as pd
import numpy as np
drg = pd.date_range('2000-07-31', '2015-08-05', freq='B')
timestamp = pd.Timestamp('2015-08-05', 'B')

def using_astype(drg, timestamp):
    A = drg.values.astype('<M8[D]')
    B = timestamp.asm8.astype('<M8[D]')
    return np.busday_count(A, B)

def using_datetimes(drg, timestamp):
    A = [d.date() for d in drg]
    B = pd.Timestamp('2015-08-05', 'B').date()
    return np.busday_count(A, B)

This is over 100x faster for the example above (where len(drg) is close to 4000):

In [88]: %timeit using_astype(drg, timestamp)
10000 loops, best of 3: 95.4 µs per loop

In [89]: %timeit using_datetimes(drg, timestamp)
100 loops, best of 3: 10.3 ms per loop

np.busday_count converts its input to datetime64[D]s anyway, so avoiding this extra conversion to and from datetime.dates is far more efficient.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Thanks, although my application is not that performance intensive my solution was definitely on the slow enough to get me annoyed side so this speed up is much appreciated. Good to know about what's going on under the hood as well. – mgilbert Aug 06 '15 at 12:36
  • I'm a little confused here. If we're just trying to count the number of business days, doesn't `pd.date_range(start, end, freq='B').size` give it to us immediately? No need to use numpy at all. – John Tyree Feb 03 '16 at 04:42
  • 1
    @JohnTyree: The problem is to find the number of business days between a DatetimeIndex and a Timestamp. `pd.date_range(start, end, freq='B').size` finds the number of business days between two dates. You would need to call that in a loop for every day in the DatetimeIndex. If you time `[pd.date_range(drg[i], timestamp, freq='B').size - 1 for i in range(len(drg))]` you'll find it is extremely slow. – unutbu Feb 03 '16 at 09:48
  • Oh I see. I missed the vectorization part. Fair enough. – John Tyree Feb 03 '16 at 21:03
  • i used the date in string format for this and worked just as fast ```[ np.busday_count( str(row.approvedAt).split()[0], str(row.refundedAt).split()[0] ) for row in df.itertuples() ]``` I was trying to find the mean business days between two events – Raveen Beemsingh Sep 26 '19 at 19:37