0

I am trying to add months and days based on conditions and I am hung up on DateOffsets passing in as a series. As you see below I used 0 as an index for date offsets, but this will give me the wrong answer.

I would prefer a vectorized solution if possible.

I am only using zero because I don't know how to access the internal row index enumeration of this np.select function. I am open to better ways to do this as well if this doesn't seem to be the best method.

This is similar to this question, but it doesn't outline how to do this with conditions. Add months to a date in Pandas

import pandas as pd
import numpy as np

# build dataframe
d = {
    "order_type": ["New", "Renewal", "New"], 
    "ship_date": ["2011-01-15 00:00:00", "2011-01-10 00:00:00", "2011-01-10 00:00:00" ], 
    "expiring_date": ["2011-01-25 00:00:00", "2011-01-20 00:00:00", "2011-01-20 00:00:00"], 
    "contract_term": [12, 24, 36],
    "grace_period": [45, 0, 30]}
df = pd.DataFrame(data=d)


# establish date conditions
conditions = [
                ((df["order_type"] == "New") & (df["grace_period"] == 0)), # 1 
                ((df["order_type"] == "New") & (df["grace_period"] != 0)), # 2
                df["order_type"] == "Renewal" # 3
                ]
choices = [

                pd.to_datetime(df["ship_date"]) + pd.DateOffset(months=df["contract_term"].iloc[0]), # 1
                pd.to_datetime(df["ship_date"]) + pd.DateOffset(months=df["contract_term"].iloc[0]) + pd.DateOffset(days=df["grace_period"].iloc[0]), # 2
                df["expiring_date"] # 3
                ]
# return calculated end date
df["end_date"] = np.select(conditions, choices, default=pd.NaT)

# convert to datetime
df["end_date"] = pd.to_datetime(df["end_date"])

# print dataframe
df

Current incorrect output

index order_type ship_date expiring_date contract_term grace_period end_date
0 New 2011-01-15 00:00:00 2011-01-25 00:00:00 12 45 2012-02-29
1 Renewal 2011-01-10 00:00:00 2011-01-20 00:00:00 24 0 2011-01-20
2 New 2011-01-10 00:00:00 2011-01-20 00:00:00 36 30 2012-02-24

Expected Output

Notice when index = 2 the end_date is correctly calculated

index order_type ship_date expiring_date contract_term grace_period end_date
0 New 2011-01-15 00:00:00 2011-01-25 00:00:00 12 45 2012-02-29
1 Renewal 2011-01-10 00:00:00 2011-01-20 00:00:00 24 0 2011-01-20
2 New 2011-01-10 00:00:00 2011-01-20 00:00:00 36 30 2014-02-09
Joe Rivera
  • 307
  • 2
  • 11
  • perhaps you can express more directly _the intent_? It's hard to understand the logic from the code you provide (especially since it doesn't work as you expected). – Pierre D Jul 12 '21 at 22:19

2 Answers2

0

One issue you'll be facing is adding a variable DateOffset is not vectorized. It will work, but you'll get a PerformanceWarning. Here below, we suppress that warning (but of course that doesn't fix the warning).

This is a well known problem, see e.g. here. You may however ignore it if you just have a few thousands rows.

Edit 2: I figured out a vectorized solution (that I put as a standalone answer to a more directly related question here). This now takes 170ms for 300K rows (a speedup of ~450x over the direct use of pd.offsets.DateOffset).

Updated solution 2

def vadd_offset(dates, months, days):
    ddt = dates.dt
    m = ddt.month - 1 + months
    mb = pd.to_datetime(pd.DataFrame({
        'year': ddt.year + m // 12,
        'month': (m % 12) + 1,
        'day': 1})) + (dates - dates.dt.normalize())
    me = mb + pd.offsets.MonthEnd()
    r = mb + (ddt.day - 1) * pd.Timedelta(days=1)
    r = np.minimum(r, me)
    r = r + days * pd.Timedelta(days=1)
    return r

Speed

n = int(300_000)
dates = pd.Series(
    pd.to_datetime(np.random.randint(
        pd.Timestamp('2000').value,
        pd.Timestamp('2020').value,
        n
    ))
).dt.floor('1s')
months = pd.Series(np.random.randint(0, 120, n))
days = pd.Series(np.random.randint(0, 60, n))

%%time
new = vadd_offset(dates, months, d0)
# CPU times: user 166 ms, sys: 315 µs, total: 166 ms

%%time
sol1 = add_offset(dates, months, d0)  # from updated solution 1
# CPU times: user 7.43 s, sys: 132 ms, total: 7.56 s

Edit 1: the initial version was unduly slow (10x slower than the updated), and it is subtly wrong in some cases (e.g. 2015-07-29 + 59 months should be 2020-06-29, not 2020-06-28). The updated version is still not vectorized, but is faster than using a lambda function and a few other things I tried. It takes roughly 25ms per thousand rows. On 300K rows (all 'New'), it takes ~7.5s.

Updated solution 1

import warnings


def add_offset(dates, months, days):
    with warnings.catch_warnings():
        warnings.simplefilter(action='ignore',
                              category=pd.errors.PerformanceWarning)
        return dates + np.array([
            pd.offsets.DateOffset(months=m, days=d)
            for m, d in zip(months, days)
        ])

Usage:

# build dataframe
df = pd.DataFrame({
    'order_type': ['New', 'Renewal', 'New'], 
    'ship_date': pd.to_datetime(['2011-01-15', '2011-01-10', '2011-01-10']), 
    'expiring_date': pd.to_datetime(['2011-01-25', '2011-01-20', '2011-01-20']), 
    'contract_term': [12, 24, 36],
    'grace_period': [45, 0, 30]
})

isnew = df['order_type'] == 'New'
df['end_date'] = df['ship_date'].where(isnew, df['expiring_date'])
df.loc[isnew, 'end_date'] = add_offset(
    dates=df.loc[isnew, 'end_date'],
    months=df.loc[isnew, 'contract_term'],
    days=df.loc[isnew, 'grace_period']
)

>>> df
  order_type  ship_date expiring_date  contract_term  grace_period   end_date
0        New 2011-01-15    2011-01-25             12            45 2012-02-29
1    Renewal 2011-01-10    2011-01-20             24             0 2011-01-20
2        New 2011-01-10    2011-01-20             36            30 2014-02-09

Speed

n = int(300_000)
dates = pd.Series(
    pd.to_datetime(np.random.randint(
        pd.Timestamp('2000').value,
        pd.Timestamp('2020').value,
        n
    ))
).dt.floor('1s')
months = pd.Series(np.random.randint(0, 120, n))
days = pd.Series(np.random.randint(0, 60, n))

%%time
a = add_offset(dates, months, days)
# CPU times: user 7.43 s, sys: 132 ms, total: 7.56 s

Original answer:

import warnings

isnew = df['order_type'] == 'New'
df['end_date'] = df['ship_date'].where(isnew, df['expiring_date'])
with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
    df.loc[isnew, 'end_date'] += df.loc[isnew, 'contract_term'] * pd.DateOffset(months=1)
    df.loc[isnew, 'end_date'] += df.loc[isnew, 'grace_period'] * pd.DateOffset(days=1)
Pierre D
  • 24,012
  • 7
  • 60
  • 96
  • Hi Pierre thanks for this. You mentioned DateOffsets is not vectorized any thoughts on how to vectorize this? – Joe Rivera Jul 13 '21 at 02:03
  • Not that I am aware of. There is logic to apply that depends on the date (`2020-01-31` + `1 month` --> `2020-02-29`). But see the updated answer. 7.5s for 300K rows. – Pierre D Jul 13 '21 at 16:19
  • ok, I figured out a way to do this in a vectorized way. Please see "Edit 2" at the beginning of my answer. Now 170ms for 300K rows. – Pierre D Jul 13 '21 at 17:53
0

I am providing a DateOffset solution that is not vectorized that I worked out.

This solution took about 87 seconds for 300k records. Definitely interested to see if someone can find a vectorized approach to this one.

import pandas as pd
import numpy as np
import time

# build dataframe
d = {
    "order_type": ["New", "Renewal", "New"], 
    "ship_date": ["2011-01-15 00:00:00", "2011-01-10 00:00:00", "2011-01-10 00:00:00" ], 
    "expiring_date": ["2011-01-25 00:00:00", "2011-01-20 00:00:00", "2011-01-20 00:00:00"], 
    "contract_term": [12, 24, 36],
    "grace_period": [45, 0, 30]}
df = pd.DataFrame(data=d)

# multiply dataframe to 300k records
df = pd.concat([df]*10000, ignore_index=True)

# time it
start_time=time.time()

# establish date conditions
conditions = [
                ((df["order_type"] == "New") & (df["grace_period"] == 0)), # 1 
                ((df["order_type"] == "New") & (df["grace_period"] != 0)), # 2
                df["order_type"] == "Renewal" # 3
                ]
choices = [

                df.apply(lambda x: pd.to_datetime(x['ship_date']) + pd.DateOffset(months=x['contract_term']), axis=1), # 1
                df.apply(lambda x: pd.to_datetime(x['ship_date']) + pd.DateOffset(months=x['contract_term']) + pd.DateOffset(days=x['grace_period']), axis=1), # 2
                df["expiring_date"] # 3
                ]

df["end_date"] = np.select(conditions, choices, default=pd.NaT)

end_time=time.time()
print('Apply lambda DateOffset solution {} seconds'.format(end_time-start_time))

Joe Rivera
  • 307
  • 2
  • 11