Given
import pandas as pd
from datetime import datetime, timedelta
from dateutil.parser import parse as parse_date
start = parse_date("Jan 1 2021")
records = []
for i in range(15):
    records.append(dict(date=start + timedelta(days=i), t=i))
df = pd.DataFrame(records).set_index('date')
events = pd.DataFrame([dict(date=parse_date('Jan 5 2021')), dict(date=parse_date('Jan 12 2021'))]).set_index('date')
print(df)
t
date
2021-01-01 0
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
2021-01-06 5
2021-01-07 6
2021-01-08 7
2021-01-09 8
2021-01-10 9
2021-01-11 10
2021-01-12 11
2021-01-13 12
2021-01-14 13
2021-01-15 14
print(events)
Empty DataFrame
Columns: []
Index: [2021-01-05 00:00:00, 2021-01-12 00:00:00]
I want to somehow join these DataFrames so that df gets a new column 'days_since_event': for each date, the number of days since the most recent event, with dates before the first event counting down to it as negative numbers. Like so:
t days_since_event
date
2021-01-01 0 -4
2021-01-02 1 -3
2021-01-03 2 -2
2021-01-04 3 -1
2021-01-05 4 0
2021-01-06 5 1
2021-01-07 6 2
2021-01-08 7 3
2021-01-09 8 4
2021-01-10 9 5
2021-01-11 10 6
2021-01-12 11 0
2021-01-13 12 1
2021-01-14 13 2
2021-01-15 14 3
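To make the sign convention explicit: the new column is the signed day difference to the most recent event, or to the first upcoming event for rows that come before any event, e.g.

from dateutil.parser import parse as parse_date
print((parse_date('Jan 1 2021') - parse_date('Jan 5 2021')).days)   # -4, counting down to the first event
print((parse_date('Jan 11 2021') - parse_date('Jan 5 2021')).days)  # 6, counting up from the latest event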
I don't see any obvious vectorized way of doing this.
I thought about setting up a column of -1s, doing a reverse cumulative sum on that column, plus some other magic, but I haven't come up with a solution yet.
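Just to sketch the reverse-cumsum part of that idea (a fragment only, for the stretch before the first event, reusing the df and events defined above):

# reverse the rows before the first event, cumsum a column of -1s, reverse back
pre = df[df.index < events.index[0]]
countdown = pd.Series(-1, index=pre.index)[::-1].cumsum()[::-1]
print(countdown)  # -4, -3, -2, -1 for Jan 1 through Jan 4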
EDIT 1: I have a solution based on Pandas dataframe - running sum with reset
df['reset'] = 0
df['val'] = -1
df.loc[df.index > events.index[0], 'val'] = 1
df.loc[df.index.isin(events.index), 'val'] = 0
df.loc[df.index.isin(events.index), 'reset'] = 1
df['cumsum'] = df['reset'].cumsum()   # group label: which event each row follows
df['days_since_event'] = df.groupby(['cumsum'])['val'].cumsum()
# the rows before the first event have to count down towards it, so they need a
# reverse cumsum instead of the forward one
before = df['cumsum'] == 0
df.loc[before, 'days_since_event'] = df.loc[before, 'val'][::-1].cumsum()[::-1]
df.drop(['reset', 'cumsum', 'val'], axis=1, inplace=True)
but that makes my head hurt. I don't really like the df.index > events.index[0] part, or the extra reverse cumsum needed for the rows before the first event. Maybe there's a better solution that I'm missing.
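Is something along these lines with pd.merge_asof perhaps the more idiomatic route? A rough sketch reusing the df and events from above (event_date is just my own label for the event column after reset_index):

# match each date to the most recent event at or before it, and fall back to the
# next upcoming event for dates that come before the first event
left = df.reset_index()
ev = events.reset_index().rename(columns={'date': 'event_date'})
back = pd.merge_asof(left, ev, left_on='date', right_on='event_date', direction='backward')
fwd = pd.merge_asof(left, ev, left_on='date', right_on='event_date', direction='forward')
ref = back['event_date'].fillna(fwd['event_date'])
df['days_since_event'] = (left['date'] - ref).dt.days.to_numpy()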