2

Given

import pandas as pd
from datetime import datetime, timedelta
from dateutil.parser import parse as parse_date

start = parse_date("Jan 1 2021")

records = []
for i in range(15):
    records.append(dict(date=start+timedelta(days=i), t=i))
    
df = pd.DataFrame(records).set_index('date')

events = pd.DataFrame([dict(date=parse_date('Jan 5 2021')), dict(date=parse_date('Jan 12 2021'))]).set_index('date')
print(df)
             t
date          
2021-01-01   0
2021-01-02   1
2021-01-03   2
2021-01-04   3
2021-01-05   4
2021-01-06   5
2021-01-07   6
2021-01-08   7
2021-01-09   8
2021-01-10   9
2021-01-11  10
2021-01-12  11
2021-01-13  12
2021-01-14  13
2021-01-15  14
          

print(events)
Empty DataFrame
Columns: []
Index: [2021-01-05 00:00:00, 2021-01-12 00:00:00]

I want to somehow join these Dataframes so that df has a new column 'days_since_event'

Like so

             t     days_since_event
date          
2021-01-01   0     -4
2021-01-02   1     -3
2021-01-03   2     -2
2021-01-04   3     -1
2021-01-05   4     0
2021-01-06   5     1
2021-01-07   6     2
2021-01-08   7     3
2021-01-09   8     4
2021-01-10   9     5
2021-01-11  10     6
2021-01-12  11     0
2021-01-13  12     1
2021-01-14  13     2 
2021-01-15  14     3

I don't see any obvious vectorized way of doing this.

I thought maybe setting up a column with -1, doing a reverse cumulative sum on that column, plus some other magic but I haven't come up with a solution yet.

EDIT 1: I have a solution based on Pandas dataframe - running sum with reset

df['reset'] = 0
df['val'] = -1
df.loc[df.index > events.index[0], 'val'] = 1
df.loc[df.index.isin(events.index), 'val'] = 0
df.loc[df.index.isin(events.index), 'reset'] = 1
df['cumsum'] = df['reset'].cumsum()
df['days_since_event'] = df.groupby(['cumsum'])['val'].cumsum()
df.drop(['reset', 'cumsum', 'val'], axis=1, inplace=True)

but that makes my head hurt. I don't really like the df.index > events.index[0] part. Maybe there's a better solution that I'm missing

Shaun
  • 3,777
  • 4
  • 25
  • 46

1 Answers1

1

tdy's Answer is definitely a good solution if the data is exactly as in the sample, so if there is a row for each day...

Personally, I would prefer to do it like this:

df = DF(dict(date= [to_datetime("20210101") + to_timedelta(i, unit= "D") for i in range(15)]))
df["event"] = None
df
         date event
0  2021-01-01  None
1  2021-01-02  None
2  2021-01-03  None
3  2021-01-04  None
4  2021-01-05  None
5  2021-01-06  None
6  2021-01-07  None
7  2021-01-08  None
8  2021-01-09  None
9  2021-01-10  None
10 2021-01-11  None
11 2021-01-12  None
12 2021-01-13  None
13 2021-01-14  None
14 2021-01-15  None

# Set events
event = [to_datetime("20210105"), to_datetime("20210112")]

Then set the dates in the event column and do a simple ffill and bfill (in that order)

df.loc[df["date"].isin(event), "event"] = event
df
         date                event
0  2021-01-01                 None
1  2021-01-02                 None
2  2021-01-03                 None
3  2021-01-04                 None
4  2021-01-05  2021-01-05 00:00:00
5  2021-01-06                 None
6  2021-01-07                 None
7  2021-01-08                 None
8  2021-01-09                 None
9  2021-01-10                 None
10 2021-01-11                 None
11 2021-01-12  2021-01-12 00:00:00
12 2021-01-13                 None
13 2021-01-14                 None
14 2021-01-15                 None


df["event"] = df["event"].ffill().bfill()
df
         date      event
0  2021-01-01 2021-01-05
1  2021-01-02 2021-01-05
2  2021-01-03 2021-01-05
3  2021-01-04 2021-01-05
4  2021-01-05 2021-01-05
5  2021-01-06 2021-01-05
6  2021-01-07 2021-01-05
7  2021-01-08 2021-01-05
8  2021-01-09 2021-01-05
9  2021-01-10 2021-01-05
10 2021-01-11 2021-01-05
11 2021-01-12 2021-01-12
12 2021-01-13 2021-01-12
13 2021-01-14 2021-01-12
14 2021-01-15 2021-01-12

Finish it all of :

df["days_since"] = df["date"] - df["event"]
df
         date      event days_since
0  2021-01-01 2021-01-05    -4 days
1  2021-01-02 2021-01-05    -3 days
2  2021-01-03 2021-01-05    -2 days
3  2021-01-04 2021-01-05    -1 days
4  2021-01-05 2021-01-05     0 days
5  2021-01-06 2021-01-05     1 days
6  2021-01-07 2021-01-05     2 days
7  2021-01-08 2021-01-05     3 days
8  2021-01-09 2021-01-05     4 days
9  2021-01-10 2021-01-05     5 days
10 2021-01-11 2021-01-05     6 days
11 2021-01-12 2021-01-12     0 days
12 2021-01-13 2021-01-12     1 days
13 2021-01-14 2021-01-12     2 days
14 2021-01-15 2021-01-12     3 days

Clean up (and change to integers if you want):

del df["event"]; df["days_since"] = df["days_since"].dt.days
df
         date  days_since
0  2021-01-01          -4
1  2021-01-02          -3
2  2021-01-03          -2
3  2021-01-04          -1
4  2021-01-05           0
5  2021-01-06           1
6  2021-01-07           2
7  2021-01-08           3
8  2021-01-09           4
9  2021-01-10           5
10 2021-01-11           6
11 2021-01-12           0
12 2021-01-13           1
13 2021-01-14           2
14 2021-01-15           3
Stryder
  • 848
  • 6
  • 9