I have a pandas dataframe which I wish to perform the same rolling operation on different groups within the data. Consider the following df
(see bottom of question for code to construct) with four columns:
id date category target
1 2017-01-01 'a' 0
1 2017-01-01 'b' 0
1 2017-01-21 'a' 1
1 2017-01-21 'b' 1
1 2017-10-01 'a' 0
1 2017-10-01 'b' 0
2 2017-01-01 'a' 1
2 2017-01-01 'b' 1
2 2017-01-21 'a' 0
2 2017-01-21 'b' 0
2 2017-10-01 'a' 0
2 2017-10-01 'b' 0
What I would like is an operation which calculates a boolean for each unique id-date pair indicating whether the target column is 1 within 6 months of the given date. So for the provided df I would expect a result which looks like:
id date one_within_6m
1 2017-01-01 True
1 2017-01-21 False
1 2017-10-01 False
2 2017-01-01 False
2 2017-01-21 False
2 2017-10-01 False
I can do this with a for loop iterating over the rows and looking 6 months in advance for each visit, but it is too slow due to the large size of my dataset.
So, I was wondering whether it was possible to groupby id the date and do a rolling operation on the time window to look at this? For example:
df_grouped = df.groupby(['id', 'date'])
# … do something to set date as index
# ... define some custom function
df_grouped.rolling('6m', on='target').apply(some_custom_function)
Some notes:
There can be multiple '1s' in the 6 month window, this should just be treated as True for the current date.
In my head
some_custom_function
will check whether the sum of target over the next 6 months (excluding current date) is greater than 1.
Supporting code:
To produce the DataFrame instance used in this question:
ids = np.concatenate([np.ones(6), np.ones(6)+1])
dates = ['2017-01-01','2017-01-01','2017-01-21','2017-01-21',
'2017-10-01','2017-10-01','2017-01-01','2017-01-01',
'2017-01-21','2017-01-21','2017-10-01','2017-10-01']
categories = ['a','b','a','b','a','b','a','b','a','b','a','b']
targets = [0,0,1,1,0,0,1,1,0,0,0,0]
df = pd.DataFrame({'id':ids,
'date':dates,
'category':categories,
'target':targets})
df['date'] = pd.to_datetime(df['date'])