
I have a pandas DataFrame on which I wish to perform the same rolling operation for different groups within the data. Consider the following df (see bottom of question for code to construct) with four columns:

id      date       category   target
1    2017-01-01      'a'        0
1    2017-01-01      'b'        0
1    2017-01-21      'a'        1
1    2017-01-21      'b'        1
1    2017-10-01      'a'        0
1    2017-10-01      'b'        0
2    2017-01-01      'a'        1    
2    2017-01-01      'b'        1    
2    2017-01-21      'a'        0
2    2017-01-21      'b'        0
2    2017-10-01      'a'        0
2    2017-10-01      'b'        0

What I would like is an operation which calculates a boolean for each unique id-date pair indicating whether the target column is 1 within the 6 months after the given date. So for the provided df I would expect a result which looks like:

id      date       one_within_6m
1    2017-01-01       True
1    2017-01-21       False
1    2017-10-01       False
2    2017-01-01       False
2    2017-01-21       False
2    2017-10-01       False

I can do this with a for loop that iterates over the rows and looks 6 months ahead for each date, but it is too slow given the size of my dataset.
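For reference, the brute-force version I have in mind looks roughly like this (a sketch only; it approximates "6 months" as 180 days and rebuilds the sample df inline):

```python
import numpy as np
import pandas as pd

# Rebuild the question's DataFrame.
df = pd.DataFrame({
    'id': np.repeat([1, 2], 6),
    'date': pd.to_datetime(['2017-01-01', '2017-01-01', '2017-01-21',
                            '2017-01-21', '2017-10-01', '2017-10-01'] * 2),
    'category': ['a', 'b'] * 6,
    'target': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
})

# Brute-force loop: for every unique (id, date) pair, look ahead
# up to 180 days and check whether any target equals 1.
rows = []
for (i, d), _ in df.groupby(['id', 'date']):
    ahead = df[(df['id'] == i) &
               (df['date'] > d) &
               (df['date'] <= d + pd.Timedelta(days=180))]
    rows.append({'id': i, 'date': d,
                 'one_within_6m': bool((ahead['target'] == 1).any())})
result = pd.DataFrame(rows)
```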

So, I was wondering whether it is possible to group by id, set date as the index, and use a rolling operation over a time window to do this? For example:

df_grouped = df.groupby(['id', 'date'])

# … do something to set date as index

# ... define some custom function

df_grouped.rolling('6m', on='target').apply(some_custom_function)

Some notes:

  • There can be multiple 1s in the 6-month window; this should still just produce True for the current date.

  • In my head, some_custom_function will check whether the sum of target over the next 6 months (excluding the current date) is greater than or equal to 1.
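A minimal sketch of what such a function might look like, assuming the values arrive as an array with the current date's row last (which direction the window runs in is exactly the part I am unsure about):

```python
import numpy as np

def some_custom_function(window):
    # Hypothetical: 'window' holds the target values in the rolling
    # window, with the current date's row last; drop it so the check
    # only looks forward, then test for at least one 1.
    return window[:-1].sum() >= 1

# e.g. a window containing a future 1 (the trailing 0.0 is "today"):
some_custom_function(np.array([0.0, 1.0, 0.0]))
```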

Supporting code:

To produce the DataFrame instance used in this question:

import numpy as np
import pandas as pd

ids = np.concatenate([np.ones(6), np.ones(6) + 1])
dates = ['2017-01-01','2017-01-01','2017-01-21','2017-01-21',
         '2017-10-01','2017-10-01','2017-01-01','2017-01-01',
         '2017-01-21','2017-01-21','2017-10-01','2017-10-01']
categories = ['a','b','a','b','a','b','a','b','a','b','a','b']
targets = [0,0,1,1,0,0,1,1,0,0,0,0]

df = pd.DataFrame({'id':ids,
                   'date':dates,
                   'category':categories,
                   'target':targets})

df['date'] = pd.to_datetime(df['date'])
FChm

1 Answer


I have found a workable solution, but it only works if each date is unique within an id. That is the case in my data after some additional processing:

new_df = df.groupby(['id', 'date'], as_index=False)['target'].mean()

which returns:

    id      date      target
0   1.0   2017-01-01    0
1   1.0   2017-01-21    1
2   1.0   2017-10-01    0
3   2.0   2017-01-01    1
4   2.0   2017-01-21    0
5   2.0   2017-10-01    0

I can then use the rolling method on a groupby object to get the desired result:

df = new_df.set_index('date')

df.iloc[::-1].groupby('id')['target'].rolling(window='180D',
    center=False).apply(lambda x: x[:-1].sum())

Comparing the result with 0 then gives the boolean column I am after.

There are two tricks here:

  1. I reverse the order of the dates (.iloc[::-1]) to take a forward looking window; this has been suggested in other SO questions.

  2. I drop the last entry of each window before summing, which removes the 'current' date so the sum only looks forward.

The second 'hack' means it only works when there are no repeats of dates for a given id.

I would be interested in making a more robust solution (e.g., where dates are repeated for an id).
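One direction that sidesteps the unique-date restriction entirely (a sketch under my own assumptions; the 180-day cutoff and column names are mine): collapse target to a per-(id, date) max, self-merge on id, and keep only strictly-future dates inside the window.

```python
import numpy as np
import pandas as pd

# Rebuild the question's DataFrame (category column omitted for brevity).
df = pd.DataFrame({
    'id': np.repeat([1, 2], 6),
    'date': pd.to_datetime(['2017-01-01', '2017-01-01', '2017-01-21',
                            '2017-01-21', '2017-10-01', '2017-10-01'] * 2),
    'target': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
})

# Collapse repeated dates: max() keeps a 1 if any row that day had one.
agg = df.groupby(['id', 'date'], as_index=False)['target'].max()

# Self-merge on id pairs every date with every other date for that id,
# then keep only strictly-future dates within 180 days.
pairs = agg.merge(agg, on='id', suffixes=('', '_future'))
in_window = ((pairs['date_future'] > pairs['date']) &
             (pairs['date_future'] <= pairs['date'] + pd.Timedelta(days=180)))
pairs['hit'] = in_window & (pairs['target_future'] == 1)

result = (pairs.groupby(['id', 'date'])['hit']
               .any()
               .reset_index(name='one_within_6m'))
```

The self-merge is quadratic in the number of dates per id, so it is only a reasonable trade-off when each id has modestly many dates.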
