6

I want to sum the prior N periods of data for each group. I have seen how to do each individually (sum by group, or sum prior N periods), but can't figure out a clean way to do both together.

I'm currently doing the following:

import pandas as pd

sample_data = {'user': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],\
                'clicks': [0,1,2,3,4,5,6,7,8,9]}
df = pd.DataFrame(sample_data)
df['clicks.1'] = df.groupby(['user'])['clicks'].shift(1)
df['clicks.2'] = df.groupby(['user'])['clicks'].shift(2)
df['clicks.3'] = df.groupby(['user'])['clicks'].shift(3)
df['total_clicks_prior3'] = df[['clicks.1','clicks.2', 'clicks.3']].sum(axis=1)

I don't want the 3 intermediate lagged columns, I just want the sum of those, so my desired output is:

>>> df[['clicks','user','total_clicks_prior3']]
   clicks user  total_clicks_prior3
0       0    a                  NaN
1       1    a                  0.0
2       2    a                  1.0
3       3    a                  3.0
4       4    a                  6.0
5       5    b                  NaN
6       6    b                  5.0
7       7    b                 11.0
8       8    b                 18.0
9       9    b                 21.0

Note: I could obviously drop the 3 columns after creating them, but given that I will be creating multiple columns of different numbers of lagged periods, I feel like there has to be an easier way.

Tchotchke
  • 3,061
  • 3
  • 22
  • 37
  • 1
    In your example, you don't seem to take into account the current row, but instead use the previous 3, which means the last row is ignored completely in results. Is this intended? – user3483203 Sep 04 '19 at 17:55
  • Yes, that is intended. In the real world I suppose I actually may not have the value for the current observation - good catch. – Tchotchke Sep 04 '19 at 17:59

2 Answers2

7

This is groupby + rolling + shift

df.groupby('user')['clicks'].rolling(3, min_periods=1).sum().groupby(level=0).shift()

user
a     0     NaN
      1     0.0
      2     1.0
      3     3.0
      4     6.0
b     5     NaN
      6     5.0
      7    11.0
      8    18.0
      9    21.0
Name: clicks, dtype: float64
user3483203
  • 50,081
  • 9
  • 65
  • 94
1

If you have a solution that works for each group, you can use apply to use it on the groupby object. For instance, you linked to a question that has df['A'].rolling(min_periods=1, window=11).sum() as an answer. If that does what you want on the subgroups, you can do

df.groupby('user').apply(lambda x: x['clicks'].rolling(min_periods=1, window=11).sum())
Acccumulation
  • 3,491
  • 1
  • 8
  • 12