pandas - adding new aggregated feature

Question

I have this dataframe in pandas:

   day customer  amount
0    1    cust1     500
1    2    cust2     100
2    1    cust1      50
3    2    cust1     100
4    2    cust2     250
5    6    cust1      20

For convenience:

df = pd.DataFrame({'day': [1, 2, 1, 2, 2, 6],
                   'customer': ['cust1', 'cust2', 'cust1', 'cust1', 'cust2', 'cust1'],
                   'amount': [500, 100, 50, 100, 250, 20]})

I would like to create a new column 'amount2days' so as to aggragate amounts per customer for the last two days, to get the following dataframe:

   day customer  amount    amount2days   ----------------------------
0    1    cust1     500    500           (no past transactions)
1    2    cust2     100    100           (no past transactions)
2    1    cust1      50    550           (500 + 50 = rows 0,2 
3    2    cust1     100    650           (500 + 50 + 100, rows 0,2,3)
4    2    cust2     250    350           (100 + 250, rows 1,4) 
5    6    cust1      20    20            (notice day is 6, and no day=5 for cust1)

i.e. I would like to perform the following (pseudo) code:

df['amount2days'] = df_of_past_2_days['amount'].sum()

for each row. What is the most convenient way to do so?

The summing I wish to peroform is over the day, but days does not necessarily have to increment in each new row, as shown in the example. Still I want to sum amounts over the past 2 days.

Does this answer your question? [pandas rolling sum of last five minutes](https://stackoverflow.com/questions/27062652/pandas-rolling-sum-of-last-five-minutes) — code11, Oct 06 '20 at 13:41

jezrael · Accepted Answer · 2020-10-08T07:19:57.323

Use groupby with Series.rolling and sum

Notice: Here is necessary add DataFrame.reset_index for avoid wrong alignement of data:

df['amount2days'] = (df.groupby('customer')['amount']
                       .rolling(2, min_periods=0)
                       .sum()
                       .reset_index(level=0, drop=True))
print (df)
   day customer  amount  amount2days
1    1    cust1     500        500.0
2    2    cust1     100        600.0
3    3    cust1     250        350.0

Why not use .to_numpy here? Because if not default index then output should be assigned wrongly - check sample below:

df = pd.DataFrame({'day': {0: 1, 2: 2, 5: 3, 1: 1, 6: 2, 4: 3}, 'customer': {0: 'cust2', 2: 'cust2', 5: 'cust2', 1: 'cust1', 6: 'cust1', 4: 'cust1'}, 'amount': {0: 5000, 2: 1000, 5: 2500, 1: 500, 6: 100, 4: 250}})
print (df)
   day customer  amount
0    1    cust2    5000
2    2    cust2    1000
5    3    cust2    2500
1    1    cust1     500
6    2    cust1     100
4    3    cust1     250

df['amount2days'] = (df.groupby('customer', sort=False).amount
                       .rolling(2, min_periods=0)
                       .sum()
                       .to_numpy())

df['amount2days1'] = (df.groupby('customer')['amount']
                       .rolling(2, min_periods=0)
                       .sum()
                       .reset_index(level=0, drop=True))
print (df)
   day customer  amount  amount2days  amount2days1
0    1    cust2    5000        500.0        5000.0
2    2    cust2    1000        600.0        6000.0
5    3    cust2    2500        350.0        3500.0
1    1    cust1     500       5000.0         500.0
6    2    cust1     100       6000.0         600.0
4    3    cust1     250       3500.0         350.0

EDIT: General solution:

def f(x):
    N = 1
    for i in pd.unique(x['day']):
        y = x[x['day'].between(i - N, i)]
        x.loc[y.index[-1], 'amountNdays'] = y['amount'].sum()
    
    return x

df = df.groupby('customer').apply(f)
df['amountNdays'] = df['amountNdays'].fillna(df['amount'])
print (df)
   day customer  amount  amountNdays
0    1    cust1     500        500.0
1    2    cust2     100        100.0
2    1    cust1      50        550.0
3    2    cust1     100        650.0
4    2    cust2     250        350.0
5    6    cust1      20         20.0

Thanks. I've updated my question to be more clear. My data does not necessarily increments 'day' in each new row, still I want to sum over 2 days backwards. Does simple rolling will work in that case? — chendoy, Oct 06 '20 at 15:23
still not working in the case that there is nothing to sum for backwards rows, I edited my example once again (should be just 20, but this method gives 120 there). Thank you. — chendoy, Oct 07 '20 at 15:04
@jezarel with "there" I mean the 'amount2days' of the last row. — chendoy, Oct 07 '20 at 15:12
and maybe there is a way to generalize this to amount_x_days easily? — chendoy, Oct 07 '20 at 15:52

yatu · Answer 2 · 2020-10-06T14:23:24.890

1

You can use panda's rolling for moving window operations (depending on the version of pandas, reset_index as in jezrael's answer will be safer):

df['amount2days'] = (df.groupby('customer', sort=False).amount
                       .rolling(2, min_periods=0)
                       .sum()
                       .to_numpy()) 

print(df)
   day customer  amount  amount2days
1    1    cust1     500        500.0
2    2    cust1     100        600.0
3    3    cust1     250        350.0

edited Oct 06 '20 at 14:23

answered Oct 06 '20 at 13:40

yatu

86,083
12
84
139

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/222636/discussion-on-answer-by-yatu-pandas-adding-new-aggregated-feature). – Samuel Liew Oct 07 '20 at 07:09

pandas - adding new aggregated feature

2 Answers2