How to append column to Panda dataframe based on a rolling window lookup?

Question

I have a massive 5M row list of customer, period, and orders and need to attach rolling window lookups of 3, 6, and 12 months. Below is a sample of the data:

I have a dataframe dftest1:

{'period': {0: 201810, 1: 201811, 2: 201812, 3: 201901, 4: 201902, 5: 201903, 6: 201904, 7: 201905, 8: 201906, 9: 201907, 10: 201908, 11: 201909, 12: 201910, 13: 201911, 14: 201912, 15: 202001, 16: 202002, 17: 202003, 18: 202004, 19: 202005, 20: 202006}, 'customer': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'ABC', 6: 'ABC', 7: 'ABC', 8: 'ABC', 9: 'ABC', 10: 'ABC', 11: 'ABC', 12: 'ABC', 13: 'ABC', 14: 'ABC', 15: 'ABC', 16: 'ABC', 17: 'ABC', 18: 'ABC', 19: 'ABC', 20: 'ABC'}, 'orders': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, 10: 0.0, 11: 1.0, 12: 0.0, 13: 0.0, 14: 0.0, 15: 1.0, 16: 0.0, 17: 2.0, 18: 1.0, 19: 1.0, 20: 1.0}}

Now the following runs and delivers exactly what I want but it takes a very long time to run on my full dataset:

dftest1['countOfOrdersLast3months']=''
dftest1['countOfOrdersLast6months']=''
dftest1['countOfOrdersLast12months']=''
for x, y in dftest1.iterrows():
    customer = dftest1['customer'].values[x]
    currentDate = dftest1['period'].values[x]
    dftest1['countOfOrdersLast3months'].values[x] = dftest1[(dftest1['customer']==customer) & ((dftest1['period'] > currentDate-3) & (dftest1['period'] <= currentDate))]['orders'].sum()
    dftest1['countOfOrdersLast6months'].values[x]  = dftest1[(dftest1['customer']==customer) & ((dftest1['period'] > currentDate-6) & (dftest1['period'] <= currentDate))]['orders'].sum()
    dftest1['countOfOrdersLast12months'].values[x] = dftest1[(dftest1['customer']==customer) & ((dftest1['period'] > currentDate-12) & (dftest1['period'] <= currentDate))]['orders'].sum()
dftest1.head()

Is there a better way to do this faster?

sorry that was a mistake when posting the question. I fixed it. should be dftest1. — user3486773, Nov 25 '20 at 21:37

score 1 · Answer 1 · answered Nov 25 '20 at 21:41

Assuming you have all the continuous months. Let's do a for loop on the months with groupby:

groups = dftest1.groupby('customer')

for months in [3,6,12]:
    dftest1[f'countOrderLast{months}months'] = (groups['orders']
                                             .rolling(months).sum()
                                             .reset_index('customer',drop=True)
                                          )

Output:

      period  customer      orders    countOrderLast3months    countOrderLast6months    countOrderLast12months
--  --------  ----------  --------  -----------------------  -----------------------  ------------------------
 0    201810  ABC                0                      nan                      nan                       nan
 1    201811  ABC                0                      nan                      nan                       nan
 2    201812  ABC                0                        0                      nan                       nan
 3    201901  ABC                0                        0                      nan                       nan
 4    201902  ABC                0                        0                      nan                       nan
 5    201903  ABC                0                        0                        0                       nan
 6    201904  ABC                0                        0                        0                       nan
 7    201905  ABC                0                        0                        0                       nan
 8    201906  ABC                0                        0                        0                       nan
 9    201907  ABC                0                        0                        0                       nan
10    201908  ABC                0                        0                        0                       nan
11    201909  ABC                1                        1                        1                         1
12    201910  ABC                0                        1                        1                         1
13    201911  ABC                0                        1                        1                         1
14    201912  ABC                0                        0                        1                         1
15    202001  ABC                1                        1                        2                         2
16    202002  ABC                0                        1                        2                         2
17    202003  ABC                2                        3                        3                         4
18    202004  ABC                1                        3                        4                         5
19    202005  ABC                1                        4                        5                         6
20    202006  ABC                1                        3                        6                         7

could you share a secret on how you generate such a pretty looking output. Is there more to it then cut-and-pasting from you jupyter notebook or whatever? — piterbarg, Nov 25 '20 at 21:48
Ah I didn't have that package and is why it didn't run for me. — user3486773, Nov 25 '20 at 21:50
I imported tabulate and still this doesn't run. Invalid syntax here "}months']" — user3486773, Nov 25 '20 at 21:55
@user3486773 you don’t need tabulate to run the code. It’s just for formatting the text. I used f-string which is available in Python 3. Do you use Python 2? — Quang Hoang, Nov 25 '20 at 21:57
I use python 3.5, but also have a 2.7 kernel. Looks like this is a 3.6 feature. Can you adjust the solution to run in 3.5? thank you! — user3486773, Nov 25 '20 at 21:59
@user3486773 I take it back, f-string might have been available in 3.6+. replace that `f'...'` with `'countOrderLast{}months'.format(months)` — Quang Hoang, Nov 25 '20 at 22:02
@user3486773 more info on [this question](https://stackoverflow.com/questions/55182209/f-string-invalid-syntax-in-python-3-5). — Quang Hoang, Nov 25 '20 at 22:03

score 1 · Accepted Answer · answered Nov 25 '20 at 21:44

You can try something like this

df['cumorders'] = df.groupby('customer')['orders'].cumsum()
df['countOfOrdersLast3months'] = df.groupby('customer')['cumorders'].diff(3)

and similar for 6m etc

Here in the first line we calculate cumulative orders (per customer), and in the second line we take 3-step differences of the cumorders, ie orders within 3 months

This is what I get from your example

period  customer    orders  cumorders   countOfOrdersLast3months
0   201810  ABC 0.0 0.0 NaN
1   201811  ABC 0.0 0.0 NaN
2   201812  ABC 0.0 0.0 NaN
3   201901  ABC 0.0 0.0 0.0
4   201902  ABC 0.0 0.0 0.0
5   201903  ABC 0.0 0.0 0.0
6   201904  ABC 0.0 0.0 0.0
7   201905  ABC 0.0 0.0 0.0
8   201906  ABC 0.0 0.0 0.0
9   201907  ABC 0.0 0.0 0.0
10  201908  ABC 0.0 0.0 0.0
11  201909  ABC 1.0 1.0 1.0
12  201910  ABC 0.0 1.0 1.0
13  201911  ABC 0.0 1.0 1.0
14  201912  ABC 0.0 1.0 0.0
15  202001  ABC 1.0 2.0 1.0
16  202002  ABC 0.0 2.0 1.0
17  202003  ABC 2.0 4.0 3.0
18  202004  ABC 1.0 5.0 3.0
19  202005  ABC 1.0 6.0 4.0
20  202006  ABC 1.0 7.0 3.0

I like it. Avoids loops altogether. Just what I needed. Thank you. — user3486773, Nov 25 '20 at 21:49

How to append column to Panda dataframe based on a rolling window lookup?

2 Answers2