3

I have a massive 5M row list of customer, period, and orders and need to attach rolling window lookups of 3, 6, and 12 months. Below is a sample of the data:

I have a dataframe dftest1:

{'period': {0: 201810, 1: 201811, 2: 201812, 3: 201901, 4: 201902, 5: 201903, 6: 201904, 7: 201905, 8: 201906, 9: 201907, 10: 201908, 11: 201909, 12: 201910, 13: 201911, 14: 201912, 15: 202001, 16: 202002, 17: 202003, 18: 202004, 19: 202005, 20: 202006}, 'customer': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'ABC', 6: 'ABC', 7: 'ABC', 8: 'ABC', 9: 'ABC', 10: 'ABC', 11: 'ABC', 12: 'ABC', 13: 'ABC', 14: 'ABC', 15: 'ABC', 16: 'ABC', 17: 'ABC', 18: 'ABC', 19: 'ABC', 20: 'ABC'}, 'orders': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, 10: 0.0, 11: 1.0, 12: 0.0, 13: 0.0, 14: 0.0, 15: 1.0, 16: 0.0, 17: 2.0, 18: 1.0, 19: 1.0, 20: 1.0}}

Now the following runs and delivers exactly what I want but it takes a very long time to run on my full dataset:

dftest1['countOfOrdersLast3months']=''
dftest1['countOfOrdersLast6months']=''
dftest1['countOfOrdersLast12months']=''
for x, y in dftest1.iterrows():
    customer = dftest1['customer'].values[x]
    currentDate = dftest1['period'].values[x]
    dftest1['countOfOrdersLast3months'].values[x] = dftest1[(dftest1['customer']==customer) & ((dftest1['period'] > currentDate-3) & (dftest1['period'] <= currentDate))]['orders'].sum()
    dftest1['countOfOrdersLast6months'].values[x]  = dftest1[(dftest1['customer']==customer) & ((dftest1['period'] > currentDate-6) & (dftest1['period'] <= currentDate))]['orders'].sum()
    dftest1['countOfOrdersLast12months'].values[x] = dftest1[(dftest1['customer']==customer) & ((dftest1['period'] > currentDate-12) & (dftest1['period'] <= currentDate))]['orders'].sum()
dftest1.head()

Is there a better way to do this faster?

user3486773
  • 1,174
  • 3
  • 25
  • 50

2 Answers2

1

Assuming you have all the continuous months. Let's do a for loop on the months with groupby:

groups = dftest1.groupby('customer')

for months in [3,6,12]:
    dftest1[f'countOrderLast{months}months'] = (groups['orders']
                                             .rolling(months).sum()
                                             .reset_index('customer',drop=True)
                                          )

Output:

      period  customer      orders    countOrderLast3months    countOrderLast6months    countOrderLast12months
--  --------  ----------  --------  -----------------------  -----------------------  ------------------------
 0    201810  ABC                0                      nan                      nan                       nan
 1    201811  ABC                0                      nan                      nan                       nan
 2    201812  ABC                0                        0                      nan                       nan
 3    201901  ABC                0                        0                      nan                       nan
 4    201902  ABC                0                        0                      nan                       nan
 5    201903  ABC                0                        0                        0                       nan
 6    201904  ABC                0                        0                        0                       nan
 7    201905  ABC                0                        0                        0                       nan
 8    201906  ABC                0                        0                        0                       nan
 9    201907  ABC                0                        0                        0                       nan
10    201908  ABC                0                        0                        0                       nan
11    201909  ABC                1                        1                        1                         1
12    201910  ABC                0                        1                        1                         1
13    201911  ABC                0                        1                        1                         1
14    201912  ABC                0                        0                        1                         1
15    202001  ABC                1                        1                        2                         2
16    202002  ABC                0                        1                        2                         2
17    202003  ABC                2                        3                        3                         4
18    202004  ABC                1                        3                        4                         5
19    202005  ABC                1                        4                        5                         6
20    202006  ABC                1                        3                        6                         7
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
1

You can try something like this

df['cumorders'] = df.groupby('customer')['orders'].cumsum()
df['countOfOrdersLast3months'] = df.groupby('customer')['cumorders'].diff(3)

and similar for 6m etc

Here in the first line we calculate cumulative orders (per customer), and in the second line we take 3-step differences of the cumorders, ie orders within 3 months

This is what I get from your example

period  customer    orders  cumorders   countOfOrdersLast3months
0   201810  ABC 0.0 0.0 NaN
1   201811  ABC 0.0 0.0 NaN
2   201812  ABC 0.0 0.0 NaN
3   201901  ABC 0.0 0.0 0.0
4   201902  ABC 0.0 0.0 0.0
5   201903  ABC 0.0 0.0 0.0
6   201904  ABC 0.0 0.0 0.0
7   201905  ABC 0.0 0.0 0.0
8   201906  ABC 0.0 0.0 0.0
9   201907  ABC 0.0 0.0 0.0
10  201908  ABC 0.0 0.0 0.0
11  201909  ABC 1.0 1.0 1.0
12  201910  ABC 0.0 1.0 1.0
13  201911  ABC 0.0 1.0 1.0
14  201912  ABC 0.0 1.0 0.0
15  202001  ABC 1.0 2.0 1.0
16  202002  ABC 0.0 2.0 1.0
17  202003  ABC 2.0 4.0 3.0
18  202004  ABC 1.0 5.0 3.0
19  202005  ABC 1.0 6.0 4.0
20  202006  ABC 1.0 7.0 3.0
piterbarg
  • 8,089
  • 2
  • 6
  • 22