-1

I have a dataset:

import pandas as pd 

df = pd.DataFrame({
        'ID':  ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
        'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25', 
                         '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
        'Difference_Date': [0,3,1,14,11,5,3,0,38,8],
        })
df

I need to add another column that is the average of rolling 30 days period. I tried using rolling but it gives me error window must be an integer. Since this is customer-based data, it need to be groupby ID as well.

My expected output is:

    ID      Invoice_Date    Difference_Date   Average
0   27459   2020-06-26      0                 0.00
1   27459   2020-06-29      3                 1.50
2   27459   2020-06-30      1                 1.33
3   27459   2020-07-14      14                4.50
4   27459   2020-07-25      11                5.80
5   27459   2020-07-30      5                 10.00
6   27459   2020-08-02      3                 8.25
7   48002   2020-05-13      0                 0.00
8   48002   2020-06-20      38                38.00
9   48002   2020-06-28      8                 23.00

Is there any efficient workaround for calculating average of rolling 30 days?

mojek
  • 195
  • 1
  • 9
  • 1
    Does this answer your question? [Moving Average Pandas](https://stackoverflow.com/questions/40060842/moving-average-pandas) – code11 Dec 08 '20 at 14:32
  • Does this answer your question? [Rolling Average in Pandas](https://stackoverflow.com/questions/54321463/rolling-average-in-pandas) – Joe Ferndz Dec 08 '20 at 23:03

1 Answers1

3

This is because pandas needs a DatetimeIndex to do df.rolling('30D'):

import pandas as pd 

df = pd.DataFrame({
        'ID':  ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
        'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25', 
                         '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
        'Difference_Date': [0,3,1,14,11,5,3,0,38,8],
        })
df.index = pd.DatetimeIndex(df['Invoice_Date'])
df = df.sort_index()
df.rolling('30D')

Z Li
  • 4,133
  • 1
  • 4
  • 19
  • what if I need it to be sorted based on ID as well? – mojek Dec 08 '20 at 14:42
  • Calculate whatever you need using the rolling index and then sort by ID? Could you clarify why you need it to be sorted by ID? – Z Li Dec 08 '20 at 14:49
  • because the data is customer based data. so the rolling 30 days should be calculated based on each customer – mojek Dec 08 '20 at 14:57
  • This sounds like a groupby: try `df.groupby('ID').rolling('30D')` – Z Li Dec 08 '20 at 15:00