0

I have dataset as below:

import pandas as pd 

data = pd.DataFrame({
        'ID':  ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
        'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25', 
                         '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
        'Payment_Term': [7,8,3,6,4,7,8,5,3,6],
        'Payment_Date': ['2020-07-05', '2020-07-05','2020-07-03', '2020-07-21', '2020-07-31', 
                         '2020-08-15', '2020-08-22', '2020-06-16', '2020-06-23', '2020-07-05'],
        'Due_Date': ['2020-07-03', '2020-07-07', '2020-07-03', '2020-07-20', '2020-07-29', 
                         '2020-08-06', '2020-08-10', '2020-05-18', '2020-06-23', '2020-07-04'],
        'Delay': [2,-2,0,1,2,9,12,29,0,1],
        'Difference_Date': [0,3,1,14,11,5,3,0,38,8],
        })
data

I want to add another column for Average_Diff that calculates the average of rows before (excluding the current row) and grouping it by ID.

I have tried df['Average_Diff'] = df.Difference_Date.rolling(window=2).mean() and df['Average_Diff'] = df.Difference_Date.expanding().mean() but it is not working. My expected output is:

ID     Invoice_Date Payment_Term  Payment_Date       Due_Date  Delay    Difference_Date  Average_Diff

27459    2020-06-26            7    2020-07-05     2020-07-03      2                  0             0
27459    2020-06-29            8    2020-07-05     2020-07-07     -2                  3             0
27459    2020-06-30            3    2020-07-03     2020-07-03      0                  1           1.5
27459    2020-07-14            6    2020-07-21     2020-07-20      1                  14         1.33  
27459    2020-07-25            4    2020-07-31     2020-07-29      2                  11          4.5
27459    2020-07-30            7    2020-08-15     2020-08-06      9                  5           5.8
27459    2020-08-02            8    2020-08-22     2020-08-10      12                 3          5.67
48002    2020-05-13            5    2020-06-16     2020-05-18      29                 0             0
48002    2020-06-20            3    2020-06-23     2020-06-23      0                  38            0
48002    2020-06-28            6    2020-07-05     2020-07-04      1                  8            19
  
mojek
  • 195
  • 1
  • 9
  • 1
    Use `df['Average_Diff'] = df.groupby('ID').Difference_Date.apply(lambda x: x.expanding().mean().shift()).fillna(0)` – jezrael Nov 16 '20 at 07:37
  • does this help: https://stackoverflow.com/questions/50593797/pandas-expanding-mean-with-group-by-and-before-current-row-date? – Pygirl Nov 16 '20 at 07:37
  • 1
    You can also use: `data['Average_Diff'] = ((data.groupby('ID').Difference_Date.cumsum() - data.Difference_Date)/ data.groupby('ID').cumcount()).fillna(0)` – Mayank Porwal Nov 16 '20 at 07:40

0 Answers0