
I have this data frame:

ID      Date  X  123_Var  456_Var  789_Var
 A  16-07-19  3      777      250      810
 A  17-07-19  9      637      121      529
 A  20-07-19  2      295      272      490
 A  21-07-19  3      778      600      544
 A  22-07-19  6      741      792      907
 A  25-07-19  6      435      416      820
 A  26-07-19  8      590      455      342
 A  27-07-19  6      763      476      753
 A  02-08-19  6      717      211      454
 A  03-08-19  6      152      442      475
 A  05-08-19  6      564      340      302
 A  07-08-19  6      105      929      633
 A  08-08-19  6      948      366      586
 B  07-08-19  4      509      690      406
 B  08-08-19  2      413      725      414
 B  12-08-19  2      170      702      912
 B  13-08-19  3      851      616      477
 B  14-08-19  9      475      447      555
 B  15-08-19  1      412      403      708
 B  17-08-19  2      299      537      321
 B  18-08-19  4      310      119      125
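
For convenience, the same frame in code form (just a sketch of the values above; Date is kept as text and parsed later):

import pandas as pd

df = pd.DataFrame({
    'ID':      ['A']*13 + ['B']*8,
    'Date':    ['16-07-19', '17-07-19', '20-07-19', '21-07-19', '22-07-19',
                '25-07-19', '26-07-19', '27-07-19', '02-08-19', '03-08-19',
                '05-08-19', '07-08-19', '08-08-19', '07-08-19', '08-08-19',
                '12-08-19', '13-08-19', '14-08-19', '15-08-19', '17-08-19',
                '18-08-19'],
    'X':       [3, 9, 2, 3, 6, 6, 8, 6, 6, 6, 6, 6, 6, 4, 2, 2, 3, 9, 1, 2, 4],
    '123_Var': [777, 637, 295, 778, 741, 435, 590, 763, 717, 152, 564, 105, 948,
                509, 413, 170, 851, 475, 412, 299, 310],
    '456_Var': [250, 121, 272, 600, 792, 416, 455, 476, 211, 442, 340, 929, 366,
                690, 725, 702, 616, 447, 403, 537, 119],
    '789_Var': [810, 529, 490, 544, 907, 820, 342, 753, 454, 475, 302, 633, 586,
                406, 414, 912, 477, 555, 708, 321, 125],
})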

I want to show the mean value of the last n days (based on the Date column), excluding the current day's value. For example, with n = 4, the row A 20-07-19 should get 123_Var_4 = mean(777, 637) = 707, since 16-07-19 and 17-07-19 are the only earlier rows within the previous 4 days.

I'm using this code (what should I do to fix this?):

n = 4

cols = list(df.filter(regex='Var').columns)

df = df.set_index('Date')

df[cols] = (df.groupby('ID').rolling(window=f'{n}D')[cols].mean()
         .reset_index(0,drop=True).add_suffix(f'_{n}'))

df.reset_index(inplace=True)

Expected result:

ID      Date  X  123_Var  456_Var  789_Var   123_Var_4   456_Var_4  789_Var_4
 A  16-07-19  3      777      250      810         NaN         NaN        NaN
 A  17-07-19  9      637      121      529  777.000000  250.000000      810.0
 A  20-07-19  2      295      272      490  707.000000  185.500000      669.5
 A  21-07-19  3      778      600      544  466.000000  196.500000      509.5
 A  22-07-19  6      741      792      907  536.500000  436.000000      517.0
 A  25-07-19  6      435      416      820  759.500000  696.000000      725.5
 A  26-07-19  8      590      455      342  588.000000  604.000000      863.5
 A  27-07-19  6      763      476      753  512.500000  435.500000      581.0
 A  02-08-19  6      717      211      454         NaN         NaN        NaN
 A  03-08-19  6      152      442      475  717.000000  211.000000      454.0
 A  05-08-19  6      564      340      302  434.500000  326.500000      464.5
 A  07-08-19  6      105      929      633  358.000000  391.000000      388.5
 A  08-08-19  6      948      366      586  334.500000  634.500000      467.5
 B  07-08-19  4      509      690      406         NaN         NaN        NaN
 B  08-08-19  2      413      725      414  509.000000  690.000000      406.0
 B  12-08-19  2      170      702      912  413.000000  725.000000      414.0
 B  13-08-19  3      851      616      477  291.500000  713.500000      663.0
 B  14-08-19  9      475      447      555  510.500000  659.000000      694.5
 B  15-08-19  1      412      403      708  498.666667  588.333333      648.0
 B  17-08-19  2      299      537      321  579.333333  488.666667      580.0
 B  18-08-19  4      310      119      125  395.333333  462.333333      528.0
Note: the dataframe has changed.
– qwerty

1 Answer


I changed unutbu's solution to work with rolling: the rolling count multiplied by the rolling mean gives the window sum including the current day; subtracting the current day's value and dividing by count - 1 leaves the mean of the previous days only:

import pandas as pd

# parse dates as day-first (e.g. 16-07-19 -> 2019-07-16)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

n = 5
cols = df.filter(regex='Var').columns
df = df.set_index('Date')
# copy with an (ID, Date) MultiIndex so it aligns with the groupby-rolling output
df_ = df.set_index('ID', append=True).swaplevel(1, 0)
# rolling count and mean per ID over an n-day window (current day included)
df1 = df.groupby('ID').rolling(window=f'{n}D')[cols].count()
df2 = df.groupby('ID').rolling(window=f'{n}D')[cols].mean()
# (count * mean - current value) / (count - 1) = mean of the previous days only
df3 = (df1.mul(df2)
          .sub(df_[cols])
          .div(df1[cols].sub(1))
          .add_suffix(f'_{n}'))
df4 = df_.join(df3)

print (df4)
               X  123_Var  456_Var  789_Var   123_Var_5   456_Var_5  789_Var_5
ID Date                                                                       
A  2019-07-16  3      777      250      810         NaN         NaN        NaN
   2019-07-17  9      637      121      529  777.000000  250.000000      810.0
   2019-07-20  2      295      272      490  707.000000  185.500000      669.5
   2019-07-21  3      778      600      544  466.000000  196.500000      509.5
   2019-07-22  6      741      792      907  536.500000  436.000000      517.0
   2019-07-25  6      435      416      820  759.500000  696.000000      725.5
   2019-07-26  8      590      455      342  588.000000  604.000000      863.5
   2019-07-27  6      763      476      753  512.500000  435.500000      581.0
   2019-08-02  6      717      211      454         NaN         NaN        NaN
   2019-08-03  6      152      442      475  717.000000  211.000000      454.0
   2019-08-05  6      564      340      302  434.500000  326.500000      464.5
   2019-08-07  6      105      929      633  358.000000  391.000000      388.5
   2019-08-08  6      948      366      586  334.500000  634.500000      467.5
B  2019-08-07  4      509      690      406         NaN         NaN        NaN
   2019-08-08  2      413      725      414  509.000000  690.000000      406.0
   2019-08-12  2      170      702      912  413.000000  725.000000      414.0
   2019-08-13  3      851      616      477  170.000000  702.000000      912.0
   2019-08-14  9      475      447      555  510.500000  659.000000      694.5
   2019-08-15  1      412      403      708  498.666667  588.333333      648.0
   2019-08-17  2      299      537      321  579.333333  488.666667      580.0
   2019-08-18  4      310      119      125  395.333333  462.333333      528.0
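
Depending on the pandas version, there may also be a shorter route (this is only a sketch, not what produced the output above): time-based rolling windows accept `closed='left'`, which should drop the current row from the window, so `window='4D'` with `closed='left'` would cover just the previous 4 calendar days, assuming at most one row per `(ID, Date)`:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
n = 4
cols = df.filter(regex='Var').columns

# closed='left' keeps [t - nD, t), i.e. the previous n calendar days,
# excluding the current row itself
prev = (df.set_index('Date')
          .groupby('ID')
          .rolling(window=f'{n}D', closed='left')[cols]
          .mean()
          .add_suffix(f'_{n}'))

res = df.set_index(['ID', 'Date']).join(prev)
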
– jezrael
  • I think we need `n=5` to cover the previous 4 days. I've changed it to `n=5` and now it corresponds to the expected result. – qwerty Feb 03 '20 at 14:21
  • @qwerty - Yes, I was a bit surprised that everything worked so nicely and only the last 4 rows were different. I hope it is correct now. – jezrael Feb 03 '20 at 14:27
  • I'm trying to apply that on my data (a bit different), and get this error message: `Index._join_level on non-unique index is not implemented`. Do you know what it means? What is wrong? Thank you. – qwerty Feb 03 '20 at 14:29
  • @qwerty - I think so, yes; it means there are duplicate datetimes within a group. The first step should be to remove the duplicate rows with `df = df.drop_duplicates(['ID','Date'])`, or to aggregate them, e.g. `df = df.groupby(['ID','Date'], as_index=False).mean()` or `df = df.groupby(['ID','Date'], as_index=False).sum()` (see the sketch after these comments). – jezrael Feb 03 '20 at 14:32
  • It's because of the same dates (associated with different IDs). Check the updated dataframe; I've changed it a bit. – qwerty Feb 03 '20 at 14:42
  • @qwerty - The problem was that `df` has only a `DatetimeIndex`, while `df1` and `df2` have a `MultiIndex`, so `df_` is created with a `MultiIndex` to align the datetimes correctly. – jezrael Feb 03 '20 at 14:54
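
Following the comment about duplicate `(ID, Date)` pairs, here is a small sketch (with made-up duplicate rows, not from the question) of the two clean-up options mentioned above:

import pandas as pd

# hypothetical frame with a duplicated (ID, Date) pair, which would trigger
# the "non-unique index" error in the join above
dup = pd.DataFrame({
    'ID':      ['A', 'A', 'A'],
    'Date':    ['16-07-19', '16-07-19', '17-07-19'],
    '123_Var': [777, 779, 637],
})
dup['Date'] = pd.to_datetime(dup['Date'], dayfirst=True)

# either drop exact (ID, Date) duplicates, keeping the first row ...
dedup = dup.drop_duplicates(['ID', 'Date'])

# ... or aggregate them (e.g. by mean or sum) before running the rolling logic
agg = dup.groupby(['ID', 'Date'], as_index=False).mean()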