I have a dataframe as follows:
Date Group Value Duration
2018-01-01 A 20 30
2018-02-01 A 10 60
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238
2018-01-01 C 10 235
2018-02-01 C 15 130
I want to use group_by
dynamically i.e. do not wish to type the column names on which group_by
would be applied. Specifically, I want to compute mean
of each Group for last two months.
As we can see that not each Group's data is present in the above dataframe for all dates. So the tasks are as follows:
Add a dummy row based on the date, in case data pertaining to
Date = 2018-03-01
not present for each Group (e.g. add row for A and C).Perform
group_by
to compute mean using last two month'sValue
andDuration
.
So my approach is as follows:
For Task 1:
s = pd.MultiIndex.from_product(df['Date'].unique(),df['Group'].unique()],names=['Date','Group'])
df = df.set_index(['Date','Group']).reindex(s).reset_index().sort_values(['Group','Date']).ffill(axis=0)
can we have a better method for achieving the 'add row' task? The reference is found here.
For Task 2:
def cond_grp_by(df,grp_by:str,cols_list:list,*args):
df_grp = df.groupby(grp_by)[cols_list].transform(lambda x : x.tail(2).mean())
return df_grp
df_cols = df.columns.tolist()
df = cond_grp_by(dealer_f_filt,'Group',df_cols)
Reference of the above approach is found here.
The above code is throwing IndexError : Column(s) ['index','Group','Date','Value','Duration'] already selected
The expected output is
Group Value Duration
A 10 60 <--------- Since a row is added for 2018-03-01 with
B 27.5 224 same value as 2018-02-01,we are
C 15 130 <--------- computing mean for last two values