1

I have a dataframe as follows:

   Date        Group   Value   Duration
2018-01-01      A      20       30
2018-02-01      A      10       60
2018-01-01      B      15      180
2018-02-01      B      30      210
2018-03-01      B      25      238 
2018-01-01      C      10      235
2018-02-01      C      15      130

I want to use group_by dynamically i.e. do not wish to type the column names on which group_by would be applied. Specifically, I want to compute mean of each Group for last two months.

As we can see that not each Group's data is present in the above dataframe for all dates. So the tasks are as follows:

  1. Add a dummy row based on the date, in case data pertaining to Date = 2018-03-01not present for each Group (e.g. add row for A and C).

  2. Perform group_by to compute mean using last two month's Value and Duration.

So my approach is as follows:

For Task 1:

 s = pd.MultiIndex.from_product(df['Date'].unique(),df['Group'].unique()],names=['Date','Group'])
 df = df.set_index(['Date','Group']).reindex(s).reset_index().sort_values(['Group','Date']).ffill(axis=0) 

can we have a better method for achieving the 'add row' task? The reference is found here.

For Task 2:

def cond_grp_by(df,grp_by:str,cols_list:list,*args):
    df_grp = df.groupby(grp_by)[cols_list].transform(lambda x : x.tail(2).mean())
    return df_grp
df_cols = df.columns.tolist()
df = cond_grp_by(dealer_f_filt,'Group',df_cols)

Reference of the above approach is found here.

The above code is throwing IndexError : Column(s) ['index','Group','Date','Value','Duration'] already selected

The expected output is

Group    Value    Duration
  A       10         60 <---------   Since a row is added for 2018-03-01 with 
  B       27.5      224              same value as 2018-02-01,we are 
  C       15        130 <---------   computing mean for last two values
pythondumb
  • 1,187
  • 1
  • 15
  • 30

1 Answers1

2

Use GroupBy.agg instead transform if need output filled by aggregate values:

def cond_grp_by(df,grp_by:str,cols_list:list,*args):
    return df.groupby(grp_by)[cols_list].agg(lambda x : x.tail(2).mean()).reset_index()
df = cond_grp_by(df,'Group',df_cols)
print (df)
  Group  Value  Duration
0     A   10.0      60.0
1     B   27.5     224.0
2     C   15.0     130.0

If need last value per groups use GroupBy.last:

def cond_grp_by(df,grp_by:str,cols_list:list,*args):
    return df.groupby(grp_by)[cols_list].last().reset_index()
df = cond_grp_by(df,'Group',df_cols)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • jezrael: if I use `last` as transformation i.e. `lambda x : x.last()`, it is asking for `last() missing 1 required positional argument: 'offset'`. Why do I need to provide the same? – pythondumb Jun 03 '20 at 11:05
  • @pythondumb - If need last is necessary use `df.groupby(grp_by)[cols_list].last()`. Error means `last` [`GroupBy.last`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.last.html) is implemented only for groupby object, so cannot be used in `agg` (then is called [`Series.last`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.last.html) - it is different function, last value by datetimeindex) – jezrael Jun 03 '20 at 11:08
  • 1
    jazrael: can you please stab the first question? – pythondumb Jun 03 '20 at 12:37