I have data that looks like the sample below, and I'm trying to calculate the CRMSE (centered root mean squared error) by plant_name and year. Maybe I need an agg function or a lambda function to do this for each groupby key (plant_name, year). The data in the dataframe df3m1:

     plant_name  year  month  obsvals  modelvals  
0     ARIZONA I  2021      1     8.90       8.30  
1     ARIZONA I  2021      2     7.98       7.41  
2     CAETITE I  2021      1     9.10       7.78  
3     CAETITE I  2021      2     6.05       6.02  

The equation that I need to implement by plant_name and year looks like this:

crmse = df3m1.groupby(['plant_name','year'])((  (df3m1.obsvals - df3m1.obsvals.mean())  - 
(df3m1.modelvals - df3m1.modelvals.mean())  ) ** 2).mean() ** .5

Combining a groupby with a calculation like this is still a bit advanced for me. Thank you. The final dataframe would look like:

  plant_name   year   crmse
0 ARIZONA I    2021     ?
1 CAETITE I    2021     ?

I have tried things like this with groupby -

crmse = df3m1.groupby(['plant_name','year'])((  (df3m1.obsvals - 
df3m1.obsvals.mean())  - (df3m1.modelvals - df3m1.modelvals.mean())  ) 
** 2).mean() ** .5

but get errors like this:

TypeError: 'DataFrameGroupBy' object is not callable
user2100039

1 Answer

Using groupby is the right approach. Normally you would follow it with .agg, but .agg works on one column at a time, whereas computing the CRMSE needs two columns together (obsvals and modelvals). So use .apply instead: it passes each group's whole sub-dataframe to the function, and you pick out the columns you need there.

Code:

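import numpy as np
import pandas as pd

def crmse(x, y):
    # Center each series on its own mean, then take the RMS of the difference
    return np.sqrt(np.mean(np.square( (x - x.mean()) - (y - y.mean()) )))

def f(df):
    # Wrap the scalar in a Series so the result gets a column named 'crmse'
    return pd.Series(crmse(df['obsvals'], df['modelvals']), index=['crmse'])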

crmse_series = (
    df3m1
    .groupby(['plant_name', 'year'])
    .apply(f)
)

crmse_series 
                 crmse
plant_name year       
ARIZONA I  2021  0.015
CAETITE I  2021  0.645
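
If you want the flat layout shown in the question (plant_name, year and crmse as ordinary columns), one option is to return a plain scalar from the applied function and then call reset_index. A minimal sketch, assuming only the four sample rows from the question; the lambda and the name result are mine for illustration, and selecting the two value columns before .apply is just my choice to keep the grouping keys out of the function's input:

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question
df3m1 = pd.DataFrame({
    'plant_name': ['ARIZONA I', 'ARIZONA I', 'CAETITE I', 'CAETITE I'],
    'year': [2021, 2021, 2021, 2021],
    'month': [1, 2, 1, 2],
    'obsvals': [8.90, 7.98, 9.10, 6.05],
    'modelvals': [8.30, 7.41, 7.78, 6.02],
})

def crmse(x, y):
    # Same formula as above: RMS of the difference of the centered series
    return np.sqrt(np.mean(np.square((x - x.mean()) - (y - y.mean()))))

# Returning a scalar per group gives a Series indexed by (plant_name, year);
# reset_index(name=...) turns that index into regular columns.
result = (
    df3m1
    .groupby(['plant_name', 'year'])[['obsvals', 'modelvals']]
    .apply(lambda g: crmse(g['obsvals'], g['modelvals']))
    .reset_index(name='crmse')
)

print(result)
```

This should give one row per (plant_name, year) pair with the same crmse values as the table above.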

You can merge the result (a one-column dataframe, despite the name crmse_series) back into the original dataframe with merge.

df = df3m1.merge(crmse_series, on=['plant_name', 'year'])
df
  plant_name  year  month  obsvals  modelvals  crmse
0  ARIZONA I  2021      1     8.90       8.30  0.015
1  ARIZONA I  2021      2     7.98       7.41  0.015
2  CAETITE I  2021      1     9.10       7.78  0.645
3  CAETITE I  2021      2     6.05       6.02  0.645
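
As a possible alternative to apply followed by merge, the per-row crmse column can also be built with transform, which broadcasts each group's mean back onto the original rows. This is only a sketch on the same sample data; keys, centered_diff, sq_diff and with_crmse are names I made up for the example:

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question
df3m1 = pd.DataFrame({
    'plant_name': ['ARIZONA I', 'ARIZONA I', 'CAETITE I', 'CAETITE I'],
    'year': [2021, 2021, 2021, 2021],
    'month': [1, 2, 1, 2],
    'obsvals': [8.90, 7.98, 9.10, 6.05],
    'modelvals': [8.30, 7.41, 7.78, 6.02],
})

keys = ['plant_name', 'year']
grouped = df3m1.groupby(keys)

# Subtract each group's mean from its own rows, then difference the two series
centered_diff = (
    (df3m1['obsvals'] - grouped['obsvals'].transform('mean'))
    - (df3m1['modelvals'] - grouped['modelvals'].transform('mean'))
)

# Group-wise mean of the squared differences, broadcast to every row, then sqrt
mean_sq = (
    df3m1.assign(sq_diff=centered_diff ** 2)
    .groupby(keys)['sq_diff']
    .transform('mean')
)
with_crmse = df3m1.assign(crmse=np.sqrt(mean_sq))

print(with_crmse)
```

This stays vectorized and keeps the original row order, so no merge is needed. Whether it is clearer than the apply version is a matter of taste.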

Jun