
Given a dataframe as follows:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 4)), index=dates, columns=['A_values', 'B_values', 'C_values', 'target'])

Out:

            A_values  B_values  C_values    target
2013-02-26  6.059783  7.333694  1.389472  3.126731
2013-02-27  9.972433  1.281624  1.789931  7.529254
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-01  9.623960  6.165574  0.866300  5.612724
2013-03-02  6.165247  9.638430  5.743043  3.711608
             ...       ...       ...       ...
2013-05-22  0.589729  6.479978  3.531450  6.872059
2013-05-23  6.279065  3.837670  8.853146  8.209883
2013-05-24  5.533017  5.241127  1.388056  5.355926
2013-05-25  1.596038  4.665995  2.406251  1.971875
2013-05-26  3.269001  1.787529  6.659690  7.545569

I need to resample by month and then calculate the MAPE between each of A_values, B_values, C_values and the target column.

My final objective is to find, for each month, the two columns among A_values, B_values, C_values that have the smaller MAPE values.

The expected result is a new dataframe built from the original one, keeping the values of the two columns with smaller MAPE and resetting the values of the remaining column to NaN.

# function to calculate mape
def mape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / np.maximum(np.ones(len(y_true)), np.abs(y_true))))*100
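
As a quick sanity check on this function (repeated below so the snippet is self-contained), the `np.maximum` in the denominator clips small true values at 1 so they can't blow up the score:

```python
import numpy as np

# same function as above: MAPE with the denominator clipped at 1
def mape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / np.maximum(np.ones(len(y_true)), np.abs(y_true)))) * 100

# |0.5 - 1| / max(1, 0.5) = 0.5 and |2 - 1| / max(1, 2) = 0.5, so the mean is 50%
print(mape([0.5, 2], [1, 1]))  # → 50.0
```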

# Calculate per-month MAPE between each value column and the target.
# Note: df.resample('M').apply(mape(df['A_values'], df['target'])) would call
# mape eagerly on the whole frame; pass a function per group instead:
monthly = df.groupby(pd.Grouper(freq='M'))
A_error = monthly.apply(lambda g: mape(g['A_values'], g['target']))
B_error = monthly.apply(lambda g: mape(g['B_values'], g['target']))
C_error = monthly.apply(lambda g: mape(g['C_values'], g['target']))

At this step, we compare the errors of A_values, B_values and C_values for each year-month. E.g., for Feb 2013, we find that the A_values and B_values columns have the smaller MAPE against the target column, so I set the remaining column C_values to NaN for that month.
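
Done by hand for a single month, that masking step is just partial-date indexing (shown only to illustrate the target behavior; which column to blank each month still has to come from the MAPE comparison, and assuming C_values is the worst model in Feb 2013):

```python
import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 4)),
                  index=dates,
                  columns=['A_values', 'B_values', 'C_values', 'target'])

# blank the hypothetically-worst column for Feb 2013 only
df.loc['2013-02', 'C_values'] = np.nan
```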

The final expected result could look like this (not the exact result, only showing the format):

            A_values  B_values  C_values    target
2013-02-26  6.059783  7.333694       NaN  3.126731
2013-02-27  9.972433  1.281624       NaN  7.529254
2013-02-28  6.621605  7.843101       NaN  0.585713
2013-03-01  9.623960       NaN  0.866300  5.612724
2013-03-02  6.165247       NaN  5.743043  3.711608
             ...       ...       ...       ...
2013-05-22       NaN  6.479978  3.531450  6.872059
2013-05-23       NaN  3.837670  8.853146  8.209883
2013-05-24       NaN  5.241127  1.388056  5.355926
2013-05-25       NaN  4.665995  2.406251  1.971875
2013-05-26       NaN  1.787529  6.659690  7.545569

How could I achieve that in Pandas? Thanks for your help in advance.

EDIT: if we compute it over each whole column, with the code below, we find that C_values and A_values have the smaller MAPE, so I would replace the values of B_values with NaN.

mape(df['A_values'], df['target'])
Out[65]: 1.1619069176493515
mape(df['B_values'], df['target'])
Out[66]: 1.3477232830950627
mape(df['C_values'], df['target'])
Out[67]: 1.045038328819569

But since I hope to group by year-month (or resample by month) and then compare them, I'm asking this question for help.

Reference link:

Pandas Resample Apply Custom Function?

  • I copy the function from the answer in this link: https://stackoverflow.com/questions/65216794/importerror-when-importing-metric-from-sklearn – ah bon Nov 11 '21 at 16:58
  • Maybe multiplying by 100 or not doesn't matter, since my objective is to find the two columns with smaller `mape`s. – ah bon Nov 11 '21 at 17:03
  • If `mape` is not suitable, how about `mse` or `mae`? – ah bon Nov 11 '21 at 17:05
  • You could consider `A_values`, `B_values`, `C_values` as prediction results from models `A`, `B` and `C`; I hope to find the top 2 most accurate models for each month, measuring their accuracy by `mape` or `mae`, etc. – ah bon Nov 11 '21 at 17:10
  • Yeah, I get it now. See updated answer. – Quang Hoang Nov 11 '21 at 17:20

1 Answer


You can rewrite your mape function so it works on a pandas DataFrame, then groupby().apply, and finally merge_asof to merge back:

# all your models
models = df.columns[:-1]
# models = df.columns[df.columns.str.endswith('_values')]

# function to calculate mape
def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0)*100

# per-month MAPE of each model column against target
errors = (df.groupby(pd.Grouper(freq='M'))
            .apply(lambda x: mape(x[models], x[['target']]))
         )

# broadcast each month's errors back onto that month's daily rows
aligned_mape = pd.merge_asof(df[['target']], errors, 
                             left_index=True, 
                             right_index=True,
                             direction='forward'
                            )

# True for the column with the largest (worst) MAPE in each row's month
mask = (pd.get_dummies(aligned_mape[models].idxmax(axis=1))
          .reindex(columns=models, fill_value=0).astype(bool)
       )

df[models] = df[models].mask(mask)

Output:

            A_values  B_values  C_values    target
2013-02-26  6.059783       NaN  1.389472  3.126731
2013-02-27  9.972433       NaN  1.789931  7.529254
2013-02-28  6.621605       NaN  0.968944  0.585713
2013-03-01  9.623960       NaN  0.866300  5.612724
2013-03-02  6.165247       NaN  5.743043  3.711608
...              ...       ...       ...       ...
2013-05-22       NaN  6.479978  3.531450  6.872059
2013-05-23       NaN  3.837670  8.853146  8.209883
2013-05-24       NaN  5.241127  1.388056  5.355926
2013-05-25       NaN  4.665995  2.406251  1.971875
2013-05-26       NaN  1.787529  6.659690  7.545569
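
The `idxmax`/`get_dummies` mask above drops exactly one (the worst) model per month. A rank-based mask is a sketch of the same idea that generalizes to keeping the top `N` of many models; this variant is an assumption of mine, not part of the original answer, and it broadcasts the monthly errors with `to_period` indexing instead of `merge_asof`:

```python
import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 4)),
                  index=dates,
                  columns=['A_values', 'B_values', 'C_values', 'target'])
models = df.columns[:-1]

def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0) * 100

# one MAPE row per month, one column per model
month = df.index.to_period('M')
errors = df.groupby(month).apply(lambda g: mape(g[models], g[['target']]))

# broadcast each month's errors back onto its daily rows
aligned = errors.loc[month].set_axis(df.index)

keep = 2  # keep the `keep` best models per month (set keep=5 for 20 models)
mask = aligned.rank(axis=1, method='first') > keep  # True = ranked worse than `keep`
out = df.copy()
out[models] = out[models].mask(mask)
```

With 3 models and `keep = 2` this blanks exactly one column per row, matching the answer's output format, and changing `keep` answers the top-5-of-20 case from the comments.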
  • Many thanks for your kind help, for the last row of your code: we can also use `df.filter(regex='_values$').mask(mask)`, right? – ah bon Nov 11 '21 at 17:31
  • @ahbon certainly, but I'm not sure if you can use `df.filter(...)` on the left hand side. – Quang Hoang Nov 11 '21 at 17:32
  • Btw, it seems by filtering the top N smaller `mape` or `mse` models' prediction results, we can also select some `bad` results; any other practical threshold or method to select multiple good results for each month? – ah bon Nov 11 '21 at 17:37
  • I don't know tbh. It seems that you are doing some sort of ensemble modelling. Practically, I believe aggregating models' outputs (e.g. mean/weighted mean) provides better predictions, given that your models are structurally different, e.g. random forests and DNN, not two random forests. You can try taking `df[models].mask(mask).mean(axis=1)` as the final prediction to see if it's better. Other than that, I don't have any clues. – Quang Hoang Nov 11 '21 at 17:41
  • Sorry, one more question, if I have 20 models predictions, and I want to filter top 5 for each month, how could I modify mask? – ah bon Nov 12 '21 at 00:59