
Based on the output dataframe produced by the following code:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])

# all your models
models = df.columns[df.columns.str.endswith('_values')]

# function to calculate mape
def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0)*100
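As a quick sanity check on toy numbers (not taken from the dataframe above): the `np.clip` floors the denominator at 1, so near-zero actuals don't blow the percentage up. The snippet repeats `mape` so it runs standalone:

```python
import numpy as np

def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0) * 100

# the 0.5 actual is clipped to 1 in the denominator, so its term is 0.5, not 1.0
y_true = np.array([2.0, 4.0, 0.5])
y_pred = np.array([1.0, 5.0, 1.0])
print(mape(y_true, y_pred))  # (0.5 + 0.25 + 0.5) / 3 * 100 ≈ 41.67
```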

errors = (df.groupby(pd.Grouper(freq='M'))
            .apply(lambda x: mape(x[models], x[['target']]))
         )
res = pd.merge_asof(df[['target']], errors, 
                             left_index=True, 
                             right_index=True,
                             direction='forward'
                            )
print(res)

Out:

              target    A_values    B_values    C_values    D_values   E_values
2013-02-26  1.281624   48.759348   77.023855  325.376455   74.422508  60.602101
2013-02-27  0.585713   48.759348   77.023855  325.376455   74.422508  60.602101
2013-02-28  9.638430   48.759348   77.023855  325.376455   74.422508  60.602101
2013-03-01  1.950960   98.909249  143.760594   90.051465  138.059241  93.461361
2013-03-02  0.690563   98.909249  143.760594   90.051465  138.059241  93.461361
             ...         ...         ...         ...         ...        ...
2013-05-22  5.554824  122.272490  139.420056  133.658101   62.368310  94.334362
2013-05-23  8.440801  122.272490  139.420056  133.658101   62.368310  94.334362
2013-05-24  0.968086  122.272490  139.420056  133.658101   62.368310  94.334362
2013-05-25  0.672555  122.272490  139.420056  133.658101   62.368310  94.334362
2013-05-26  5.273122  122.272490  139.420056  133.658101   62.368310  94.334362
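For reference, `merge_asof(..., direction='forward')` matches each daily row to the nearest *later* index on the right side, which is how every day in a month picks up that month's (month-end) error row. A minimal sketch with made-up numbers:

```python
import pandas as pd

left = pd.DataFrame({'target': [1.0, 2.0]},
                    index=pd.to_datetime(['2013-02-26', '2013-03-01']))
right = pd.DataFrame({'A_values': [48.76, 98.91]},
                     index=pd.to_datetime(['2013-02-28', '2013-03-31']))

# each left row is matched to the next right index at or after it
res = pd.merge_asof(left, right,
                    left_index=True, right_index=True,
                    direction='forward')
print(res)
```

Both 2013-02-26 and 2013-03-01 receive their own month-end row (2013-02-28 and 2013-03-31 respectively).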

How could I group by year-month and find the top N columns with the smallest values?

For example, if I set N=3, then the expected result will look like:

(screenshot of the expected result: for each month, only the N columns with the smallest MAPE are kept)

Thanks for your help in advance.

ah bon
  • How do you know which are the top N columns? By highest value of sum of the column, by highest average value of the column? By median? – Celius Stingher Nov 12 '21 at 02:50
  • Sorry for the follow-up, didn't know about this new 5 minutes editing thing. How do you know which are the top N columns? By highest value of sum of the column, by highest average value of the column? By median? Furthermore, what if your period 2013-5 columns A,B,C are largest by sum and for period 2013-6 columns B,C,D are the largest ones? I have created some code which might work, but it depends on how you want to deal with these scenarios. – Celius Stingher Nov 12 '21 at 03:00
  • Sorry, my mistake, I should find smallest columns as shown in the expected result since it's `mape` to measure errors. Please note, for each month and one column, the values are identical. So no need to sum up or calculate averages. – ah bon Nov 12 '21 at 03:29
  • Please note that my answer provides topk **largest** `mape` columns. If you want topk **smallest** columns, you need to do `sorted_args = ... – Quang Hoang Nov 12 '21 at 03:34

1 Answer


Here's an approach with argsort:

errors = (df.groupby(pd.Grouper(freq='M'))
            .apply(lambda x: mape(x[models], x[['target']]))
         )

k = 2            # your k here  


# filter the k models with the smallest mape per month
# (argsort of argsort converts sort indices into ranks)
sorted_args = np.argsort(np.argsort(errors, axis=1), axis=1) < k

res = pd.merge_asof(df[['target']], sorted_args, 
                             left_index=True, 
                             right_index=True,
                             direction='forward'
                            )

topk = df[models].where(res[models])
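A note on the masking step: a single `np.argsort(...) < k` compares sort *indices*, not ranks, so applying `argsort` twice is needed to get each column's rank within its row. A toy check:

```python
import numpy as np

a = np.array([[3.0, 1.0, 2.0]])
ranks = np.argsort(np.argsort(a, axis=1), axis=1)  # rank of each entry within its row
mask = ranks < 2                                   # marks the 2 smallest values per row
print(mask)  # [[False  True  True]] -> keeps 1.0 and 2.0
```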

Then topk looks like:

            A_values  B_values  C_values  D_values  E_values
2013-02-26  6.059783       NaN       NaN  3.126731       NaN
2013-02-27  1.789931       NaN       NaN  7.843101       NaN
2013-02-28  9.623960       NaN       NaN  5.612724       NaN
2013-03-01       NaN       NaN  4.521452       NaN  5.693051
2013-03-02       NaN       NaN  5.178144       NaN  7.322250
...              ...       ...       ...       ...       ...
2013-05-22       NaN       NaN  0.427136       NaN  6.803052
2013-05-23       NaN       NaN  2.225667       NaN  2.756443
2013-05-24       NaN       NaN  7.212742       NaN  0.430184
2013-05-25       NaN       NaN  5.384490       NaN  5.461017
2013-05-26       NaN       NaN  9.823048       NaN  6.312104
Quang Hoang
  • Sorry, my mistake, I should find top N smallest columns as shown in the expected result since it's mape to measure errors. – ah bon Nov 12 '21 at 03:31
  • @ahbon please see updated answer for the `k` columns with smallest `mape`. – Quang Hoang Nov 12 '21 at 03:36
  • @ahbon also see [this answer](https://stackoverflow.com/questions/44558215/python-justifying-numpy-array/44559180#44559180) to consolidate your top columns into actual `k` columns instead of `n` columns with NaN. – Quang Hoang Nov 12 '21 at 03:40
  • If I want to remove NaN columns for each year-month and rename `topk`'s column names with `best_model1` and `best_model2` using `k`, then join to the original `df`? The final output's column names will be `target, A_values, B_values, C_values, D_values, E_values, best_model1, best_model2`. How could I do that? Please do not delete the original code, which is also useful, but add additional code to the end. – ah bon Nov 12 '21 at 03:45
  • That's what you used the linked answer for. Please note that answer retains the order of the columns, not the order of the mape values. – Quang Hoang Nov 12 '21 at 03:56
  • `df2 = topk.add_suffix('_mape').groupby(pd.Grouper(freq='M')).dropna(axis=1, how='all'); df.join(df2)`? – ah bon Nov 12 '21 at 04:00
  • No, just `df.join(topk.add_suffix('_mape'))`? – Quang Hoang Nov 12 '21 at 04:02
  • Yes, it's close; I also want to remove columns with all NaNs for each month. – ah bon Nov 12 '21 at 04:04
  • I think there is a problem with `np.argsort` to find lowest values, I post a new question: https://stackoverflow.com/questions/69950691/np-argsort-and-pd-nsmallest-give-different-results – ah bon Nov 13 '21 at 01:38
  • As I run `df[models].where(res[models])` in real data, I get an error; `ValueError: Boolean array expected for the condition, not object`. – ah bon Nov 15 '21 at 01:44
  • @ahbon maybe replacing `res[models]` with `res.reindex(models, fill_value=False)` helps. – Quang Hoang Nov 15 '21 at 02:54
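Pulling the comment thread together, here is a hedged sketch of the consolidation step: it left-justifies each row's non-NaN entries (the idea behind the linked "justifying numpy array" answer) and relabels them `best_model1`, `best_model2`. The column names and the toy `topk` frame are assumptions for illustration; as noted above, this keeps the column order, not the order of the mape values.

```python
import numpy as np
import pandas as pd

# toy stand-in for `topk`: per row, exactly k=2 columns are non-NaN
topk = pd.DataFrame(
    {'A_values': [6.06, np.nan], 'C_values': [np.nan, 4.52],
     'D_values': [3.13, np.nan], 'E_values': [np.nan, 5.69]},
    index=pd.to_datetime(['2013-02-26', '2013-03-01']))

k = 2
vals = topk.to_numpy()
mask = ~np.isnan(vals)

# push each row's non-NaN values to the left, dropping the NaN slots
justified = np.full((len(topk), k), np.nan)
for i, row in enumerate(vals):
    justified[i, :mask[i].sum()] = row[mask[i]]

best = pd.DataFrame(justified, index=topk.index,
                    columns=[f'best_model{j + 1}' for j in range(k)])
print(best)
# `df.join(best)` would then append these columns to the original frame
```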