
Based on the output dataframe from my previous question, reproduced by the code below:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])

models = df.columns[df.columns.str.endswith('_values')]

# Function to calculate MAPE (in %); the clip keeps the denominator >= 1 to avoid
# exploding values when y_true is close to zero. Applied column-wise, it returns
# one value per column of y_true.
def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0) * 100

# Per-model MAPE against the target, computed for each calendar month
errors = (df.groupby(pd.Grouper(freq='M'))
            .apply(lambda x: mape(x[models], x[['target']]))
         )

k = 2
n = len(models)

# Mask intended to flag, per month, the k models with the lowest error
sorted_args = np.argsort(errors, axis=1) < k

# Map each daily row onto the mask of its month
# (direction='forward' matches every day to the next available month-end row)
res = pd.merge_asof(df[['target']], sorted_args,
                    left_index=True,
                    right_index=True,
                    direction='forward')

# Keep model values only where the month's mask is True, NaN elsewhere
topk = df[models].where(res[models])

# Attach the selected values as new *_mape columns
df = df.join(topk.add_suffix('_mape'))

df = df[['target', 'A_values_mape', 'B_values_mape', 'C_values_mape', 'D_values_mape',
       'E_values_mape']]
df

Out:

              target  A_values_mape  ...  D_values_mape  E_values_mape
2013-02-26  1.281624       6.059783  ...       3.126731            NaN
2013-02-27  0.585713       1.789931  ...       7.843101            NaN
2013-02-28  9.638430       9.623960  ...       5.612724            NaN
2013-03-01  1.950960            NaN  ...            NaN       5.693051
2013-03-02  0.690563            NaN  ...            NaN       7.322250
             ...            ...  ...            ...            ...
2013-05-22  5.554824            NaN  ...            NaN       6.803052
2013-05-23  8.440801            NaN  ...            NaN       2.756443
2013-05-24  0.968086            NaN  ...            NaN       0.430184
2013-05-25  0.672555            NaN  ...            NaN       5.461017
2013-05-26  5.273122            NaN  ...            NaN       6.312104

How could I group by year-month, drop the columns that are entirely NaN within each group, and then rename the remaining columns to, e.g., top_1, top_2, ..., top_k?

The final expected result, if k=2, should keep the daily index and contain the target column plus columns top_1 and top_2 (see the output in the EDIT below).

Pseudocode:

df2 = df.filter(regex='_mape$').groupby(pd.Grouper(freq='M')).dropna(axis=1, how='all')
df2.columns = ['top_1', 'top_2', ..., 'top_k']
df.join(df2)
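
For a single month taken in isolation, the dropping step is straightforward, for example:

# Within one month's slice, dropna(axis=1, how='all') keeps only the model
# columns that were actually selected for that month (the rest are all NaN there)
feb = df.loc['2013-02'].filter(regex='_mape$')
print(feb.dropna(axis=1, how='all'))   # only A_values_mape and D_values_mape survive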

As @Quang Hoang commented in the last post, we might be able to use justify_nd to achieve that, but I don't know how. Thanks for your help in advance.

EDIT:

dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])

models = df.columns[df.columns.str.endswith('_values')]

k = 2
n = len(models)

def grpProc(grp):
    err = mape(grp[models], grp[['target']])
    # sort_args = np.argsort(err) < k
    # cols = models[sort_args]
    cols = err.nsmallest(k).index
    out_cols = [f'top_{i+1}' for i in range(k)]
    rv = grp.loc[:, cols]
    rv.columns = out_cols
    return rv

wrk = df.groupby(pd.Grouper(freq='M')).apply(grpProc)

res = df[['target']].join(wrk)
print(res)

Out:

              target     top_1     top_2
2013-02-26  1.281624  6.059783  9.972433
2013-02-27  0.585713  1.789931  0.968944
2013-02-28  9.638430  9.623960  6.165247
2013-03-01  1.950960  4.521452  5.693051
2013-03-02  0.690563  5.178144  7.322250
             ...       ...       ...
2013-05-22  5.554824  3.864723  6.803052
2013-05-23  8.440801  5.140268  2.756443
2013-05-24  0.968086  5.890717  0.430184
2013-05-25  0.672555  1.610210  5.461017
2013-05-26  5.273122  6.893207  6.312104
  • You wrote that you wanted to *group by year-month*. So why is your expected result **not** grouped by year-month? It contains individual rows for each day. – Valdi_Bo Nov 12 '21 at 06:53
  • The groupby is only for dropping the all-`NaN` columns within each year-month; the final data is still on a daily basis. – ah bon Nov 12 '21 at 07:19

1 Answer


Actually, what you need is, for each group (by year / month):

  • compute errors locally for the current group,
  • find the k "wanted" columns (calling argsort) and take the indicated columns from models,
  • take the indicated columns from the current group and rename them to top_…,
  • return what you generated so far.

To do it, define a "group processing" function:

def grpProc(grp):
    # MAPE of each model column against the target, within this group (month)
    err = mape(grp[models], grp[['target']])
    # pick the k model columns considered best (see the Edit below for a caveat)
    sort_args = np.argsort(err) < k
    cols = models[sort_args]
    out_cols = [f'top_{i+1}' for i in range(k)]
    # keep only those columns and rename them top_1 .. top_k
    rv = grp.loc[:, cols]
    rv.columns = out_cols
    return rv

Then, to generate top_… columns alone, apply this function to each group:

wrk = df.groupby(pd.Grouper(freq='M')).apply(grpProc)

And finally generate the expected result joining target column with wrk:

result = df[['target']].join(wrk)

The first 15 rows, based on your source data, are:

              target     top_1     top_2
2013-02-26  1.281624  6.059783  3.126731
2013-02-27  0.585713  1.789931  7.843101
2013-02-28  9.638430  9.623960  5.612724
2013-03-01  1.950960  4.521452  5.693051
2013-03-02  0.690563  5.178144  7.322250
2013-03-03  6.177010  8.280144  6.174890
2013-03-04  1.263177  5.896541  4.422322
2013-03-05  5.888856  9.159396  8.906554
2013-03-06  2.013227  8.237912  3.075435
2013-03-07  8.482991  1.546148  6.476141
2013-03-08  7.986413  3.322442  4.738473
2013-03-09  5.944385  7.769769  0.631033
2013-03-10  7.543775  3.710198  6.787289
2013-03-11  5.816264  3.722964  6.795556
2013-03-12  3.054002  3.304891  8.258990

Edit

For the first group (2013-02-28) err contains:

A_values     48.759348
B_values     77.023855
C_values    325.376455
D_values     74.422508
E_values     60.602101

Note that the two lowest error values are 48.759348 and 60.602101, so from this group you should probably take A_values (this is OK) and E_values (instead of D_values).

So maybe the grpProc function, instead of:

sort_args = np.argsort(err) < k
cols = models[sort_args]

should contain:

cols = err.nsmallest(k).index
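
With that change, the whole function (this is what the question's EDIT now uses) becomes:

def grpProc(grp):
    # MAPE of each model column against the target, within this group (month)
    err = mape(grp[models], grp[['target']])
    # names of the k models with the smallest error in this month
    cols = err.nsmallest(k).index
    out_cols = [f'top_{i+1}' for i in range(k)]
    # keep only those k columns and rename them top_1 .. top_k
    rv = grp.loc[:, cols]
    rv.columns = out_cols
    return rv

For the 2013-02 group above, this indeed selects A_values and E_values.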
  • Thanks, but I get an error: `KeyError: "None of [Index(['A_values', 'B_values', 'C_values', 'D_values', 'E_values'], dtype='object')] are in the [columns]"`, while running `df.groupby(pd.Grouper(freq='M')).apply(grpProc)`. – ah bon Nov 12 '21 at 13:43
  • May I ask, did you test the example data? Do you mind posting your result? – ah bon Nov 12 '21 at 13:59
  • I created the result from your code sample. Maybe first try my code on just this source data and then check what the difference is between it and your actual source data. – Valdi_Bo Nov 12 '21 at 14:22
  • Note also that your code overwrites the original source data in *df* with the result. Maybe you **first** ran your original code (the processing part), which changed the content of *df*, and **then** attempted my code? – Valdi_Bo Nov 12 '21 at 14:37
  • It works, I edited the question by posting the whole code and result. :) – ah bon Nov 12 '21 at 23:57
  • Regarding your Edit, we can use `sort_args = np.argsort(err, axis=1)[:k]` to replace `sort_args = np.argsort(err, axis=1) < k` to get the correct result. – ah bon Nov 13 '21 at 02:25
  • Sorry, I get an error with real data, `ValueError: Length mismatch: Expected axis has 0 elements, new values have 10 elements`, since there are NaNs inside. I used `df.groupby(pd.Grouper(freq='M'), dropna=True).apply(grpProc)`, but it's still not working. – ah bon Nov 15 '21 at 01:26
  • It seems that you have some "special case" in your real source data, not covered by your test data sample. – Valdi_Bo Nov 15 '21 at 08:08
  • In the real data, I have some `NaN`s in both the `mape` and `target` columns, which causes an error in `groupby()`; when I remove those rows, it works. Could I add other parameters to deal with this issue? – ah bon Nov 15 '21 at 08:12
  • Maybe you should call *dropna* as the first step and then call *groupby*. – Valdi_Bo Nov 15 '21 at 09:41
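
A minimal sketch of that last suggestion (assuming rows with a missing target or missing model values can simply be dropped before grouping; adjust the subset to the columns that matter in your real data):

# Drop incomplete rows first, then group and apply grpProc as before
clean = df.dropna(subset=['target', *models])
wrk = clean.groupby(pd.Grouper(freq='M')).apply(grpProc)
res = clean[['target']].join(wrk)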