How do I take a grouped pandas dataframe and pull out minimum value for each group?

Question

I have a grouped dataframe consisting of a multilevel index of items (title ord_base7), a snapshot date of when sales forecasts were made, and the different models that made those forecasts along with each model's error (MAPE).

I'd like to get simplify the data frame to the model with the smallest MAPE for each ord_base7/snapshot date.

Below is a sample of the data for one item.

import pandas as pd

df_dict = {'ord_base7': ['100179K',  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',
  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',
  '100179K',  '100179K',  '100179K',  '100179K','100179K',  '100179K',  '100179K',  '100179K',
  '100179K',  '100179K',  '100179K',  '100179K',  '100179K'],
 'snapshot': ['2020-12-31 00:00:00',  '2020-12-31 00:00:00',
  '2020-12-31 00:00:00',  '2020-12-31 00:00:00',
  '2020-12-31 00:00:00',  '2020-12-31 00:00:00',
  '2020-12-31 00:00:00',  '2021-01-31 00:00:00',
  '2021-01-31 00:00:00',  '2021-01-31 00:00:00',
  '2021-01-31 00:00:00',  '2021-01-31 00:00:00',
  '2021-01-31 00:00:00',  '2021-01-31 00:00:00',
  '2021-02-28 00:00:00',  '2021-02-28 00:00:00',
  '2021-02-28 00:00:00',  '2021-02-28 00:00:00',
  '2021-02-28 00:00:00',  '2021-02-28 00:00:00',
  '2021-02-28 00:00:00',  '2021-03-31 00:00:00',
  '2021-03-31 00:00:00',  '2021-03-31 00:00:00',
  '2021-03-31 00:00:00',  '2021-03-31 00:00:00',
  '2021-03-31 00:00:00'],
 'model': ['AutoArima',  'SARIMA',  'STLF',  'TESA',  'TESM',  'lstm_neural_net',  'prophet',
  'AutoArima',  'SARIMA',  'STLF',  'TESA','TESM',  'lstm_neural_net',  'prophet',
  'AutoArima',  'SARIMA',  'STLF',  'TESA',  'TESM',  'lstm_neural_net',  'prophet',
  'AutoArima',  'SARIMA',  'STLF',  'TESA',  'TESM',  'lstm_neural_net'],
 'MAPE': [71.50999999999999,  4439.13,  1.34,  96.31,  100.13000000000001,  70.17999999999999,
  53.25,  18.32,  20227.11,  12.15,  92.25999999999999,  93.14,  12.2,  11.3,  8.99,  20664.37,
  18.68,  15.8,  20.39,  1.69,  25.679999999999996,  3.0700000000000003,  8.68,  24.6,  5.43,
  2.93,  1.87]}

df = pd.DataFrame.from_dict(df_dict)
df.set_index(['ord_base7', 'snapshot', 'model'],inplace = True)

I tried to just pull the min(MAPE) out but that's over the whole dataset. I need to pull it for each snapshot in time for each item so I know what model performed the best at that time.

How can I do this?

Corralien · Answer 1 · 2023-01-20T23:54:13.707

You can use idxmin:

>>> df.loc[df.groupby(level='snapshot')['MAPE'].idxmin()]
                                                MAPE
ord_base7 snapshot            model                 
100179K   2020-12-31 00:00:00 STLF              1.34
          2021-01-31 00:00:00 prophet          11.30
          2021-02-28 00:00:00 lstm_neural_net   1.69
          2021-03-31 00:00:00 lstm_neural_net   1.87

or df.loc[df.groupby(level=['ord_base7', 'snapshot'])['MAPE'].idxmin()] depends on what "each item" means.

Update

You can also use df.groupby(level=['ord_base7', 'snapshot'])['MAPE'].nsmallest(1)

Pythoneer · Answer 2 · 2023-01-21T10:27:12.010

Groupby and use the .min() function along it (you also have to stop setting the index to snapshot as you can't use it in a .groupby() selection):

# ... code before
df.set_index(['ord_base7', 'model'],inplace = True)
print(df.loc[df['MAPE'].isin(df.groupby('snapshot')['MAPE'].min().tolist())])

gives:

                                      snapshot   MAPE
ord_base7 model                                      
100179K   STLF             2020-12-31 00:00:00   1.34
          prophet          2021-01-31 00:00:00  11.30
          lstm_neural_net  2021-02-28 00:00:00   1.69
          lstm_neural_net  2021-03-31 00:00:00   1.87

Why this method is better

Using the datetime module to time this and the other answer, I found out that my method is quicker on average:

My method

from datetime import datetime

start_time = datetime.now()

#... code after
print(df.loc[df['MAPE'].isin(df.groupby('snapshot')['MAPE'].min().tolist())])

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Gave:

Duration: 0:00:00.034470

or 0.034 seconds.

Other method

from datetime import datetime

start_time = datetime.now()

#... code after
print(df.loc[df.groupby(level='snapshot')['MAPE'].idxmin()])

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Gave:

Duration: 0:00:00.092165

or 0.092 seconds.

So

My method < Other method

  0.034   <    0.092

And to set the snapshot to an index again, change the code to:

print(df.loc[df['MAPE'].isin(df.groupby('snapshot')['MAPE'].min().tolist())].reset_index().set_index(['ord_base7', 'snapshot', 'model']))

I should have mentioned that I want to know what model is giving those mape scores. — JordanBH, Jan 20 '23 at 21:35
You change the `set_index` without the OP's opinion so don't forget to mention he need to use `reset_index` and you don't preserve his original data. — Corralien, Jan 20 '23 at 23:45

How do I take a grouped pandas dataframe and pull out minimum value for each group?

2 Answers2

Why this method is better

My method

Other method

So