
I have a dataframe:

pe_odds[ [ 'EVENT_ID', 'SELECTION_ID', 'ODDS' ] ]
Out[67]: 
     EVENT_ID  SELECTION_ID   ODDS
0   100429300       5297529  18.00
1   100429300       5297529  20.00
2   100429300       5297529  21.00
3   100429300       5297529  22.00
4   100429300       5297529  23.00
5   100429300       5297529  24.00
6   100429300       5297529  25.00

When I use groupby and agg, I get results with a multi-index:

pe_odds.groupby( [ 'EVENT_ID', 'SELECTION_ID' ] )[ 'ODDS' ].agg( [ np.min, np.max ] )
Out[68]: 
                         amin   amax
EVENT_ID  SELECTION_ID              
100428417 5490293        1.71   1.71
          5881623        1.14   1.35
          5922296        2.00   2.00
          5956692        2.00   2.02
100428419 603721         2.44   2.90
          4387436        4.30   6.20
          4398859        1.23   1.35
          4574687        1.35   1.46
          4881396       14.50  19.00
          6032606        2.94   4.20
          6065580        2.70   5.80
          6065582        2.42   3.65
100428421 5911426        2.22   2.52

I have tried using as_index to return the results without the multi-index:

pe_odds.groupby( [ 'EVENT_ID', 'SELECTION_ID' ], as_index=False )[ 'ODDS' ].agg( [ np.min, np.max ], as_index=False )

But it still gives me a multi-index.

I can use .reset_index(), but it is very slow:

pe_odds.groupby( [ 'EVENT_ID', 'SELECTION_ID' ] )[ 'ODDS' ].agg( [ np.min, np.max ] ).reset_index()
Out[69]: 
     EVENT_ID  SELECTION_ID   amin   amax
0   100428417       5490293   1.71   1.71
1   100428417       5881623   1.14   1.35
2   100428417       5922296   2.00   2.00
3   100428417       5956692   2.00   2.02
4   100428419        603721   2.44   2.90
5   100428419       4387436   4.30   6.20

How can I return the results without the multi-index, using parameters of the groupby and/or agg functions, and without having to resort to reset_index()?

smci
Ginger

4 Answers


The call below:

>>> gr = df.groupby(['EVENT_ID', 'SELECTION_ID'], as_index=False)
>>> res = gr.agg({'ODDS':[np.min, np.max]})
>>> res
    EVENT_ID SELECTION_ID ODDS     
                          amin amax
0  100429300      5297529   18   25
1  100429300      5297559   30   38

returns a frame with multi-index columns. If you do not want the columns to be a multi-index either, you may do:

>>> res.columns = list(map(''.join, res.columns.values))
>>> res
    EVENT_ID  SELECTION_ID  ODDSamin  ODDSamax
0  100429300       5297529        18        25
1  100429300       5297559        30        38
behzad.nouri
    With pandas v0.24.0 the `.to_flat_index()` function was introduced to columns. Which slightly changes the command to: `res.columns = ["_".join(col_name).rstrip('_') for col_name in res.columns.to_flat_index()]`. (Note how I join on "_" instead of empty space, to concat first and second level column names using underscores instead of spaces. This feels more pythonic to me, but is purely my personal preference.) – Kim Mar 26 '20 at 16:28
  • This isn't working anymore. – Salih Dec 19 '22 at 09:50
  • @Salih This still works for me (pandas 2.0). Note that whatever string `join` is called on will be used to paste the column labels from different levels together. As @Kim mentioned, `'_'.join` is often a nice choice. – Attila the Fun Apr 07 '23 at 17:38
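Following on from the version notes in the comments above: on pandas 0.25 and later, named aggregation sidesteps the multi-level columns entirely, so no post-processing of `res.columns` is needed. A minimal sketch (the column names `ODDS_amin`/`ODDS_amax` are my choice, mirroring the flattened names above):

```python
import pandas as pd

pe_odds = pd.DataFrame({
    'EVENT_ID': [100429300] * 4,
    'SELECTION_ID': [5297529, 5297529, 5297559, 5297559],
    'ODDS': [18.0, 25.0, 30.0, 38.0],
})

# Named aggregation (pandas >= 0.25): each keyword becomes a flat output
# column, so neither the rows (as_index=False) nor the columns get a
# multi-index.
res = (
    pe_odds
    .groupby(['EVENT_ID', 'SELECTION_ID'], as_index=False)
    .agg(ODDS_amin=('ODDS', 'min'), ODDS_amax=('ODDS', 'max'))
)
print(res)
```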

It is also possible to remove the multi-index on the columns using the pipe method, set_axis, and chaining (which I believe is more readable).

(
    pe_odds
    .groupby(by=['EVENT_ID', 'SELECTION_ID'])
    .agg([np.min, np.max])
    .pipe(lambda x: x.set_axis(x.columns.map('_'.join), axis=1))
)

This is the output without resetting the index.

                        ODDS_amin  ODDS_amax
EVENT_ID  SELECTION_ID                      
100429300 5297529            18.0       25.0
100429300 5297559            30.0       38.0

I have taken Kim's comment and optimised it (you don't need to use .to_flat_index() at all) into the code below. I believe this is the most pythonic (easy to understand) and elegant approach:

df.columns = ["_".join(col_name).rstrip('_') for col_name in df.columns]   

An example use would be:

>>> df.columns = ["_".join(col_name).rstrip('_') for col_name in df.columns]
>>> df
    EVENT_ID  SELECTION_ID  ODDS_amin  ODDS_amax
0  100429300       5297529        18        25
1  100429300       5297559        30        38
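For context on the `rstrip('_')` (my note, not the answerer's): when grouping keys are kept as columns, their second level is an empty string, so joining on `'_'` would leave a trailing underscore (e.g. `EVENT_ID_`), which rstrip removes. A sketch using string aggregator names (`'min'`/`'max'` rather than `np.min`/`np.max`, so the levels come out as `min`/`max`):

```python
import pandas as pd

pe_odds = pd.DataFrame({
    'EVENT_ID': [100429300] * 4,
    'SELECTION_ID': [5297529, 5297529, 5297559, 5297559],
    'ODDS': [18.0, 25.0, 30.0, 38.0],
})

df = (
    pe_odds
    .groupby(['EVENT_ID', 'SELECTION_ID'], as_index=False)
    .agg({'ODDS': ['min', 'max']})
)
# Column tuples are ('EVENT_ID', ''), ('SELECTION_ID', ''),
# ('ODDS', 'min'), ('ODDS', 'max'); joining on '_' would produce
# 'EVENT_ID_', hence the rstrip('_').
df.columns = ["_".join(col_name).rstrip('_') for col_name in df.columns]
print(df.columns.tolist())  # ['EVENT_ID', 'SELECTION_ID', 'ODDS_min', 'ODDS_max']
```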

Here's a general solution that works on either columns or index (partly inspired by @Manuel Bolívar's answer):

def join_levels(df, delim='_', axis=0):
    return (
        df.set_axis(
            df.axes[axis].map(delim.join),
            axis=axis
        )
    )

Then use it (for example):

res = (
    pe_odds
    .groupby(['EVENT_ID', 'SELECTION_ID'])
    .agg([np.min, np.max])
    .pipe(join_levels, axis=1)
    # ...
)

To keep functions like this organized with namespaces, I like to collect them in an accessor (see the pandas docs on registering custom accessors). For example, modifying it slightly:

@pd.api.extensions.register_dataframe_accessor("custom")
class _MyAccessor:
    def __init__(self, pandas_obj: pd.DataFrame):
        self._obj = pandas_obj

    def join_levels(self, delim='_', axis=0):
        return (
            self._obj.set_axis(
                self._obj.axes[axis].map(delim.join),
                axis=axis
            )
        )

Then you can do:

res = (
    pe_odds
    .groupby(['EVENT_ID', 'SELECTION_ID'])
    .agg([np.min, np.max])
    .custom.join_levels(axis=1)
    # ...
)

You can put any accessors in a module (.py file) somewhere. Then just import that module where you're working and all of the accessors are magically available.
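Putting the pieces above together into one self-contained run (the accessor name `custom` comes from the answer; the sample frame is mine, and I use string aggregators `'min'`/`'max'` to keep the flattened names predictable):

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("custom")
class _MyAccessor:
    def __init__(self, pandas_obj: pd.DataFrame):
        self._obj = pandas_obj

    def join_levels(self, delim='_', axis=0):
        # Join each tuple of labels on the chosen axis into one string.
        return self._obj.set_axis(
            self._obj.axes[axis].map(delim.join), axis=axis
        )


pe_odds = pd.DataFrame({
    'EVENT_ID': [100429300, 100429300],
    'SELECTION_ID': [5297529, 5297529],
    'ODDS': [18.0, 25.0],
})

res = (
    pe_odds
    .groupby(['EVENT_ID', 'SELECTION_ID'])
    .agg(['min', 'max'])
    .custom.join_levels(axis=1)
)
print(res.columns.tolist())  # ['ODDS_min', 'ODDS_max']
```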

Attila the Fun