Grouping by everything except for one index column in pandas

Question

My data analysis repeatedly falls back on a simple but iffy motif, namely "groupby everything except". Take this multi-index example, df:

                      accuracy  velocity
name condition trial                    
john a         1     -1.403105  0.419850
               2     -0.879487  0.141615
     b         1      0.880945  1.951347
               2      0.103741  0.015548
hans a         1      1.425816  2.556959
               2     -0.117703  0.595807
     b         1     -1.136137  0.001417
               2      0.082444 -1.184703

What I want to do now, for instance, is to average over all available trials while retaining the information about names and conditions. This is easily achieved:

average = df.groupby(level=('name', 'condition')).mean()

Under real-world conditions, however, there's a lot more metadata stored in the multi-index. The index easily spans 8-10 levels, so spelling out every level I want to keep becomes quite unwieldy. Ultimately, I'm looking for a "discard" operation: one that throws out or reduces a single index level. In the case above, that's the trial number.

Should I just bite the bullet or is there a more idiomatic way of going about this? This might well be an anti-pattern! I want to build a decent intuition when it comes to the "true pandas way"... Thanks in advance.

unutbu · Accepted Answer · 2014-09-01T13:19:37.013

You could define a helper function for this (it assumes `levels` holds the names of the index levels, as in the demo below):

def allbut(*names):
    names = set(names)
    return [item for item in levels if item not in names]

Demo:

import numpy as np
import pandas as pd
levels = ('name', 'condition', 'trial')
names = ('john', 'hans')
conditions = list('ab')
trials = range(1, 3)

idx = pd.MultiIndex.from_product(
    [names, conditions, trials], names=levels)

df = pd.DataFrame(np.random.randn(len(idx), 2),
                  index=idx, columns=('accuracy', 'velocity'))

def allbut(*names):
    names = set(names)
    return [item for item in levels if item not in names]

In [40]: df.groupby(level=allbut('condition')).mean()
Out[40]: 
            accuracy  velocity
name trial                    
hans 1      0.086303  0.131395
     2     -0.234961 -0.626495
john 1      0.454824 -0.259495
     2      0.614730 -0.144183

You can remove more than one level too:

In [53]: df.groupby(level=allbut('name', 'trial')).mean()
Out[53]: 
           accuracy  velocity
condition                    
a         -0.597178 -0.370377
b         -0.126996 -0.037003

This looks good; I've been using something along those lines. Do you reckon the pattern itself is sound? Is there an even more "built-in" way of achieving the same result? — ap3l, Sep 01 '14 at 14:19
There are a bunch of different variations, such as `df.groupby(level=list(set(levels)-{'name'})).mean()`, but I think they all amount to basically the same thing -- especially for a small number of levels. I don't think there is a more "built-in" way, so defining a helper function is the best way to make the code look readable. — unutbu, Sep 01 '14 at 14:23
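One caveat about the set-based variation: a Python set does not guarantee ordering, so the levels of the result may not come out in the order they have in the original index. A minimal sketch of an order-preserving helper follows. It reads the names from the frame itself rather than from a global `levels`. The name `levels_except` and the deterministic sample values are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def levels_except(df, *drop):
    """Return df's index level names, in their original order, minus `drop`."""
    drop = set(drop)
    return [name for name in df.index.names if name not in drop]


# Same index layout as the demo above, with deterministic values
# (illustrative data, not the random numbers from the answer).
idx = pd.MultiIndex.from_product(
    [("john", "hans"), list("ab"), range(1, 3)],
    names=("name", "condition", "trial"),
)
df = pd.DataFrame(
    np.arange(16, dtype=float).reshape(8, 2),
    index=idx,
    columns=("accuracy", "velocity"),
)

# Average over trials while keeping every other index level.
average = df.groupby(level=levels_except(df, "trial")).mean()
```

Because the helper takes the frame as an argument, it should keep working when the index gains or loses levels, as long as the level being dropped still exists.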

score 1 · Answer 2 · answered Jan 11 '23 at 13:09

The groupby documentation includes an example of grouping by all but one specified level of a MultiIndex. It uses the .difference method of the index names:

df.groupby(level=df.index.names.difference(['name']))
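As a rough, self-contained check of this approach, the sketch below applies it to the "drop the trial level" case from the question and compares the result with the explicit groupby. The sample values are made up for illustration. The `list(...)` wrapper is a defensive assumption, not something the documentation requires.

```python
import pandas as pd

# Small frame shaped like the question's example (illustrative values).
idx = pd.MultiIndex.from_product(
    [["john", "hans"], ["a", "b"], [1, 2]],
    names=["name", "condition", "trial"],
)
df = pd.DataFrame(
    {"accuracy": [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]},
    index=idx,
)

# All level names except 'trial', kept in their original order.
keep = df.index.names.difference(["trial"])

by_difference = df.groupby(level=list(keep)).mean()
explicit = df.groupby(level=["name", "condition"]).mean()
```

Both calls should produce the same frame, so the `.difference` form can stand in for the hand-written list of levels.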

Erik
