6

I have a pandas DataFrame with a MultiIndex of columns:

columns=pd.MultiIndex.from_tuples(
    [(c, i) for c in ['a', 'b'] for i in range(3)])
df = pd.DataFrame(np.random.randn(4, 6),
                  index=[0, 0, 1, 1],
                  columns=columns)
print(df)

          a                             b                    
          0         1         2         0         1         2
0  0.582804  0.753118 -0.900950 -0.914657 -0.333091 -0.965912
0  0.498002 -0.842624  0.155783  0.559730 -0.300136 -1.211412
1  0.727019  1.522160  1.679025  1.738350  0.593361  0.411907
1  1.253759 -0.806279 -2.177582 -0.099210 -0.839822 -0.211349

I want to group by the index, and use the 'min' aggregation on the a columns, and the 'sum' aggregation on the b columns.

I know I can do this by creating a dict that specifies the agg function for each column:

agg_dict = {'a': 'min', 'b': 'sum'}
full_agg_dict = {(c, i): agg_dict[c] for c in ['a', 'b'] for i in range(3)}
print(df.groupby(level=0).agg(full_agg_dict))

          a                             b                    
          0         1         2         0         1         2
0  0.498002 -0.842624 -0.900950 -0.354927 -0.633227 -2.177324
1  0.727019 -0.806279 -2.177582  1.639140 -0.246461  0.200558

Is there a simpler way? It seems like there should be a way to do this with agg_dict without using full_agg_dict.

David Wasserman
  • 551
  • 5
  • 10
  • 3
    I dont know if there's anything simpler. Probably just make the dictionary more flexible (and easier to read) if it doesn't follow a perfect pattern: `{x: agg_dict[x[0]] for x in df.columns}` – ALollz Sep 05 '19 at 18:50

1 Answers1

2

I would use your approach as well. But here's another way that (should) work:

(df.stack(level=1)
   .groupby(level=[0,1])
   .agg({'a':'min','b':'sum'})
   .unstack(-1)
)

For some reason groupby(level=[0,1] doesn't work for me, so I came up with:

(df.stack(level=1)
   .reset_index()
   .groupby(['level_0','level_1'])
   .agg({'a':'min','b':'sum'})
   .unstack('level_1')
)
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • 1
    Reason the first solution does not work for this dataframe is because of the original dataframe's index is a 1D array of shape `(4,)`. If instead a 2D array (`index=[[0, 0, 1, 1]]`) was passed, it would work just fine. – stahamtan Sep 05 '19 at 18:22
  • 1
    @ALollz agreed. That's what I said at the very beginning as well. – Quang Hoang Sep 05 '19 at 18:50
  • 3
    @SIA it's a bug with `stack`. Codes are created incorrectly (which are then used in the groupby) when the index has duplicate values. `stack` currently just uses `new_codes = [np.arange(N).repeat(levsize)]` to generate new codes, which ignores dupes. – user3483203 Sep 05 '19 at 19:25