
I have a DataFrame with a 3-level MultiIndex, for example:

import pandas as pd

df = pd.DataFrame({
    'col0': [0, 8, 3, 1, 2, 2, 0, 0],
    'col1': range(8),
}, index=pd.MultiIndex.from_product([[0, 1]] * 3, names=['idx0', 'idx1', 'idx2']))
>>> df
                col0  col1
idx0 idx1 idx2            
0    0    0        0     0
          1        8     1
     1    0        3     2
          1        1     3
1    0    0        2     4
          1        2     5
     1    0        0     6
          1        0     7

For each idx0, I want to find the idx1 that has the lowest mean of col0. This gives me idx0, idx1 pairs. Then I'd like to select all the rows matching those pairs.

In the example above, the pairs are [(0, 1), (1, 1)] (with means 2 and 0, respectively) and the desired result is:

                col0  col1
idx0 idx1 idx2            
0    1    0        3     2
          1        1     3
1    1    0        0     6
          1        0     7

What I have tried

Step 1: Group by idx0, idx1 and calculate the mean of col0:

mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean()
>>> mean_col0
idx0  idx1
0     0       4.0
      1       2.0
1     0       2.0
      1       0.0

Step 2: Take the idxmin per group of idx0, giving the best (idx0, idx1) pair for each idx0:

level_idxs = mean_col0.groupby('idx0').idxmin()
>>> level_idxs
idx0
0    (0, 1)
1    (1, 1)

Step 3: Use that to filter the original dataframe.

That's the main problem. When I simply try df.loc[level_idxs], I get a ValueError due to a shape mismatch: the pairs only have two levels, but the index has three. I would need the third index value or a wildcard.
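A wildcard does work for a single pair, e.g. with pd.IndexSlice (a sketch; this needs a sorted index, which from_product gives):

df.loc[pd.IndexSlice[0, 1, :], :]  # every idx2 under the pair (idx0=0, idx1=1)

But I don't see how to pass a whole list of pairs that way.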

I think I have a solution. Putting it all together with the steps above:

mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean()  # mean of col0 per (idx0, idx1)
level_idxs = mean_col0.groupby('idx0').idxmin()          # best (idx0, idx1) pair per idx0
result = df[df.index.droplevel(2).isin(level_idxs)]      # rows whose first two levels match a pair

But it seems quite complicated. Is there a better way?

  • I found a solution. It's not great, so I'm leaving this open in case someone has something less verbose to solve this. – Leon0402 Aug 20 '22 at 11:16
  • I think your solution is pretty good (only 3 lines - 1 per each operation you're doing). By the way, your way of selecting the `level_idxs` is also proposed in [this answer](https://stackoverflow.com/a/53927461/14627505) (end of Question 4 - **"What if I have multiple levels?"**) – Vladimir Fokow Aug 20 '22 at 11:42
  • You could also use a (rather lengthy one-liner) like this: `df.reset_index(level=2).loc[df.groupby(["idx0", "idx1"])["col0"].mean().groupby(["idx0"]).idxmin()].set_index('idx2', append=True)`. Will get you the same result as your own solution. But it would be a lot slower in fact. – ouroboros1 Aug 20 '22 at 12:12
  • There is a step-by-step example, which at least two people understood, and a minimal real example at the end together with a solution. Not sure what more I could provide. – Leon0402 Aug 20 '22 at 12:46
  • sure, see my edits (and feel free to revert if you prefer your original question). I tried to clarify and added a minimal reproducible example upfront. Also, your solution is optimal AFAIK. – Pierre D Aug 20 '22 at 14:59
  • Thanks, looks better now. I think the solution posted in the answer down below is better than mine (see also my comments under it). And I still think there might be an even better solution than this. – Leon0402 Aug 20 '22 at 15:03
  • Depends what you mean by "better". In terms of performance (speed), yours is faster and, I believe, optimal. (The speed-up is between 50% and 700%, depending on size). – Pierre D Aug 20 '22 at 15:19
  • In general, readability and performance are important. Can you elaborate on why my solution is so much faster? Just to be clear, I'm talking about my solution in the question compared to: `df.groupby('idx0', group_keys=False).apply(lambda g: g.xs(g.groupby('idx1')['col0'].mean().idxmin(), level=1, drop_level=False))` – Leon0402 Aug 20 '22 at 15:35
  • Applying a function (especially a lambda) is typically slower than vectorized operations (see the timing sketch below). In terms of readability -- I totally agree that it is desirable as well as performance, but it is a bit subjective; personally I find both approaches equally readable (and in fact it is by reading your code sample that I understood what you were looking for). – Pierre D Aug 20 '22 at 15:52
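For reference, a sketch of how the timing comparison mentioned above could be set up (an editorial illustration: the frame size and random data are made up, and no results are claimed beyond what the comments report):

import numpy as np
import pandas as pd
import timeit

# A larger frame to compare both approaches on (20 * 20 * 20 = 8000 rows).
big = pd.DataFrame(
    {'col0': np.random.randint(0, 10, 8000), 'col1': range(8000)},
    index=pd.MultiIndex.from_product(
        [range(20)] * 3, names=['idx0', 'idx1', 'idx2']),
)

def isin_version(df):
    # The vectorized solution from the question.
    means = df.groupby(['idx0', 'idx1'])['col0'].mean()
    pairs = means.groupby('idx0').idxmin()
    return df[df.index.droplevel(2).isin(pairs)]

def apply_version(df):
    # The apply/xs variant discussed in the comments.
    return df.groupby('idx0', group_keys=False).apply(
        lambda g: g.xs(g.groupby('idx1')['col0'].mean().idxmin(),
                       level=1, drop_level=False))

print('isin: ', timeit.timeit(lambda: isin_version(big), number=20))
print('apply:', timeit.timeit(lambda: apply_version(big), number=20))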

1 Answer


You can use .apply().

For each idx0 group, keep only the rows whose idx1 has the smallest mean of col0:

df.groupby('idx0').apply(lambda g:
    g.query(f"idx1 == {g.groupby('idx1')['col0'].mean().idxmin()}")
).droplevel(0)  # drop the duplicate idx0 level that groupby-apply prepends

The same can be written in this (hopefully more readable) way:

def f(df):
    # Within this idx0 group: the idx1 whose mean of col0 is smallest.
    chosen_idx1 = df.groupby('idx1')['col0'].mean().idxmin()
    return df.query('idx1 == @chosen_idx1')

df.groupby('idx0').apply(f).droplevel(0)
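As a further variant (an editorial sketch, not part of the original answer): the same selection can be done without apply at all, using two aligned transform calls. Note that if several idx1 groups tie for the minimal mean, this keeps all of them, whereas idxmin() keeps exactly one:

# Mean of col0 per (idx0, idx1), broadcast back to every row.
group_mean = df.groupby(['idx0', 'idx1'])['col0'].transform('mean')
# Keep rows whose group mean equals the smallest group mean within their idx0.
result = df[group_mean == group_mean.groupby('idx0').transform('min')]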
  • It is not quite working as it repeats idx0, so the output will have idx0 twice in its index – Leon0402 Aug 20 '22 at 12:25
  • @Leon0402, yes sorry, forgot to add `droplevel(0)` at the end. Updated. Now it should work. – Vladimir Fokow Aug 20 '22 at 12:27
  • I like the general idea of using `apply` here very much. This shows my intent more clearly! I will play a little around it. Maybe it is possible to change the lambda so the drop level is not needed. – Leon0402 Aug 20 '22 at 12:31
  • @Leon0402, seems like `.droplevel(0)` is necessary. Either at the end (like I have), or right before `.query()`: `g.droplevel(0).query(...)`, or after `.query()`: `g.query(...).droplevel(0)` – Vladimir Fokow Aug 20 '22 at 13:00
  • I found another way: You can also do `groupby('idx0', group_keys=False)`, which is better than dropping, I think. And instead of query one can also use xs like this: `df.groupby('idx0', group_keys=False).apply(lambda g: g.xs(g.groupby('idx1')['col0'].mean().idxmin(), level=1, drop_level=False))` (spread out for readability below). I don't know though if xs or query would be preferable. – Leon0402 Aug 20 '22 at 13:03
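For readability, the one-liner from the last comment, spread over several lines (same behavior):

result = (
    df.groupby('idx0', group_keys=False)
      .apply(lambda g: g.xs(
          g.groupby('idx1')['col0'].mean().idxmin(),  # best idx1 within this idx0 group
          level=1,
          drop_level=False,  # keep idx1 in the result's index
      ))
)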