I have a multiindex dataframe in pandas, with 4 columns in the index, and some columns of data. An example is below:
import pandas as pd
import numpy as np
cnames = ['K1', 'K2', 'K3', 'K4', 'D1', 'D2']
rdata = pd.DataFrame(np.random.randint(1, 3, size=(8, len(cnames))), columns=cnames)
rdata.set_index(cnames[:4], inplace=True)
rdata.sortlevel(inplace=True)
print(rdata)
D1 D2
K1 K2 K3 K4
1 1 1 1 1 2
1 1 2
2 1 2 1
2 1 2 2 1
2 1 2 1
2 1 2 2 2 1
2 1 2 1 1
2 1 1
[8 rows x 2 columns]
What I want to do is select the rows where there are exactly 2 values at the K3 level. Not 2 rows, but two distinct values. I've found how to generate a sort of mask for what I want:
filterFunc = lambda x: len(set(x.index.get_level_values('K3'))) == 2
mask = rdata.groupby(level=cnames[:2]).apply(filterFunc)
print(mask)
K1 K2
1 1 True
2 True
2 1 False
2 False
dtype: bool
And I'd hoped that since rdata.loc[1, 2]
allows you to match on just part of the index, it would be possible to do the same thing with a boolean vector like this. Unfortunately, rdata.loc[mask]
fails with IndexingError: Unalignable boolean Series key provided
.
This question seemed similar, but the answer given there doesn't work for anything other than the top level index, since index.get_level_values only works on a single level, not multiple ones.
Following the suggestion here I managed to accomplish what I wanted with
rdata[[mask.loc[k1, k2] for k1, k2, k3, k4 in rdata.index]]
however, both getting the count of distinct values using len(set(index.get_level_values(...)))
and building the boolean vector afterwards by iterating over every row feels more like I'm fighting the framework to achieve something that seems like a simple task in a multiindex setup. Is there a better solution?
This is using pandas 0.13.1.