0

So I have got this Pandas DataFrame with multilevel index for the columns:

   group1    group2    group3
   1    2    1    2    1    2
0  ...  ...  NaN  ...  ...  ...
1  NaN  ...  ...  ...  ...  ...
2  ...  ...  ...  ...  NaN  ...

Now i want to drop the rows where the columns group2 and group3 have NaN values. Which equates to rows 0 and 2 in this instance.

According to my understanding of the documentation this should work:

df.dropna(axis = 'rows', subset = ['group2', 'group3'])

But it does not. Instead I get the error:

KeyError: ['group2', 'group3']

Could someone please point out to me how to properly specify the subset?

Kind regards, Rasmus


Update

So it seems like .dropna() cannot work with mulitlevel column indexes. In the end I went with the less elegant, but workable method suggested, slightly rewritten:

mask_nan = df[['group2', 'group3']].isna().any(axis = 'columns')
df[~mask_nan]    # ~ to negate / flip the boolean values
user2194172
  • 99
  • 1
  • 5

3 Answers3

3

Seems like we can not pass the index level in dropna , so we could do

df.loc[:,['group2', 'group3']].isna().any(1)

Then

df=df[df.loc[:,['group2', 'group3']].isna().any(1)]
BENY
  • 317,841
  • 20
  • 164
  • 234
2

I think this is a similiar question to yours.

import numpy as np

df = df[np.isfinite(df['group2', 'group3'])]

Only the rows where the values are finite are taken into account here.

christheliz
  • 176
  • 2
  • 15
  • Since I pass a total of four columns (two column groups of two a piece) would I not get a boolean array? I would need a Pandas Series, to pass back as an index mask. – user2194172 Feb 05 '20 at 14:56
1

Start from detail. When you run:

idx = pd.IndexSlice
df.loc[:, idx['group2':'group3']]

You will get columns for group2 and group3:

  group2     group3    
       1   2      1   2
0    NaN   3    4.0   5
1    8.0   9   10.0  11
2   14.0  15    NaN  17

Now a more compicated expession:

df.loc[:, idx['group2':'group3']].notnull().all(axis=1)

will display a boolean Series with True where all columns are not null:

0    False
1     True
2    False
dtype: bool

So the code that you need is to use the above code in boolean indexing:

df[df.loc[:, idx['group2':'group3']].notnull().all(axis=1)]

(+ idx = pd.IndexSlice before).

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41