
my code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
column_names = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hrs-per-week","native-country","income"]

# sep as a regex needs engine='python'; r',\s*' also strips the space after each comma
adult_train = pd.read_csv("adult.data", header=None, sep=r',\s*', engine='python', na_values=["?"])
adult_train.columns=column_names
adult_train.fillna('NA',inplace=True)

I want the index of the rows which have the value 'NA' in more than one column. Is there a built-in method, or do I have to iterate row-wise and check the value in each column? Here is a snapshot of the data:

I want the index of rows like 398 and 409 (missing values in columns B and G), not of rows like 394 (a missing value only in column N).

Pratik Kumar

1 Answer


Use isnull().any(axis=1) or isnull().sum(axis=1) to build a boolean mask, then select the rows to get the index, i.e.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [np.nan, 4, 5, np.nan, 8],
                   'C': [2, 4, np.nan, 3, 5],
                   'D': [np.nan, np.nan, np.nan, np.nan, 5]})

   A    B    C    D
0  1  NaN  2.0  NaN
1  2  4.0  4.0  NaN
2  3  5.0  NaN  NaN
3  4  NaN  3.0  NaN
4  5  8.0  5.0  5.0

# If you want to select rows with a NaN in column B or C
df.loc[df[['B','C']].isnull().any(axis=1)].index
Int64Index([0, 2, 3], dtype='int64')

# If you want rows with more than one NaN
df.loc[df.isnull().sum(axis=1) > 1].index
Int64Index([0, 2, 3], dtype='int64')
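Note that in the question the NaNs were already replaced with the string 'NA' via fillna, so isnull() will no longer detect them. A minimal sketch of one way to handle that case, assuming the same toy frame as above: compare against the string 'NA' instead of calling isnull().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [np.nan, 4, 5, np.nan, 8],
                   'C': [2, 4, np.nan, 3, 5],
                   'D': [np.nan, np.nan, np.nan, np.nan, 5]})

# Mimic the question's preprocessing: missing values become the string 'NA'
df = df.fillna('NA')

# isnull() now finds nothing, so count cells equal to 'NA' per row instead
mask = (df == 'NA').sum(axis=1) > 1
idx = df.index[mask]
print(idx.tolist())  # → [0, 2, 3]
```

Alternatively, simply run the isnull-based selection before calling fillna, while the missing values are still real NaNs.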
Bharath M Shetty