
my code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
column_names = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hrs-per-week","native-country","income"]

# sep as a regex needs engine='python'; r',\s*' also strips the space after each comma
adult_train = pd.read_csv("adult.data", header=None, sep=r',\s*', engine='python', na_values=["?"])
adult_train.columns=column_names
adult_train.fillna('NA',inplace=True)

I want the index of the rows which have the value 'NA' in more than one column. Is there a built-in method, or do I have to iterate row-wise and check the value in each column? Here is a snapshot of the data:

I want the index of rows like 398 and 409 (missing values in columns B and G), not of rows like 394 (a missing value only in column N).

Pratik Kumar

1 Answer


Use isnull().any(axis=1) or isnull().sum(axis=1) to build a boolean mask, then select the rows to get the index, i.e.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [np.nan, 4, 5, np.nan, 8],
                   'C': [2, 4, np.nan, 3, 5],
                   'D': [np.nan, np.nan, np.nan, np.nan, 5]})

   A    B    C    D
0  1  NaN  2.0  NaN
1  2  4.0  4.0  NaN
2  3  5.0  NaN  NaN
3  4  NaN  3.0  NaN
4  5  8.0  5.0  5.0

# If you want to select rows with a NaN in column B or C
df.loc[df[['B','C']].isnull().any(axis=1)].index
Int64Index([0, 2, 3], dtype='int64')

# If you want rows with more than one NaN
df.loc[df.isnull().sum(axis=1) > 1].index
Int64Index([0, 2, 3], dtype='int64')
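Note that in the question the NaNs were already replaced with the string 'NA' via fillna, so isnull() will no longer detect them. A minimal sketch of one way to handle that case, assuming the same toy frame as above: compare against the string 'NA' instead of calling isnull().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [np.nan, 4, 5, np.nan, 8],
                   'C': [2, 4, np.nan, 3, 5],
                   'D': [np.nan, np.nan, np.nan, np.nan, 5]})

# Mimic the question's preprocessing: missing values become the string 'NA'
df = df.fillna('NA')

# isnull() now finds nothing, so count cells equal to 'NA' per row instead
mask = (df == 'NA').sum(axis=1) > 1
idx = df.index[mask]
print(idx.tolist())  # → [0, 2, 3]
```

Alternatively, simply run the isnull-based selection before calling fillna, while the missing values are still real NaNs.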
Bharath M Shetty