0

I am working on missing data for a credit risk analysis project. There are missing values in many columns of the Dataframe. Dataframe loan_data is as below:

[IN]: loan_data
[OUT]:

Emp_ID   Emp_Name    City_Name     Salary      Designation  Emp_years Age
  1        A           Delhi       30,00,000      GM          15       45
  2        B           Mumbai       NAN        Clerk          2        22
  3        c           NAN          NAN         Peon          4        18
  4        D           Chennai      7,000       NAN           5        20
  5        E           NAN          NAN          NAN          4        50

and so on....

Now I want the only columns to be displayed should be those where I have NAN values and I want their sum(how many rows have NAN Values

For example,

[IN]: 
def return_loan_data_missing(x):
if (x.isnull().sum()>0):
    return x.isnull().sum()

return_loan_data_missing(loan_data)

[OUT]:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), 
a.item(), a.any() or a.all().

Output desired:

 [OUT]:
 City_Name 2     
 Salary 3
 Designation 2

Right now output I am getting :

 [IN]:
  loan_data.isnull().sum()
 [OUT]:
 Emp_ID 0
 Emp_Name 0
 City_Name 2
 Salary 3
 Designation 2
 Emp_years 0
 Age 0

Please help

noob
  • 3,601
  • 6
  • 27
  • 73
  • what does `if (x.isnull().sum()>0):` mean? do you mean `if (x.isnull().sum()>0).all():` or `if (x.isnull().sum()>0).any():`? – Giacomo Alzetta Oct 31 '19 at 10:36
  • It meant .any() – noob Oct 31 '19 at 10:39
  • Yes, but how can pandas know that you meant `.any()` and not `.all()`. that's what the error is saying. When you deal with array you always have to decide which kind of condition you want to apply. – Giacomo Alzetta Oct 31 '19 at 10:57
  • Does this answer your question? [Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) – AMC Jan 19 '20 at 03:03

2 Answers2

5

You can filter Series for greater 0:

s = loan_data.isnull().sum()
s = s[s > 0]

Or use callable for filtering:

s = loan_data.isnull().sum().loc[lambda x: x > 0]

print (s)
City_Name      2
Salary         3
Designation    2
dtype: int64

Your function should be changed with test if at least one True in mask by Series.any:

def return_loan_data_missing(x):
    s = x.isnull().sum()
    m = s > 0
    if m.any():
        return s[m]
    else:
        return 'No missing value in data.'

print (return_loan_data_missing(loan_data))
City_Name      2
Salary         3
Designation    2
dtype: int64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

We can also do:

df.loc[:,df.isnull().any()].isnull().sum()

Output

City_Name      2
Salary         3
Designation    2
dtype: int64

With DataFrame.count.

(len(df)-df.count()).loc[lambda x: x>0]

City_Name      2
Salary         3
Designation    2
dtype: int64
ansev
  • 30,322
  • 5
  • 17
  • 31