
The code below is what I have. It seems to work for ?, '' and ' ' but not for np.NaN. Any suggestions?

Also, I am new to Pandas/Python, so I would like to know if there is a faster way to do this.

I am thinking of treating features as suspect if more than X% (say 5%) of the rows have missing values. Are there any other initial data-sanitization checks that you regularly use?

for col in df.columns:
  pcnt_missing = df[df[col].isin(['?','',' ',np.NaN])][col].count() * 100.0 / df[col].count()
  if pcnt_missing > 1:
    print(f"Col = {col}, Percent missing ={pcnt_missing:.2f}")
  • I think you should check the [pd.DataFrame.dropna method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html). Furthermore, you should try to avoid `for` loops while using `pandas`, because [Python is slow](https://stackoverflow.com/questions/4305518/why-is-equivalent-python-code-so-much-slower) and `pandas` methods usually run in C. – Felipe Whitaker Oct 19 '20 at 01:02
  • @FelipeWhitaker `dropna` has several issues that don't serve my purpose: 1. It won't work for ?, '' & ' '. I can first replace ?, ' ' etc. with NaN, but that brings in issues 2 and 3. 2. I still have to run some sort of for loop to ignore only the features with lots of missing values. I guess I can try out `df.apply()` in case it optimizes things, but I don't see a way to add the problematic features to a list within dropna other than passing a complex function. 3. `thresh` in `dropna` is an absolute value, not a percentage, so I still have to compute it manually using count. – Arvind Swaminathan Oct 19 '20 at 01:34
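
(For reference, a minimal sketch of how the `thresh` issue raised in the comment could be worked around: the placeholder strings are first replaced with NaN and the percentage cut-off is converted into the absolute non-NaN count that dropna expects. It drops the suspect columns rather than listing them, and the 5% figure is only the question's example.)

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, '', '?', 4], 'c': [' ', np.nan, '', 5]})

max_missing = 0.05  # treat a column as suspect above 5% missing (question's example cut-off)
cleaned = df.replace(['?', '', ' '], np.nan)  # map the placeholder strings to NaN first
# thresh is an absolute count of required non-NaN values, so derive it from the row count
kept = cleaned.dropna(axis=1, thresh=int(np.ceil(len(cleaned) * (1 - max_missing))))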

2 Answers


If you can replace the values ?, '', and ' ' with np.nan, you can easily compute the percentage of missing values by using the sum and the length of the DataFrame. You can replace the missing values with an apply:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1,2,3,4], 'b': [2, '', '?', 4], 'c': [' ', np.nan, '', 5]})

def replace(x):
    idx = x.isin(['', ' ', '?'])
    x[idx] = np.nan
    return x

replaced = df.apply(replace, axis=1)  # values are replaced here

Now you can compute the percentage of missing values for each column with this:

replaced.isna().sum(axis=0) * 100 / len(replaced)

Output:

a     0.0
b    50.0
c    75.0
dtype: float64
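
To tie this back to the threshold idea in the question, a small follow-on sketch that lists the suspect columns (the 5% cut-off is the question's example, not anything required by pandas):

missing_pct = replaced.isna().sum() * 100 / len(replaced)
suspect = missing_pct[missing_pct > 5].index.tolist()
print(suspect)  # ['b', 'c'] for the toy DataFrame above
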
Ricardo Erikson

Use boolean logic with isna, using @Ricardo Erikson's setup:

df = pd.DataFrame({'a': [1,2,3,4], 'b': [2, '', '?', 4], 'c': [' ', np.nan, '', 5]})

(df.isna() | df.isin(['?','',' '])).mean()

Output:

a    0.00
b    0.50
c    0.75
dtype: float64

Check for NaN with isna, check for the placeholder values with isin, combine the two with the | (boolean OR) operator, and then use mean to get the fraction missing per column (multiply by 100 for a percentage).
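
If the goal is to go one step further and keep only the columns under the question's cut-off, one possible follow-up using the same mask (the 0.05 threshold is just the question's 5% example):

frac_missing = (df.isna() | df.isin(['?', '', ' '])).mean()
trimmed = df.loc[:, frac_missing <= 0.05]  # keep only columns with at most 5% missing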

Scott Boston