
I have a dataframe as follows

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.45, 2.33, np.nan], 'C': [4, 5, 6], 'D': [4.55, 7.36, np.nan]}) 

I want to replace the missing values, i.e. np.nan, in a generic way. For this I have created the following function:

def treat_mis_value_nu(df):
    df_nu = df.select_dtypes(include=['number'])
    lst_null_col = df_nu.columns[df_nu.isnull().any()].tolist()
    if len(lst_null_col)>0:
        for i in lst_null_col:
            if df_nu[i].isnull().sum()/len(df_nu[i])>0.10:
                df_final_nu = df_nu.drop([i],axis=1)
            else:
                df_final_nu = df_nu[i].fillna(df_nu[i].median(),inplace=True)
    return df_final_nu

When I apply this function as follows

df_final = treat_mis_value_nu(df)

I am getting a dataframe as follows

   A    B  C
0  1  1.0  4
1  2  2.0  5
2  3  NaN  6

So it has actually removed column D correctly, but failed to remove column B. I know there have been discussions on this topic in the past (here). Still, what am I missing?
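
For reference, a minimal check (a sketch, assuming the usual pandas/numpy imports) shows that fillna(..., inplace=True) returns None, which is part of why the assignment inside the loop above does not build up a cleaned dataframe:

import numpy as np
import pandas as pd

s = pd.Series([1.45, 2.33, np.nan])
# fillna with inplace=True modifies s in place but returns None,
# so df_final_nu ends up holding None rather than the filled column
print(s.fillna(s.median(), inplace=True))   # None
print(s.tolist())                           # [1.45, 2.33, 1.89]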

– pythondumb
    Can you add a statement explaining what you want your `generic` function to do? – meW Feb 26 '19 at 06:33

2 Answers


Use:

df = pd.DataFrame({'A': [1, 2, 3,5,7], 'B': [1.45, 2.33, np.nan, np.nan, np.nan], 
                   'C': [4, 5, 6,8,7], 'D': [4.55, 7.36, np.nan,9,10],
                   'E':list('abcde')}) 
print (df)
   A     B  C      D  E
0  1  1.45  4   4.55  a
1  2  2.33  5   7.36  b
2  3   NaN  6    NaN  c
3  5   NaN  8   9.00  d
4  7   NaN  7  10.00  e

def treat_mis_value_nu(df):
    # get only numeric columns
    df_nu = df.select_dtypes(include=['number'])
    # keep only the columns that contain NaNs
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    # columns to remove - isnull().mean() is the same as sum/len
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() > 0.30]
    # fill missing values in the original columns and drop those above the threshold
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)

print (treat_mis_value_nu(df))
   A  C      D  E
0  1  4   4.55  a
1  2  5   7.36  b
2  3  6   8.18  c
3  5  8   9.00  d
4  7  7  10.00  e
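
The 0.30 cutoff is hard-coded above. If it needs to vary (for example the 10% mentioned in the question), a parameterised variant of the same idea could look like this (the function name and thresh argument are illustrative, not part of the original answer):

def treat_mis_value_nu_thresh(df, thresh=0.30):
    # numeric columns that contain at least one NaN
    df_nu = df.select_dtypes(include=['number'])
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    # drop columns whose NaN fraction exceeds the threshold,
    # fill the remaining ones with their medians
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() > thresh]
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)

# with thresh=0.10, column D (20% NaN) is dropped as well as column B
print (treat_mis_value_nu_thresh(df, thresh=0.10))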
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hi @jezrael: This is good. Just another related query: if a column in the above example contains `string` values, then I want to replace `nan` with `blank`. I have changed the last line of the function to `return df.fillna('blank').drop(cols_to_drop, axis=1)`, but this is not working, i.e. I am still getting `nan` in the `df`. Please let me know how to tackle this. – pythondumb Feb 26 '19 at 10:30
  • @pythondumb - hmm, so this does not work: `return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1).fillna('blank')`? Maybe you need `return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1).replace('NaN', 'blank')` – jezrael Feb 26 '19 at 10:31
  • Let me check. Besides, I can't do a `df.apply(lambda x: f(x))` operation. Here `f(x)` is nothing but a simple regex like `re.sub('[^A-Za-z0-9]', ' ', text)`. Do I need to convert the particular column (like the `E` column in your example) with `astype(str)`? – pythondumb Feb 26 '19 at 10:40
  • @pythondumb - not sure I understand, can you show it with changed data? – jezrael Feb 26 '19 at 11:14
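
Regarding the follow-up in the comments above, one way to blank out NaNs only in the string columns (a sketch, not from the original thread; out is assumed to be the result of treat_mis_value_nu(df)):

out = treat_mis_value_nu(df)
# object (string) columns are untouched by the numeric median fill,
# so fill their NaNs separately with a placeholder value
obj_cols = out.select_dtypes(include=['object']).columns
out[obj_cols] = out[obj_cols].fillna('blank')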

I would recommend looking at the sklearn Imputer transformer. I don't think it can drop columns, but it can definitely fill them in a 'generic way' - for example, filling in missing values with the median of the relevant column.

You could use it as such:

from sklearn.preprocessing import Imputer

imputer = Imputer(strategy='median')

num_df = df.values
names = df.columns.values

# fit the imputer on the data and fill the missing values in one step
df_final = pd.DataFrame(imputer.fit_transform(num_df), columns=names)

If you have additional transformations you would like to make, you could consider building a transformation Pipeline, or you could even write your own transformers for bespoke tasks.
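
As a sketch of that idea (using sklearn.impute.SimpleImputer, which replaced Imputer in newer scikit-learn versions; the scaler step is only an illustrative assumption):

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill NaNs with column medians
    ('scale', StandardScaler()),                   # illustrative extra transformation
])

# df is the dataframe from the question; keep only its numeric columns
num_df = df.select_dtypes(include=['number'])
df_final = pd.DataFrame(num_pipeline.fit_transform(num_df),
                        columns=num_df.columns)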

– FChm