Data imputation with fancyimpute and pandas

Question

I have a large pandas data frame df with quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means, or the most frequent values is not an option either (hence imputation with pandas and/or scikit-learn unfortunately doesn't do the trick).

I came across what seems to be a neat package called fancyimpute, but I have some problems with it.

Here is what I do:

# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN

# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float64])

# I now run fancyimpute KNN, 
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

However, df_filled is a single vector somehow, instead of the filled data frame. How do I get a hold of the data frame with imputations?

Update

I realized that fancyimpute needs a numpy array. I hence converted df_numeric to an array using as_matrix().

# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float64]).to_numpy()  # as_matrix() on pandas < 1.0

# I now run fancyimpute KNN, 
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

The output is a dataframe with the column labels gone missing. Any way to retrieve the labels?

`df_filled.columns = df_numeric.columns` ought to do it. that does look like an interesting package btw — JohnE, Jul 21 '17 at 14:25
I do think so, too! I am a bit disappointed with `pandas fillna()` and `sklearn.preprocessing.Imputer`. I did not come across a situation where I could put them to good use. I think they would greatly benefit from some more sophisticated ways to impute/interpolate missing data. — Rachel, Jul 21 '17 at 14:29

score 7 · Answer 1 · answered Jul 21 '17 at 14:17

Add the following lines after your code:

df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index
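
A minimal sketch of the whole round trip, assuming df_numeric is kept as a DataFrame and only its values are handed to the imputer. `impute_stub` is a hypothetical stand-in (column means) so the example runs without fancyimpute; the sample data and labels are made up.

```python
import numpy as np
import pandas as pd


def impute_stub(values):
    """Hypothetical stand-in for an imputer such as KNN(3): takes a 2-D
    array and returns a filled 2-D array (column means, illustration only)."""
    filled = values.copy()
    col_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan], "name": ["x", "y", "z"]},
    index=["r1", "r2", "r3"],
)

# Keep df_numeric as a DataFrame so its labels stay available
df_numeric = df.select_dtypes(include="float64")

# Hand only the raw values to the imputer, then wrap the result again
df_filled = pd.DataFrame(impute_stub(df_numeric.to_numpy()))

# The two lines from this answer restore the labels
df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index

print(df_filled)
```

With a real imputer, the `impute_stub(...)` call would be replaced by the imputer's own call on the same array.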

answered Jul 21 '17 at 14:17

Miriam Farber

Thank you, Miriam! My head was all filled with finding something in the `fancyimpute` documentation that I forgot about the simple solution. Perfect answer! – Rachel Jul 21 '17 at 14:20

score 5 · Answer 2 · answered Feb 03 '19 at 22:23

I see the frustration with fancyimpute and pandas. Here is a fairly basic wrapper that subclasses SoftImpute and overrides fit_transform. It takes in and outputs a dataframe, with column names intact. This sort of wrapper works well with pipelines.

from fancyimpute import SoftImpute
import pandas as pd

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100,max_rank=None,n_power_iterations=1,init_fill_method="zero",
                 min_value=None,max_value=None,normalizer=None,verbose=True):

        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value, 
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters,max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value,max_value=max_value,
                                           normalizer=normalizer,verbose=False)



    def fit_transform(self, X, y=None):

        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"

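        # Columns with fewer than 10 missing values are zero-filled up front.
        # Assign the result back: inplace fillna on X[col] may act on a copy.
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col] = X[col].fillna(0.0)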

        z = super(SoftImputeDf, self).fit_transform(X.values)
        return pd.DataFrame(z, index=X.index, columns=X.columns)

score 3 · Answer 3 · answered Jun 27 '19 at 17:25

I really appreciate @jander081's approach, and expanded on it a tiny bit to deal with setting categorical columns. I had a problem where the categorical columns would get unset and create errors during training, so I modified the code as follows:

from fancyimpute import SoftImpute
import pandas as pd

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100,max_rank=None,n_power_iterations=1,init_fill_method="zero",
                 min_value=None,max_value=None,normalizer=None,verbose=True):

        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value, 
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters,max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value,max_value=max_value,
                                           normalizer=normalizer,verbose=False)



    def fit_transform(self, X, y=None):

        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"

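        # Columns with fewer than 10 missing values are zero-filled up front.
        # Assign the result back: inplace fillna on X[col] may act on a copy.
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col] = X[col].fillna(0.0)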

        z = super(SoftImputeDf, self).fit_transform(X.values)
        df = pd.DataFrame(z, index=X.index, columns=X.columns)
        cats = list(X.select_dtypes(include='category'))
        df[cats] = df[cats].astype('category')

        # return pd.DataFrame(z, index=X.index, columns=X.columns)
        return df

When I call the fit_transform method, what parameters should I pass to it to impute? I am using a CSV file. — mayaaa, Dec 27 '19 at 11:51
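
On the parameters question, a minimal sketch, assuming the wrapper keeps the signature shown above: fit_transform receives the DataFrame itself, so a CSV file would be loaded with pd.read_csv first. `MeanImputeDf` is a hypothetical stand-in with the same DataFrame-in, DataFrame-out interface, used here only so the example runs without fancyimpute. The CSV content and column names are made up.

```python
import io

import pandas as pd


class MeanImputeDf:
    """Hypothetical stand-in with the same fit_transform(X) interface as
    SoftImputeDf; it fills with column means instead of running SoftImpute."""

    def fit_transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"
        return X.fillna(X.mean())


# Stand-in for a CSV file on disk; with a real file, pass its path to read_csv
csv_text = "id,height,weight\nA,1.5,\nB,,70.0\nC,1.7,80.0\n"
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# The only required argument is the DataFrame; keep just the numeric columns
X = df.select_dtypes(include="number")
df_filled = MeanImputeDf().fit_transform(X)

print(df_filled)
```

Under that assumption, swapping `MeanImputeDf()` for `SoftImputeDf()` should leave the rest of the call unchanged.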

score 2 · Accepted Answer · edited Mar 30 '18 at 03:38

df = pd.DataFrame(data=mice.complete(d), columns=d.columns, index=d.index)

The np.array returned by the .complete() method of the fancyimpute object (be it mice or KNN) is passed as the content (argument data=) of a pandas dataframe whose columns and index are the same as those of the original data frame.
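
A short sketch of why passing index= matters as well as columns=, assuming the original frame d does not have the default 0..n-1 index. `filled_array` is a hypothetical stand-in for the array an imputer would return; its values are made up.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]},
    index=[10, 20, 30],
)

# Stand-in for the array an imputer returns (hypothetical values)
filled_array = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])

without_labels = pd.DataFrame(filled_array)
with_labels = pd.DataFrame(data=filled_array, columns=d.columns, index=d.index)

print(without_labels.index.tolist())  # [0, 1, 2]
print(with_labels.index.tolist())     # [10, 20, 30]

# Matching labels let pandas align the result with the original frame,
# for example to keep only the cells that were actually imputed
imputed_only = with_labels.where(d.isna())
print(imputed_only)
```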

edited Mar 30 '18 at 03:38

4b0

answered Mar 29 '18 at 17:38

NicolasWoloszko

Can you please explain this answer? – Sterling Archer Mar 29 '18 at 17:59
Sure. The np.array that is returned by the .complete() method of the fancyimpute object (be it mice or KNN) is fed as the content (argument data=) of a pandas dataframe whose cols and indexes are the same as the original data frame – NicolasWoloszko Mar 29 '18 at 19:46
