1

I am working through the titanic kaggle problem and one of the things I am looking to do is use sklearn's IterativeImputer to fill in my missing values.

I am hitting a roadblock after I run the imputation and generate my "filled" values. I am wondering how best to update the original dataframe with the filled values.

Code:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.impute import IterativeImputer

from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

titanic = pd.DataFrame(
    {
     "PassengerId": [1, 2, 3, 4, 5],
     "Survived": [0, 1, 1, 1, 0],
     "PClass": ['3', '1', '3', '1', '3'],
     "Name": ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
              'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry'],
     "Sex": ['male', 'female', 'female', 'female', 'male'],
     "Age": [22, 38, 26, np.nan, 35],
     "SibSp": [1, 1, 0, 1, 0],
     "Parch": [0, 0, 0, 0, 0],
     "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05]
     }
    )

# Slicing dataframe to feed to imputer
titanic_sliced = titanic.loc[:, ['Age', 'SibSp', 'Parch', 'Fare']]
titanic_sliced.head()

Output of sliced dataset:

    Age  SibSp  Parch     Fare
0  22.0      1      0   7.2500
1  38.0      1      0  71.2833
2  26.0      0      0   7.9250
3   NaN      1      0  53.1000
4  35.0      0      0   8.0500

Run imputer with a Random Forest estimator

imp = IterativeImputer(RandomForestRegressor(), max_iter=10, random_state=0)
imputed_titanic = pd.DataFrame(imp.fit_transform(titanic_sliced), columns=titanic_sliced.columns)
imputed_titanic

Output of imputed_titanic:

     Age  SibSp  Parch     Fare
0  22.00    1.0    0.0   7.2500
1  38.00    1.0    0.0  71.2833
2  26.00    0.0    0.0   7.9250
3  36.11    1.0    0.0  53.1000
4  35.00    0.0    0.0   8.0500

So now my question is, what is the best way to update the original dataframe with the imputed values?

ShuN
  • This is not an MRE: your posted code fails to run, as you failed to initialize your data frame. – Prune Feb 08 '21 at 16:17
  • @Prune I'm sure you're probably annoyed with me by now but thanks again. I realized now I was being quite lazy with my initial ask. I've actually learned quite a bit just getting my question into proper format. I ran the updated code and it reproduces exactly what I expect. – ShuN Feb 08 '21 at 17:31
  • That's the idea: learn from mistakes, do better next time. So long as you're trying to improve, we're happy to work with you. – Prune Feb 08 '21 at 21:54

2 Answers

1

You can't, as given. The imputer returns a bare array, so the new data frame no longer carries the row labels that say where each value belongs. Instead, you have to maintain that information in some way. I recommend that you transfer the index to a simple data column, where you can recover it later.
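
A minimal sketch of one way to hold on to the row labels, assuming the rows come back from the imputer in the same order they went in. The index values 101 to 105, the shortened names, and `n_estimators=10` are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy frame with a non-default index (hypothetical labels) so lost labels would be visible.
titanic = pd.DataFrame(
    {
        "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle", "Allen"],
        "Age": [22, 38, 26, np.nan, 35],
        "SibSp": [1, 1, 0, 1, 0],
        "Parch": [0, 0, 0, 0, 0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    },
    index=[101, 102, 103, 104, 105],
)

numeric_cols = ["Age", "SibSp", "Parch", "Fare"]
titanic_sliced = titanic[numeric_cols]

imp = IterativeImputer(
    RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=10,
    random_state=0,
)

# fit_transform returns a NumPy array by default, so the labels are
# re-attached here by passing the original index and columns explicitly.
imputed_titanic = pd.DataFrame(
    imp.fit_transform(titanic_sliced),
    columns=titanic_sliced.columns,
    index=titanic_sliced.index,
)

# With matching labels, the non-imputed columns can be joined back on the index.
filled = titanic.drop(columns=numeric_cols).join(imputed_titanic)
```

Because the index never goes through the imputer, it cannot influence the imputed values.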

Prune
  • I gave this a shot and it works well, but including the index will affect the imputation result. Is there a way to do this without considering the index column for imputation? – ShuN Feb 07 '21 at 18:26
  • We can't tell, and you neglected to include the expected [MRE - Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). When I've run into this problem, I've simply held onto the original data frame, done the imputation without the interfering data, and then matched the result according to *other* values in the original DF. Even more effective is to use an imputation routine that is smart enough to let you specify which columns are involved. – Prune Feb 07 '21 at 18:29
  • Sorry, thought my original post contained the relevant code to reproduce, but as I got more specific I realized I should have included the code. Thanks for the suggestions. As I am still new to this I will do what you suggested, match back using the other values. I would love to get to a point where I can implement your latter suggestion, but for learning I will hold off. – ShuN Feb 07 '21 at 18:33
  • Check the suggestions for an MRE again. [Include your minimal data frame](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) as part of the example. Your code dies because there's no `train.csv` file. The question dies because you haven't shown the stages of processing. – Prune Feb 07 '21 at 19:01
  • Updated the question based on the suggestions you shared and the MRE guide you linked. Would appreciate any feedback on improvements I can still make to my question. I recognize that you've already gone above and beyond here though, thanks very much for both answering my question and helping me understand how to properly ask questions here. – ShuN Feb 08 '21 at 15:46
1

Thanks to user Prune I was able to figure this out. I am not sure this is the best method, but in my instance the solution was quite simple. Because the order of my data was the same before and after imputation, I was able to use combine_first to fill in the missing values in my original dataframe while keeping all of its non-missing values.

I don't believe this is ideal, as the row order may not always match. I would like to develop this further to make it "smarter" (i.e. be able to specify which values to impute, or somehow include an index/join key).

Here is the code I ultimately used:

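# Fill only the missing cells of titanic with values from imputed_titanic,
# matched on index and column labels.
df3 = titanic.combine_first(imputed_titanic)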
df3.head()

Output:

    Age Cabin Embarked     Fare                                               Name  Parch  PassengerId  Pclass     Sex  SibSp  Survived            Ticket
0  22.0   NaN        S   7.2500                            Braund, Mr. Owen Harris      0            1       3    male      1         0         A/5 21171
1  38.0   C85        C  71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th...      0            2       1  female      1         1          PC 17599
2  26.0   NaN        S   7.9250                             Heikkinen, Miss. Laina      0            3       3  female      0         1  STON/O2. 3101282
3  35.0  C123        S  53.1000       Futrelle, Mrs. Jacques Heath (Lily May Peel)      0            4       1  female      1         1            113803
4  35.0   NaN        S   8.0500                           Allen, Mr. William Henry      0            5       3    male      0         0            373450
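
For the "smarter" version with a join key, something along these lines might work. It is only a sketch: it assumes PassengerId is unique, and the `imputed` frame is a hand-written stand-in for the imputer output (deliberately shuffled, reusing the 36.11 from the question), not a real imputation result.

```python
import numpy as np
import pandas as pd

titanic = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "Sex": ["male", "female", "female", "female", "male"],
        "Age": [22, 38, 26, np.nan, 35],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    }
)

# Stand-in for the imputer output (hypothetical): same rows, labelled by
# PassengerId, but in a different order than the original frame.
imputed = pd.DataFrame(
    {
        "Age": [35.0, 36.11, 26.0, 38.0, 22.0],
        "Fare": [8.05, 53.1, 7.925, 71.2833, 7.25],
    },
    index=pd.Index([5, 4, 3, 2, 1], name="PassengerId"),
)

# Put the key in the index on both sides so pandas matches rows by label,
# not by position.
keyed = titanic.set_index("PassengerId")

# fillna with a DataFrame fills only the missing cells, aligned on index and
# column labels; columns absent from `imputed` (Sex) are left untouched.
result = keyed.fillna(imputed).reset_index()
```

Unlike combine_first, which returned the columns in sorted order in the output above, this keeps the original column layout and does not depend on the two frames sharing a row order.
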
ShuN