I am working through the Kaggle Titanic problem, and one of the things I want to do is use scikit-learn's IterativeImputer to fill in my missing values.
I hit a roadblock after running the imputation and generating my "filled" values: I am not sure how best to update the original dataframe with them.
Code:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
titanic = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "Survived": [0, 1, 1, 1, 0],
        "PClass": ['3', '1', '3', '1', '3'],
        "Name": ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
                 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry'],
        "Sex": ['male', 'female', 'female', 'female', 'male'],
        "Age": [22, 38, 26, np.nan, 35],
        "SibSp": [1, 1, 0, 1, 0],
        "Parch": [0, 0, 0, 0, 0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05]
    }
)
# Slicing dataframe to feed to imputer
titanic_sliced = titanic.loc[:, ['Age', 'SibSp', 'Parch', 'Fare']]
titanic_sliced.head()
Output of sliced dataset:
    Age  SibSp  Parch     Fare
0  22.0      1      0   7.2500
1  38.0      1      0  71.2833
2  26.0      0      0   7.9250
3   NaN      1      0  53.1000
4  35.0      0      0   8.0500
Running the imputer with a Random Forest estimator:
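imp = IterativeImputer(estimator=RandomForestRegressor(), max_iter=10, random_state=0)
# fit_transform returns a NumPy array, so restore the column names and the original index
imputed_titanic = pd.DataFrame(imp.fit_transform(titanic_sliced), columns=titanic_sliced.columns, index=titanic_sliced.index)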
imputed_titanic
Output of imputed_titanic:
     Age  SibSp  Parch     Fare
0  22.00    1.0    0.0   7.2500
1  38.00    1.0    0.0  71.2833
2  26.00    0.0    0.0   7.9250
3  36.11    1.0    0.0  53.1000
4  35.00    0.0    0.0   8.0500
So now my question is, what is the best way to update the original dataframe with the imputed values?
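For reference, one possible approach (a minimal sketch, not necessarily the best way) is to keep the original index on the imputed frame and assign back only the columns that actually contained missing values. The names num_cols, cols_with_nan and titanic_filled are made up for this illustration, the frame is a trimmed copy of the example above, and fixing random_state on the regressor is an assumption added only to make the run repeatable.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import pandas as pd

# Trimmed version of the example frame from the question
titanic = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "Sex": ["male", "female", "female", "female", "male"],
        "Age": [22, 38, 26, np.nan, 35],
        "SibSp": [1, 1, 0, 1, 0],
        "Parch": [0, 0, 0, 0, 0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    }
)

# Numeric columns fed to the imputer (hypothetical name)
num_cols = ["Age", "SibSp", "Parch", "Fare"]

imp = IterativeImputer(
    estimator=RandomForestRegressor(random_state=0),
    max_iter=10,
    random_state=0,
)

# Keep the original index so rows line up when assigning back
imputed = pd.DataFrame(
    imp.fit_transform(titanic[num_cols]),
    columns=num_cols,
    index=titanic.index,
)

# Overwrite only the columns that had missing values, so integer
# columns such as SibSp are not converted to float
cols_with_nan = [c for c in num_cols if titanic[c].isna().any()]
titanic_filled = titanic.copy()
titanic_filled[cols_with_nan] = imputed[cols_with_nan]
```

Working on a copy leaves the original titanic frame untouched. If every numeric column should be overwritten instead, assigning imputed to titanic_filled[num_cols] would also work, at the cost of turning the integer columns into floats.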