I have a pandas DataFrame in Python with missing values in the column Engine_Model, but other rows for the same car model contain the missing information:

Car_model        Engine_Model
BMW 5            type A
Renault 21       type B
BMW 5            NaN
Hyunday Santro   type C

For example, the NaN here should be filled with 'type A', since that value appears in the first row for the same car model. How can I look up the value to fill each NaN, given that the engine model is the same for all cars of the same model?

I have obtained the indices of the missing values and the car model names for those rows:

Engine_model_missing_indices = data[data['Engine_Model'].isnull()].index

Carmodel_missing = data.loc[Engine_model_missing_indices, 'Car_model']
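
One way to continue from those two variables, as a rough sketch rather than a definitive answer: build a lookup from the rows where the engine model is known, then map it onto the car models of the missing rows. The toy `data` frame and the names `known` and `lookup` are assumptions made for illustration; they do not come from the original dataset.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the table in the question (assumed for illustration)
data = pd.DataFrame({
    'Car_model': ['BMW 5', 'Renault 21', 'BMW 5', 'Hyunday Santro'],
    'Engine_Model': ['type A', 'type B', np.nan, 'type C'],
})

# Rows whose engine model is missing, and the car models of those rows
Engine_model_missing_indices = data[data['Engine_Model'].isnull()].index
Carmodel_missing = data.loc[Engine_model_missing_indices, 'Car_model']

# Lookup table: one known engine model per car model
known = data.dropna(subset=['Engine_Model']).drop_duplicates('Car_model')
lookup = known.set_index('Car_model')['Engine_Model']

# Fill only the missing rows; car models with no known engine stay NaN
data.loc[Engine_model_missing_indices, 'Engine_Model'] = Carmodel_missing.map(lookup)

print(data)
```

This relies on the stated assumption that every car of the same model has the same engine model, so keeping the first known value per car model is enough.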

ljc

1 Answer

I found a similar solution that imputes missing values with a within-group mean. Adapting it to use the most frequent value (the mode) instead, a working solution would look something like this:

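import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

example_df = pd.DataFrame({
    'Car_model': ['BMW 5', 'Renault 21', 'BMW 5', 'Hyunday Santro'],
    'Engine_Model': ['type A', 'type B', np.nan, 'type C']
})


def _mode_or_nan(s):
    # most frequent non-missing value, or NaN if every value is missing
    # (pandas is used because recent SciPy releases reject non-numeric
    # input to scipy.stats.mode)
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan


class WithinGroupModeImputer(BaseEstimator, TransformerMixin):
    def __init__(self, group_var):
        self.group_var = group_var

    def fit(self, X, y=None):
        # nothing is learned; the modes are computed in transform
        return self

    def transform(self, X):
        # the copy leaves the original dataframe intact
        X_ = X.copy()
        for col in X_.columns:
            # only text columns are imputed; the grouping column is skipped
            is_text = (pd.api.types.is_object_dtype(X_[col])
                       or pd.api.types.is_string_dtype(X_[col]))
            if col != self.group_var and is_text:
                # fill from the most frequent value within the same group
                group_modes = X_.groupby(self.group_var)[col].agg(_mode_or_nan)
                X_[col] = X_[col].fillna(X_[self.group_var].map(group_modes))
                # fall back to the overall mode where the group has no value
                X_[col] = X_[col].fillna(_mode_or_nan(X_[col]))
        return X_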


imp = WithinGroupModeImputer(group_var='Car_model')
imp.fit(example_df)
imp.transform(example_df)

And the output would be:

        Car_model Engine_Model
0           BMW 5       type A
1      Renault 21       type B
2           BMW 5       type A
3  Hyunday Santro       type C
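
If a scikit-learn transformer is more than you need, a shorter pandas-only variant might be enough. This is only a sketch, assuming (as the question states) that each car model has a single engine model, so the first known value per group can stand in for the mode. The names `first_known` and `filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

example_df = pd.DataFrame({
    'Car_model': ['BMW 5', 'Renault 21', 'BMW 5', 'Hyunday Santro'],
    'Engine_Model': ['type A', 'type B', np.nan, 'type C']
})

# 'first' takes the first non-null value in each group; transform
# broadcasts it back to every row of that group
first_known = example_df.groupby('Car_model')['Engine_Model'].transform('first')

# fillna only touches the missing entries; assign returns a new dataframe
filled = example_df.assign(
    Engine_Model=example_df['Engine_Model'].fillna(first_known)
)

print(filled)
```

Unlike the class above, this has no fallback to the overall mode: a car model with no known engine model at all stays NaN.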
Kirill Setdekov