
Before imputing, I had numerical columns in `X_train`:

```python
numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]
numerical_cols
```

After imputing, the new DataFrame `imputed_X_train_missing` has no numerical columns left: every column in `numerical_cols` now has dtype 'object'. This is a potential problem when applying XGBRegressor.

This is my code:

```python
X_valid_missing = X_valid.copy()

my_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

my_imputer.fit(X_train_missing)
imputed_X_train_missing = pd.DataFrame(my_imputer.transform(X_train_missing))
imputed_X_valid_missing = pd.DataFrame(my_imputer.transform(X_valid_missing))

imputed_X_train_missing.columns = X_train_missing.columns
imputed_X_valid_missing.columns = X_valid_missing.columns
```

2 Answers

0

This may be seen more as treating the symptom instead of the cause, but you could just change the resulting dtypes to a numeric datatype.

Using astype() - Pandas: convert dtype 'object' to int

Using to_numeric() - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html
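As a rough illustration of both options, here is a minimal sketch. The frame `imputed` and the list `numerical_cols` are made-up stand-ins for an imputed DataFrame whose columns all came back as 'object'; they are not taken from the question's data.

```python
import pandas as pd

# Toy stand-in for an imputed frame: every column has dtype object.
imputed = pd.DataFrame(
    {"c1": ["dddd", "dddd", "dddd"], "c2": [2.0, 2.0, 5.0], "c3": [3, 6, 9]},
    dtype=object,
)

# Hypothetical list of the columns that were numeric before imputing.
numerical_cols = ["c2", "c3"]

# Option 1: astype() with an explicit target dtype per column.
converted_astype = imputed.astype({"c2": "float64", "c3": "int64"})

# Option 2: to_numeric() lets pandas pick the numeric dtype for each column.
converted_to_numeric = imputed.copy()
for col in numerical_cols:
    converted_to_numeric[col] = pd.to_numeric(converted_to_numeric[col])

print(converted_astype.dtypes)
print(converted_to_numeric.dtypes)
```

With astype() you state the target dtype yourself, which is handy if you saved the original dtypes before imputing. to_numeric() is more convenient when you only know which columns should be numeric.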

0

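The problem is how the imputer behaves when one of the columns is 'object'. `transform()` returns a single NumPy array, and an array holding both strings and numbers can only have dtype object, so after imputation every column comes back as 'object':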

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_train = [['dddd', 2, 3], ['dddd', np.nan, 6], ['dddd', 5, 9]]
X_test = [[np.nan, 2, 3], ['dddd', np.nan, 6], ['dddd', np.nan, 9]]

col_names = ['c1', 'c2', 'c3']

df_x_train = pd.DataFrame(X_train, columns=col_names)
df_x_test = pd.DataFrame(X_test, columns=col_names)
# info() prints its summary itself and returns None, so no print() is needed
df_x_train.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      3 non-null      object
 1   c2      2 non-null      float64
 2   c3      3 non-null      int64
```

```python
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp.fit(df_x_train)
imputed_x_train = pd.DataFrame(imp.transform(df_x_train))
imputed_x_train.dtypes
```

Now all the columns have dtype object:

```
0    object
1    object
2    object
dtype: object
```
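One possible way around this is to cast each column back to the dtype it had before imputation. This is only a sketch on the same toy data, and the casting step is my suggestion rather than anything SimpleImputer does for you. It assumes the imputed values fit the original dtypes, which holds for 'most_frequent' because the fill values are taken from the column itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_x_train = pd.DataFrame(
    [["dddd", 2, 3], ["dddd", np.nan, 6], ["dddd", 5, 9]],
    columns=["c1", "c2", "c3"],
)

imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# transform() returns a NumPy array; with mixed input it has dtype object.
# Passing columns and index keeps the original labels on the new frame.
imputed_x_train = pd.DataFrame(
    imp.fit_transform(df_x_train),
    columns=df_x_train.columns,
    index=df_x_train.index,
)

# Cast each column back to the dtype it had before imputation.
imputed_x_train = imputed_x_train.astype(df_x_train.dtypes.to_dict())
print(imputed_x_train.dtypes)
```

The same dtype mapping from the training frame can be reused for the validation frame, as long as both frames have the same columns.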