
I am trying to build a pipeline in order to perform GridSearchCV to find the best parameters. I already split the data into train and validation and have the following code:

cols = ['home_ownership', 'purpose', 'addr_state', 'application_type', 'term']

column_transformer = make_pipeline(
    OneHotEncoder(categories=cols),
    OrdinalEncoder(categories=X["grade"]),
    "passthrough")

imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
model = SGDClassifier(loss='log', random_state=42, n_jobs=-1, warm_start=True)

pipeline_sgdlogreg = make_pipeline(imputer, column_transformer, scaler, model)

When I perform GridSearchCV I am getting the following error:

"cannot use median strategy with non-numeric data (...)"

I do not understand why I am getting this error. None of the categorical variables have missing values.

I am performing the following steps: Imputation -> Encoding -> Scaling -> Modeling

Can anyone shed some light?

1 Answer


I do not understand why I am getting this error. None of the categorical variables have missing values.

Whether or not any values are missing, sklearn throws the error as soon as it sees a non-numeric column that it is being asked to handle with the median strategy. The imputer is the first step of your pipeline, so it receives every column, including the categorical ones. It doesn't look at whether values are actually missing when it does its checks; it only knows it could not deal with any that did arise under the given strategy, so it raises the exception.
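To see this in isolation, here is a minimal sketch that reproduces the error outside the pipeline. The toy frame and its column names (`loan_amnt`, `term`) are made up for illustration and are not taken from your data. Note that the string column has no missing values at all.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: a numeric column with one gap,
# and a string column with no gaps at all.
X_demo = pd.DataFrame({
    "loan_amnt": [1000.0, None, 3000.0],
    "term": ["36 months", "60 months", "36 months"],
})

try:
    # The median strategy needs an all-numeric array, so fitting fails
    # on the string column even though that column is complete.
    SimpleImputer(strategy="median").fit(X_demo)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Dropping the `term` column from the frame should make the same call succeed, which suggests the column type, not the missing values, is what triggers the check.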

"cannot use median strategy with non-numeric data (...)"

This means just what it says: a median can only be computed for numeric data. You'll need to create a custom imputer if you want to use it in the pipeline. The question linked below isn't quite the same as yours, but its second-ranked answer outlines a method that will work for you.

Impute categorical missing values in scikit-learn
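As a rough idea of what such a custom imputer could look like, here is a minimal sketch. It is my own illustration, not the code from the linked answer. The class name `MixedTypeImputer` and the toy columns are hypothetical, and it assumes the input is (or can be turned into) a pandas DataFrame and that no column is entirely missing.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MixedTypeImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: median for numeric columns, most frequent value otherwise."""

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        fill = {}
        for col in X.columns:
            if pd.api.types.is_numeric_dtype(X[col]):
                fill[col] = X[col].median()
            else:
                # Assumes at least one non-missing value per column.
                fill[col] = X[col].mode(dropna=True).iloc[0]
        # Fitted state: one fill value per column, learned on the training data.
        self.fill_ = fill
        return self

    def transform(self, X):
        # fillna with a dict fills each column with its own value.
        return pd.DataFrame(X).fillna(value=self.fill_)


# Hypothetical toy data with a gap in each column.
X_demo = pd.DataFrame({
    "loan_amnt": [1000.0, np.nan, 3000.0],
    "term": ["36 months", None, "36 months"],
})

X_filled = MixedTypeImputer().fit_transform(X_demo)
print(X_filled)
```

Since it relies on column dtypes, a transformer like this would need to be the first step of the pipeline, before encoding and scaling turn the data into a plain array.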

remsky