
I am struggling with a machine learning project in which I am trying to combine:

  • a scikit-learn ColumnTransformer to apply different transformers to my numerical and categorical features
  • a pipeline to apply my different transformers and estimators
  • a GridSearchCV to search for the best parameters.

As long as I fill in the parameters of my transformers manually in the pipeline, the code works perfectly. But as soon as I pass lists of values to compare in my grid search parameters, I get all kinds of "invalid parameter" error messages.

Here is my code:

First, I divide my features into numerical and categorical:

import numpy as np
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder


numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)

Then I create 2 different preprocessing pipelines for numerical and categorical features:

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

I then combine both into another pipeline, set my parameters, and run GridSearchCV:

model = make_pipeline(preprocessor, LinearRegression())

params = {
    'columntransformer__numerical_pipeline__knnimputer__n_neighbors': [1, 2, 3, 4, 5, 6, 7]
}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=10)
cv = KFold(n_splits=5)
all_accuracies = cross_val_score(grid, X, y, cv=cv, scoring='r2')

I tried different ways to declare the parameters, but never found the proper one. I always get an "invalid parameter" error message.

Could you please help me understand what went wrong?

Really a lot of thanks for your support, and take good care!

Venkatachalam

1 Answer


I am assuming that you defined preprocessor as follows:

preprocessor = Pipeline([('numerical_pipeline',numerical_pipeline),
                        ('cat_pipeline', cat_pipeline)])

In that case, you need to change your parameter name to the following:

pipeline__numerical_pipeline__knnimputer__n_neighbors
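A quick way to check which names GridSearchCV will accept is to list the keys of model.get_params(). The sketch below assumes the preprocessor definition guessed above (it is not shown in the question) and filters the keys down to the n_neighbors setting:

```python
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Assumed definition of preprocessor; the question does not show it
preprocessor = Pipeline([('numerical_pipeline', numerical_pipeline),
                         ('cat_pipeline', cat_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

# Every key of get_params() is a valid name for param_grid;
# nested steps are joined with a double underscore
n_neighbors_keys = [k for k in model.get_params() if k.endswith('n_neighbors')]
print(n_neighbors_keys)
```

The outer step is called pipeline rather than columntransformer because make_pipeline names each step after the lowercased class name of the object it receives.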

However, there are a couple of other problems with the code:

  1. You don't have to call cross_val_score after performing GridSearchCV, unless you specifically want nested cross-validation. GridSearchCV already stores the cross-validation results for each combination of hyperparameters in its cv_results_ attribute.

  2. KNNImputer does not work when your data contains strings. You need to apply cat_pipeline before numerical_pipeline.
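To illustrate the first point, here is a minimal sketch of reading the per-combination scores straight from a fitted GridSearchCV. The synthetic data, the missing-value rate and the parameter values are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic numerical data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)
X[rng.rand(40, 3) < 0.1] = np.nan  # blank out roughly 10% of the entries

model = make_pipeline(KNNImputer(), LinearRegression())
params = {'knnimputer__n_neighbors': [1, 3, 5]}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=5)
grid.fit(X, y)

# One mean cross-validated score per parameter combination
for candidate, score in zip(grid.cv_results_['params'],
                            grid.cv_results_['mean_test_score']):
    print(candidate, score)
print(grid.best_params_)
```

Passing the grid object to cross_val_score is still valid, but it runs a full grid search inside every outer fold, which is nested cross-validation and considerably more expensive.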

Complete example:

import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: one categorical column with a missing value, one numerical column
X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'rating': [5, 3, 4, 5]})

y = [1, 0, 1, 1]


# Defined as in the question; the Pipeline below does not use them
numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)

numerical_pipeline = make_pipeline(KNNImputer())
# sparse_output=False requires scikit-learn >= 1.2; older versions use sparse=False
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore', sparse_output=False))
# cat_pipeline runs first, so KNNImputer only ever sees numeric data
preprocessor = Pipeline([('cat_pipeline', cat_pipeline),
                         ('numerical_pipeline', numerical_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

params = {
    'pipeline__numerical_pipeline__knnimputer__n_neighbors': [1, 2]
}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)

grid.fit(X, y)
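For comparison, here is a sketch of a different approach from the example above: routing columns with make_column_transformer, so that each sub-pipeline only sees its own feature type. This is an assumption about what the question originally intended, not the setup used in this answer. make_column_transformer names the two sub-pipelines pipeline-1 and pipeline-2, so the parameter name changes accordingly:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'rating': [5, 3, 4, 5]})
y = [1, 0, 1, 1]

numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Each sub-pipeline receives only the columns picked by its selector
preprocessor = make_column_transformer((numerical_pipeline, numerical_features),
                                       (cat_pipeline, cat_features))
model = make_pipeline(preprocessor, LinearRegression())

# Auto-generated names: 'columntransformer' for the outer step,
# 'pipeline-1' for the first (numerical) sub-pipeline
param_name = 'columntransformer__pipeline-1__knnimputer__n_neighbors'
params = {param_name: [1, 2]}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)
grid.fit(X, y)
print(grid.best_params_)
```

With this layout the order of the two transformers does not matter, because KNNImputer never receives the string column.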
Venkatachalam
  • Thanks for your prompt and highly detailed feedback. This was really helpful and is working perfectly well now. To create the preprocessor pipeline, I used make_pipeline instead of Pipeline, and it seems this was the problem. Thanks also for the tips regarding cat_pipeline and numerical_pipeline. I thought that since the column transformer allowed different preprocessing, their order didn't matter. Once again thanks a lot, and have a nice day! – Xavier Fournat Jun 12 '20 at 14:13
  • The order of ColumnTransformer transformers doesn't matter. This answer doesn't include a ColumnTransformer, but another pipeline. – Ben Reiniger Jun 13 '20 at 03:35
  • Yes, right. Here using ColumnTransformer wouldn't be useful because we have to apply the one-hot encoder before KNNImputer. – Venkatachalam Jun 13 '20 at 08:25
  • @Venkatachalam: There is one other element I do not understand. In my numerical features, I would like to add additional transformers as follows: numerical_pipeline = make_pipeline(KNNImputer(), preprocessing.StandardScaler(), PolynomialFeatures()). But I didn't find the proper way to test the polynomial features in the GridSearchCV parameters. Following your guidance, I added 'pipeline__numerical_pipeline__polynomialfeatures__degree': [2, 3, 4], but it seems this is not working. Could you please help me in this case also? Thanks a lot – Xavier Fournat Jun 13 '20 at 19:48
  • Sure, could you ask this as a separate question with more details? – Venkatachalam Jun 14 '20 at 01:55
  • But where do numerical_features and cat_features come into the picture later? After defining which columns belong to which type, one should instruct the respective pipelines to work on the specific feature types. – Fredrik Aug 08 '20 at 16:57