Scikit learn GridSearchCV with pipeline with custom transformer

Question

I'm trying to perform a GridSearchCV on a pipeline with a custom transformer. The transformer enriches the features "year" and "odometer" polynomially and one hot encodes the rest of the features. The ML model is a simple linear regression model.

custom transformer code:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder 
from sklearn.preprocessing import PolynomialFeatures

class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree = 2, poly_features = ['year', 'odometer']):
        self.degree_ = degree
        self.poly_features_ = poly_features       
    def fit(self, X, y=None):
        # Nothing is learned here; just return the transformer itself
        return self
    def transform(self, X, y=None):
        poly_feat = PolynomialFeatures(degree=self.degree_)
        OneHot = OneHotEncoder(sparse=False)

        not_poly_features = list(set(X.columns) - set(self.poly_features_))
        poly = poly_feat.fit_transform(X[self.poly_features_].to_numpy())
        poly = np.hstack([poly, OneHot.fit_transform(X[not_poly_features].to_numpy())])

        return poly
    def get_params(self, deep=True):
        return {"degree": self.degree_, "poly_features": self.poly_features_}

pipeline & gridsearch code:

#create pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

poly_pipeline =  Pipeline(steps=[("cpf", custom_poly_features()), ("lin_reg", LinearRegression(n_jobs=-1))])

#perform gridsearch
from sklearn.model_selection import GridSearchCV
param_grid = {"cpf__degree": [3, 4, 5]}

search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3)
search.fit(X_train_ordinal, y_train)

The custom transformer itself works fine and the pipeline also works (although the score is not great, but that is not the topic here).

poly_pipeline.fit(X_train, y_train).score(X_test, y_test)

Output:
0.543546844381771

However, when I perform the gridsearch, the scores are all nan values:

search.cv_results_

Output:
{'mean_fit_time': array([4.46928191, 4.58259885, 4.55605125]),
 'std_fit_time': array([0.18111937, 0.03305779, 0.02080789]),
 'mean_score_time': array([0.21119197, 0.13816587, 0.11357466]),
 'std_score_time': array([0.09206233, 0.02171508, 0.02127906]),
 'param_custom_poly_features__degree': masked_array(data=[3, 4, 5],
          mask=[False, False, False],
    fill_value='?',
         dtype=object),
 'params': [{'custom_poly_features__degree': 3},
  {'custom_poly_features__degree': 4},
  {'custom_poly_features__degree': 5}],
 'split0_test_score': array([nan, nan, nan]),
 'split1_test_score': array([nan, nan, nan]),
 'split2_test_score': array([nan, nan, nan]),
 'mean_test_score': array([nan, nan, nan]),
 'std_test_score': array([nan, nan, nan]),
 'rank_test_score': array([1, 2, 3])}

Does anyone know what the problem is? The transformer and the pipeline work fine on their own after all.

Hello and welcome to SO. I'm not entirely sure what causes the NaNs, but there is an issue in your custom transformer. Transformers like `OneHotEncoder` learn from the data during `fit`, so they should not be re-fit at test time. Your transformer's `fit` should therefore call the `fit` methods of the polynomial and one-hot encoders rather than just returning `self`, and your `transform` should call their `transform` methods, not `fit_transform`. If your validation/test set happens to differ substantially from the training set, re-fitted transformations are likely to produce NaNs, especially with high-variance models like linear regression. — Sanjar Adilov, Jan 21 '22 at 08:04

score 1 · Answer 1 · answered Jan 22 '22 at 17:47

To debug searches in general, set error_score='raise', so that you get a full error traceback.
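As a rough illustration of what that setting changes, here is a minimal sketch. The FailingTransformer class and the random data are hypothetical stand-ins and are not taken from the question; the only part that matters is the error_score argument.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class FailingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that always fails, to trigger an error."""

    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        raise ValueError("deliberate failure inside transform")


# Synthetic regression data (assumption: any small numeric dataset works here).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=30)

pipe = Pipeline([("cpf", FailingTransformer()), ("lin_reg", LinearRegression())])

# error_score="raise" re-raises the first exception from a fit or score call
# instead of recording NaN for that candidate.
search = GridSearchCV(pipe, {"cpf__degree": [2, 3]}, cv=3, error_score="raise")

try:
    search.fit(X, y)
    caught = None
except ValueError as exc:
    caught = str(exc)

print(caught)
```

Once the underlying exception is visible, you can switch error_score back to its default.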

Your issue appears to be data-dependent; I can run this just fine on a custom dataset. That suggests to me that the comment by @Sanjar Adylov not only highlights an important issue, but the issue for your data: the train folds sometimes contain different values in some categorical feature(s) than the test folds, and so the one-hot encodings end up with different numbers of features, and the linear model justifiably breaks.
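To make the mechanism concrete, here is a small sketch with made-up folds (the colour values are purely illustrative). It shows that re-fitting the encoder on each fold gives different column counts, while fitting once and only transforming afterwards keeps the width fixed. The handle_unknown="ignore" option is an extra I am assuming may be useful for the opposite case, where the test fold contains a category the training fold never saw.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two hypothetical folds of one categorical feature.
train_fold = np.array([["red"], ["blue"], ["green"]])
test_fold = np.array([["red"], ["blue"]])

# Re-fitting on each fold: the number of output columns depends on the fold.
n_train = OneHotEncoder().fit_transform(train_fold).shape[1]
n_test = OneHotEncoder().fit_transform(test_fold).shape[1]
print(n_train, n_test)

# Fitting once on the training fold and only transforming the test fold
# keeps the number of columns consistent.
enc = OneHotEncoder().fit(train_fold)
n_consistent = enc.transform(test_fold).shape[1]
print(n_consistent)

# Reverse case: the encoder is fitted on data lacking "green" and then sees it.
# With handle_unknown="ignore" the unseen category becomes an all-zero row
# instead of raising an error.
enc_ignore = OneHotEncoder(handle_unknown="ignore").fit(test_fold)
encoded = enc_ignore.transform(train_fold).toarray()
print(encoded.shape, encoded[2].sum())
```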

So the fix there is also as Sanjar says: instantiate the two transformers, store them as attributes, and fit them in your fit method; then use their transform methods in your transform method.

You will find there is another big issue: all the scores in cv_results_ are the same. This is because you can't actually set the hyperparameters correctly, because in __init__ you've used mismatching names (degree as the parameter but degree_ as the attribute). Read more in the developer guide. (I think you can get around this by editing set_params similar to how you edited get_params, but it would be much easier to actually rely on the BaseEstimator versions of those and just match the parameter names to the attribute names.)
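The effect of the name mismatch can be seen without any search at all. The two classes below are hypothetical cut-down versions of the transformer, written only to show what set_params ends up doing in each case.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class Mismatched(TransformerMixin, BaseEstimator):
    """Parameter is called `degree`, but it is stored as `degree_`."""

    def __init__(self, degree=2):
        self.degree_ = degree

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

    def get_params(self, deep=True):
        return {"degree": self.degree_}


class Matched(TransformerMixin, BaseEstimator):
    """Parameter name and attribute name agree, so BaseEstimator just works."""

    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# set_params assigns to an attribute named after the parameter ("degree"),
# so the value the mismatched class actually uses (degree_) never changes.
bad = Mismatched().set_params(degree=5)
print(bad.degree_, bad.degree)

good = Matched().set_params(degree=5)
print(good.degree, good.get_params())
```

GridSearchCV applies each candidate through set_params on a clone of the pipeline, which is why every candidate ends up being fitted with the default degree.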

Also, note that setting a parameter default to a list can have surprising effects. Consider alternatives to the default of poly_features in __init__.
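The surprise is plain Python behaviour rather than anything specific to scikit-learn. A minimal sketch, using hypothetical helper functions instead of the estimator itself:

```python
def collect_shared(name, poly_features=[]):
    # The default list is created once, when the function is defined,
    # and is then shared by every call that relies on the default.
    poly_features.append(name)
    return poly_features


def collect_safe(name, poly_features=None):
    # A None sentinel is replaced by a fresh list on every call.
    features = list(poly_features) if poly_features is not None else []
    features.append(name)
    return features


a = collect_shared("year")
b = collect_shared("odometer")
print(a is b, a)  # same object, holding both names

c = collect_safe("year")
d = collect_safe("odometer")
print(c, d)  # two independent lists
```

For the transformer, two alternatives that avoid the shared list are an immutable default such as the tuple ('year', 'odometer'), or a default of None that is resolved inside fit rather than in __init__.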

class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=['year', 'odometer']):
        # Attribute names must match the parameter names exactly so that
        # BaseEstimator's get_params/set_params work.
        self.degree = degree
        self.poly_features = poly_features

    def fit(self, X, y=None):
        # Create and fit the sub-transformers here, on the training data only.
        self.poly_feat = PolynomialFeatures(degree=self.degree)
        # Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2.
        self.onehot = OneHotEncoder(sparse=False)

        self.not_poly_features_ = list(set(X.columns) - set(self.poly_features))

        self.poly_feat.fit(X[self.poly_features])
        self.onehot.fit(X[self.not_poly_features_])

        return self

    def transform(self, X, y=None):
        # Only transform here; re-fitting would change the output columns.
        poly = self.poly_feat.transform(X[self.poly_features])
        poly = np.hstack([poly, self.onehot.transform(X[self.not_poly_features_])])
        return poly

There are some additional things you might want to add, like checks for whether poly_features or not_poly_features_ is empty (which would break the corresponding transformer).
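One possible shape for such a check is sketched below. The split_columns helper is hypothetical and not part of scikit-learn; it assumes you would rather fail early with a clear message than let a sub-transformer break on an empty selection.

```python
def split_columns(columns, poly_features):
    """Hypothetical helper: split columns into (poly, other), refusing empty groups."""
    columns = list(columns)
    poly = list(poly_features)

    missing = [c for c in poly if c not in columns]
    if missing:
        raise ValueError(f"poly_features not found in X: {missing}")
    if not poly:
        raise ValueError("poly_features is empty; nothing to expand polynomially")

    # Keep the original column order for the remaining features.
    other = [c for c in columns if c not in poly]
    if not other:
        raise ValueError("no columns left to one-hot encode")
    return poly, other


poly_cols, other_cols = split_columns(["year", "odometer", "fuel"], ["year", "odometer"])
print(poly_cols, other_cols)
```

Inside fit this would be called as split_columns(X.columns, self.poly_features). Skipping the empty group instead of raising is an equally reasonable design if you want the transformer to tolerate that case.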

Finally, your custom estimator is just doing what a ColumnTransformer is meant to do. I think the only reason to prefer yours is if you need to search over which columns get which treatment; I don't think that's easy to do with a ColumnTransformer.

from sklearn.compose import ColumnTransformer

custom_poly = ColumnTransformer(
    transformers=[('poly', PolynomialFeatures(), ['year', 'odometer'])],
    remainder=OneHotEncoder(),
)

param_grid = {"cpf__poly__degree": [3, 4, 5]}
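For completeness, here is a sketch of how the ColumnTransformer variant might be wired into the pipeline and searched. The data frame is synthetic, the "fuel" column is a made-up categorical feature, and handle_unknown="ignore" is my own addition so that a category missing from a training fold does not raise at scoring time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Synthetic stand-in for the real data (assumption: column names as in the question).
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "year": rng.integers(2000, 2020, size=n).astype(float),
    "odometer": rng.uniform(10, 200, size=n),  # thousands of km, made up
    "fuel": rng.choice(["gas", "diesel", "electric"], size=n),
})
y = 0.5 * (X["year"] - 2000) - 0.01 * X["odometer"] + rng.normal(size=n)

custom_poly = ColumnTransformer(
    transformers=[("poly", PolynomialFeatures(), ["year", "odometer"])],
    # Every column not listed above is one-hot encoded.
    remainder=OneHotEncoder(handle_unknown="ignore"),
)

pipeline = Pipeline(steps=[("cpf", custom_poly), ("lin_reg", LinearRegression())])

# Step name, then transformer name, then parameter name.
param_grid = {"cpf__poly__degree": [1, 2, 3]}

search = GridSearchCV(pipeline, param_grid, cv=3, error_score="raise")
search.fit(X, y)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```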

Thank you Ben and Sanjar. I used your adaptation of my estimator and solved the problem with the different count of unique values in the categorical features (luckily, it was only one feature). — Davis Stöwer, Jan 22 '22 at 20:28
