I'm trying to perform a GridSearchCV on a pipeline with a custom transformer. The transformer enriches the features "year" and "odometer" polynomially and one hot encodes the rest of the features. The ML model is a simple linear regression model.
custom transformer code:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
class custom_poly_features(TransformerMixin, BaseEstimator):
def __init__(self, degree = 2, poly_features = ['year', 'odometer']):
self.degree_ = degree
self.poly_features_ = poly_features
def fit(self, X, y=None):
# Return the classifier
return self
def transform(self, X, y=None):
poly_feat = PolynomialFeatures(degree=self.degree_)
OneHot = OneHotEncoder(sparse=False)
not_poly_features = list(set(X.columns) - set(self.poly_features_))
poly = poly_feat.fit_transform(X[self.poly_features_].to_numpy())
poly = np.hstack([poly, OneHot.fit_transform(X[not_poly_features].to_numpy())])
return poly
def get_params(self, deep=True):
return {"degree": self.degree_, "poly_features": self.poly_features_}
pipeline & gridsearch code:
#create pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
poly_pipeline = Pipeline(steps=[("cpf", custom_poly_features()), ("lin_reg", LinearRegression(n_jobs=-1))])
#perform gridsearch
from sklearn.model_selection import GridSearchCV
param_grid = {"cpf__degree": [3, 4, 5]}
search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3)
search.fit(X_train_ordinal, y_train)
The custom transformer itself works fine and the pipeline also works (although the score is not great, but that is not the topic here).
poly_pipeline.fit(X_train, y_train).score(X_test, y_test)
Output:
0.543546844381771
However, when I perform the gridsearch, the scores are all nan values:
search.cv_results_
Output:
{'mean_fit_time': array([4.46928191, 4.58259885, 4.55605125]),
'std_fit_time': array([0.18111937, 0.03305779, 0.02080789]),
'mean_score_time': array([0.21119197, 0.13816587, 0.11357466]),
'std_score_time': array([0.09206233, 0.02171508, 0.02127906]),
'param_custom_poly_features__degree': masked_array(data=[3, 4, 5],
mask=[False, False, False],
fill_value='?',
dtype=object),
'params': [{'custom_poly_features__degree': 3},
{'custom_poly_features__degree': 4},
{'custom_poly_features__degree': 5}],
'split0_test_score': array([nan, nan, nan]),
'split1_test_score': array([nan, nan, nan]),
'split2_test_score': array([nan, nan, nan]),
'mean_test_score': array([nan, nan, nan]),
'std_test_score': array([nan, nan, nan]),
'rank_test_score': array([1, 2, 3])}
Does anyone know what the problem is? The transformer and the pipeline work fine on their own after all.