You can create your own transformer object that performs the column selection for you. You pass the columns you want to extract as an argument when you add it to your pipeline. Because it is a step in the pipeline, it gets pickled along with the rest of your steps.
To include this custom transformer, your class needs to inherit from two base sklearn classes: TransformerMixin and BaseEstimator. Inheriting from TransformerMixin gives you the fit_transform method so long as you define fit and transform yourself. Inheriting from BaseEstimator provides get_params and set_params. Since the fit method doesn't need to do anything but return the object itself, all you really need to write is the transform method.
Here's an example where you pass in a list of the column names you want to extract, assuming your data (X) is a pandas DataFrame.
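from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        # Store the argument under its own name: get_params() and clone()
        # look up each __init__ parameter as an attribute of that name.
        self.feature_names = feature_names

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained.
        return self

    def transform(self, X, y=None):
        # X is assumed to be a pandas DataFrame.
        return X[self.feature_names]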
Now that you've got the transformer, you can include it in your pipeline, which can get pickled as you've requested.
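For instance, a minimal sketch of what that could look like. The toy DataFrame, its column names, and the choice of LogisticRegression are my own assumptions for illustration, not something from your question. Note that this version stores the argument as feature_names (same name as the __init__ parameter) so that get_params can find it.

```python
import pickle

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Select a fixed list of columns from a pandas DataFrame."""

    def __init__(self, feature_names):
        # Same name as the argument so get_params() can look it up.
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.feature_names]


# Toy data: column names and values are made up for illustration.
X = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63],
    "income": [28, 52, 61, 75, 39, 80],
    "noise": [5, 1, 4, 2, 3, 0],
})
y = [0, 0, 1, 1, 0, 1]

pipeline = Pipeline([
    ("select", FeatureSelector(["age", "income"])),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# Round-trip through pickle; the selector travels with the other steps.
restored = pickle.loads(pickle.dumps(pipeline))
selected_columns = list(restored.named_steps["select"].transform(X).columns)
same_predictions = bool((restored.predict(X) == pipeline.predict(X)).all())
```

One thing to keep in mind: pickle stores the class by reference, so wherever you load the pipeline, FeatureSelector has to be importable from the same module path it had when you dumped it.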
As for your requirement to not use FunctionTransformer, I'm assuming you saw the example here where they define all_but_first_column globally. With the FeatureSelector class defined above, you could always move something like all_but_first_column into that class as another method.
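A rough sketch of that idea, under my own assumptions: here feature_names defaults to None, and in that case transform falls back to an all_but_first_column method. The default, the fallback behavior, and the toy data are all hypothetical; adapt them to whatever your pipeline actually needs.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names=None):
        # None is a hypothetical sentinel meaning "drop the first column".
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def all_but_first_column(self, X):
        # Positional slice: keep every column except the first one.
        return X.iloc[:, 1:]

    def transform(self, X, y=None):
        if self.feature_names is None:
            return self.all_but_first_column(X)
        return X[self.feature_names]


# Toy data, made up for illustration.
X = pd.DataFrame({"id": [1, 2], "a": [0.1, 0.2], "b": [3, 4]})
dropped_first = FeatureSelector().fit_transform(X)
picked = FeatureSelector(["b"]).fit_transform(X)
```

Since the helper lives on the class rather than at module level, it gets pickled with the transformer and there is no global function to keep track of.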