
Let's assume I have a dataset with 5 features, and I want to use features 1, 2, and 5 for training (skipping features 3 and 4). I don't want to change the dataset, since I expect the same 5 features to be fed to the model during prediction. I just want the first step of the preprocessing pipeline to drop features 3 and 4.

Furthermore, I want to be able to pickle/joblib the pipeline object at the end of training, without the pickled object depending on any other object or code to load and run. Therefore, I do not want to use FunctionTransformer, since I would have to write a custom function (to be passed to this transformer) and then pickle and ship it alongside the pickled model object.

Is there a good way to do it in scikit-learn?

happyhuman

2 Answers


You can create your own transformer object that performs the column selection for you. You'll pass the columns you want to extract as an argument when you put it within your pipeline. By being in your pipeline, it'll get pickled with the rest of your steps.

In order to include this custom transformer in a pipeline, your class needs to inherit from two base sklearn classes: TransformerMixin and BaseEstimator. Inheriting from TransformerMixin gives you the fit_transform method for free, so long as you define fit and transform yourself. Inheriting from BaseEstimator provides get_params and set_params. Since the fit method doesn't need to do anything but return the object itself, all you really need to do is define the transform method.

Here's an example where you could pass in a list of column names you'd want to extract, assuming your data (X) is a pandas DataFrame.

from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Selects a subset of DataFrame columns by name."""

    def __init__(self, feature_names):
        self._feature_names = feature_names

    def fit(self, X, y=None):
        # Nothing to learn; just return the transformer itself.
        return self

    def transform(self, X, y=None):
        # Assumes X is a pandas DataFrame.
        return X[self._feature_names]

Now that you've got the transformer, you can include it in your pipeline, which can get pickled as you've requested.
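For concreteness, here is a minimal end-to-end sketch of that: build a pipeline with the transformer, fit it, pickle it, and restore it. The feature names (f1 through f5), the toy data, and the StandardScaler step are made up for illustration; the point is just that the pipeline round-trips through pickle in one piece.

```python
import pickle

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Selects a subset of DataFrame columns by name."""

    def __init__(self, feature_names):
        self._feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self._feature_names]


# Toy dataset with 5 features; we only want f1, f2 and f5.
X = pd.DataFrame({
    'f1': [1.0, 2.0], 'f2': [3.0, 4.0], 'f3': [5.0, 6.0],
    'f4': [7.0, 8.0], 'f5': [9.0, 10.0],
})

pipe = Pipeline(steps=[
    ('selector', FeatureSelector(['f1', 'f2', 'f5'])),
    ('scaler', StandardScaler()),
])
pipe.fit(X)

# The whole pipeline, selector included, pickles as one object.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)
print(restored.transform(X).shape)  # (2, 3) -- only the 3 selected columns
```

At prediction time you still feed the restored pipeline all 5 features, and the selector step drops the unwanted ones before they reach the model.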

As for your requirement to not use FunctionTransformer, I'm assuming you saw the example here where they define all_but_first_column globally. With the FeatureSelector class defined above, you could always move something like all_but_first_column to within that class as another method.

Scott McAllister
  • Thanks for your solution. The issue however is that I need to avoid writing any custom Python code and achieve this completely using scikit-learn's library (because joblib or pickle will not properly pickle the custom code if you pickle the pipeline object. See: https://github.com/scikit-learn/scikit-learn/issues/12903). I also think scikit-learn should absolutely add a transformer like yours in their native library (choosing features by their names, or by their positions). – happyhuman Jun 07 '19 at 16:43
    I think the root problem raised in that [issue](github.com/scikit-learn/scikit-learn/issues/12903) can be solved by using `dill` to pickle your sklearn pipeline. I've personally done it and there's even a SO post by the `dill` author outlining the pros and cons [here](https://stackoverflow.com/questions/32757656/what-are-the-pitfalls-of-using-dill-to-serialise-scikit-learn-statsmodels-models). If you give me some time, I can update my answer with an end-to-end example using a toy dataset like iris or something. – Scott McAllister Jun 10 '19 at 12:54

For future reference, there's a workaround to perform this task using feature_selection.ColumnSelector from the mlxtend package. It takes the indices of the columns to be selected, as follows:

from mlxtend.feature_selection import ColumnSelector
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

...

pipeline = Pipeline(steps=[
    # 0-based positions of features 1, 2 and 5
    ('selector', ColumnSelector(cols=(0, 1, 4))),
    ('kmeans', KMeans()),
])

...

Refer to the docs for more information.

josescuderoh