sklearn pipelines with fit_transform or predict objects instead of fit objects

Question

This example on the sklearn website and this answer about sklearn pipelines on SO use and talk only about the .fit() or .fit_transform() methods in Pipelines.

But how do I use the .predict or .transform methods in Pipelines? Let's say I have pre-processed my train data, searched for the best hyper-parameters and trained a LightGBM model. I would now like to predict on new data. Instead of doing all the aforementioned steps manually, I want to run them one after another, according to the definition:

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

But I only want to call the .transform methods on my validation (or test) data, along with some more functions (or classes) that take a pandas Series (or DataFrame, or numpy array) and return a processed one, and then finally call the .predict method of my LightGBM model, which would use the hyper-parameters I already have.

I currently have nothing, since I don't know how to properly include methods of classes (like StandardScaler_instance.transform()) and other such methods.

How do I do this or what have I missed?

Kim Tang · Accepted Answer · 2020-09-17T09:51:13.900

You have to build a pipeline that includes the LightGBM model, and then train the whole pipeline on your (pre-processed) train data.

In code, it could look like this, for example:

import lightgbm
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create some train and test data
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Define pipeline with scaler and lightgbm model
pipe = Pipeline([('scaler', StandardScaler()), ('lightgbm', lightgbm.LGBMClassifier())])

# Train pipeline
pipe.fit(X_train, y_train)

# Make predictions with pipeline (with lightgbm)
print("Predictions:", pipe.predict(X_test))

# Evaluate pipeline performance
print("Performance score:", pipe.score(X_test, y_test))

Output:

Predictions: [1 0 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 1 1 0 1 0 0]
Performance score: 0.84

So to answer your questions:

But how do I use the .predict or .transform methods in Pipelines?

You don't have to call .transform yourself, as the pipeline automatically applies the supplied transformers to your input data. That's why the documentation says:

Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods.

You can use .predict as shown in the code example with your test data.
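A minimal sketch of what pipe.predict does under the hood, assuming a fitted two-step pipeline. LogisticRegression is used here only as a stand-in for lightgbm.LGBMClassifier so that the snippet needs nothing beyond scikit-learn, and the step names "scaler" and "model" are arbitrary choices for this illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LogisticRegression is a stand-in (assumption) for lightgbm.LGBMClassifier
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# One call: the pipeline transforms X_test with every intermediate step,
# then calls predict on the final estimator
pipeline_predictions = pipe.predict(X_test)

# The same thing done by hand with the fitted steps
X_test_scaled = pipe.named_steps["scaler"].transform(X_test)
manual_predictions = pipe.named_steps["model"].predict(X_test_scaled)

print(np.array_equal(pipeline_predictions, manual_predictions))  # prints True
```

Note that the scaler is only fitted during pipe.fit; at prediction time the pipeline reuses the statistics learned from the train data and never refits on the test data.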

Instead of the StandardScaler I used in this example, you can give the pipeline your own custom transformer. It has to implement .fit() and .transform() methods that the pipeline can call, and its output needs to match the input the lightgbm model expects.
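As a rough sketch of such a custom transformer: the class below, ClipOutliers, is a hypothetical example written for this illustration and is not part of scikit-learn. It learns per-column quantiles in .fit() and clips to them in .transform(). LogisticRegression again stands in for the LightGBM model so the snippet only depends on scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to quantiles learned in fit."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn the clipping bounds from the training data only
        X = np.asarray(X, dtype=float)
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self  # fit must return self so the pipeline can chain calls

    def transform(self, X):
        # Apply the learned bounds; output shape matches the input shape
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)


X, y = make_classification(random_state=0)

# LogisticRegression is a stand-in (assumption) for lightgbm.LGBMClassifier
pipe = Pipeline([("clip", ClipOutliers()), ("model", LogisticRegression())])
pipe.fit(X, y)

predictions = pipe.predict(X)
print(predictions[:5])
```

For a stateless function that has nothing to learn in .fit(), wrapping it in sklearn.preprocessing.FunctionTransformer is usually less code than writing a class.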

Update

You can then provide arguments for different steps of the pipeline as explained in the documentation here:

**fit_params : dict of string -> object
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
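A small sketch of the s__p naming convention, under a few assumptions: LogisticRegression stands in for the LightGBM model, the step is named "model", and the sample weights are made up for the illustration. Constructor arguments (hyper-parameters) go through set_params, while arguments of the step's fit method go through pipe.fit, both using the same prefix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)

# LogisticRegression is a stand-in (assumption) for lightgbm.LGBMClassifier
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# Hyper-parameters: "<step>__<param>" sets a constructor argument of that step
pipe.set_params(model__C=0.5)

# Hypothetical weights for illustration: count class 1 twice as much
weights = np.where(y == 1, 2.0, 1.0)

# Fit parameters: "model__sample_weight" is forwarded as sample_weight
# to the fit method of the step named "model" only
pipe.fit(X, y, model__sample_weight=weights)

print(pipe.named_steps["model"].C)
print(pipe.predict(X)[:5])
```

With the pipeline from the answer, where the step is named 'lightgbm', the keys would presumably look like lightgbm__eval_set or lightgbm__eval_metric. One caveat: fit parameters are handed to the step unchanged, so a validation set passed this way is not run through the earlier transformers and would need to be transformed beforehand.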

Thank you so much for your time and effort. I have a doubt though. What if I have objects whose constructors and methods take arguments? For example, I create a lightgbm instance with some arguments, which I presume can be included in the Pipeline you have shown above. But how do I pass arguments to the fit method, like validation sets (not the test set) and my custom evaluation metric? – Naveen Reddy Marthala, Sep 17 '20 at 09:46
I have this doubt because I presume pipe.fit() doesn't accept all those arguments, and I also wonder what happens if there are other objects whose .fit() methods take different arguments. Is there any way I can specify different arguments for each step of the pipeline during .fit and .predict or .transform? – Naveen Reddy Marthala, Sep 17 '20 at 09:46
I meant to ask: is there any way I can have a Pipeline use different arguments for different steps, for both the fit method and the predict or transform method, if applicable? – Naveen Reddy Marthala, Sep 17 '20 at 09:48
Yes, you can use different arguments for each step in the pipeline. Have a look at the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline.fit) to see how you can provide the arguments, for instance during the fit() method call. I updated my answer with that information too. – Kim Tang, Sep 17 '20 at 09:49
I had checked the documentation right before writing those comments. The .fit method there wasn't of much help. – Naveen Reddy Marthala, Sep 17 '20 at 09:55
