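Consider the following sklearn Pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TfidfVectorizer(),
    LinearRegression()
)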

My TfidfVectorizer is already fitted, so when I call pipeline.fit(X, y) I want only LinearRegression to be fitted, without refitting TfidfVectorizer.

I could apply the transformation in advance and fit LinearRegression on the transformed data, but my project has many transformers in a pipeline, some pretrained and some not. I am looking for a way to avoid writing another wrapper around sklearn estimators and to stay within a single Pipeline object.

To my mind, there should be a parameter on the estimator object that prevents refitting on .fit() when the object is already fitted.

Alexander

3 Answers

Look at the "memory" parameter. It caches the fitted transformers of a pipeline.

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html

pipeline = make_pipeline(
    TfidfVectorizer(),
    LinearRegression(),
    memory='cache_directory'
)
Ivan Reshetnikov
  • Thank you for the answer, but this is not what I am trying to accomplish. According to https://scikit-learn.org/stable/modules/compose.html#caching-transformers-avoid-repeated-computation, the cache is used only when the parameters and input data are identical to those the transformer was fitted with. In my case, I want to fit `LinearRegression` on **new data**, while `TfidfVectorizer` was fitted on other data and stays the same. The whole point is that I want to train a transformer once and reuse it later in pipelines on new data without refitting it. – Alexander Apr 30 '21 at 10:03

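You can access only the regressor by defining your pipeline with named steps. Named steps require Pipeline, since make_pipeline does not accept a steps argument:

from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer()),
    ('regressor', LinearRegression())
])

and then

pipeline['regressor']

should give you only the regressor.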
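
As a rough sketch of how step access could be applied to the question, the example below places an already fitted vectorizer in a Pipeline and fits only the regressor step by hand. The toy documents and targets are hypothetical, and the step names are the ones used above; this is one possible reading of the suggestion, not a built-in "skip refit" feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only.
old_docs = ["red apple", "green apple", "yellow banana"]
new_docs = ["red banana", "green banana", "yellow apple", "red apple"]
new_y = [1.0, 2.0, 3.0, 4.0]

# Fit the vectorizer once, before it goes into the pipeline.
vectorizer = TfidfVectorizer().fit(old_docs)
vocabulary_before = dict(vectorizer.vocabulary_)

pipeline = Pipeline(steps=[
    ("vectorizer", vectorizer),
    ("regressor", LinearRegression()),
])

# Do NOT call pipeline.fit here: that would refit the vectorizer.
# Instead, transform with the fitted step and fit only the regressor.
features = pipeline["vectorizer"].transform(new_docs)
pipeline["regressor"].fit(features, new_y)

# Prediction can still go through the whole pipeline.
predictions = pipeline.predict(["green apple"])
```

This keeps everything in one Pipeline object for prediction, but every fitting step has to be driven manually, so it does not combine well with tools that call pipeline.fit themselves.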

Rafa
  • But suppose you have a lot of estimators inside the pipeline: calling each of them separately is not very convenient. Also, if you want to optimize the pipeline using `GridSearchCV`, you cannot call each estimator's fit separately. – Alexander Apr 29 '21 at 13:15
  • Try [Voting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) if you need separate results. – Memphis Meng Apr 29 '21 at 16:28

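You can use this hack to fit a transformer only once. The wrapped transformer is fitted on the first data it sees and is only applied, never refitted, on every later call: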

from sklearn.preprocessing import FunctionTransformer

def fit_once(transformer):
    fitted = [False]

    def func(x):
        if not fitted[0]:
            transformer.fit(x)
            fitted[0] = True
        return transformer.transform(x)

    return FunctionTransformer(func)

pipeline = make_pipeline(
    fit_once(TfidfVectorizer()),
    LinearRegression()
)
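
If the transformer is already fitted before the pipeline is built, as in the question, a simpler variant of the same idea may be enough: wrap only its transform method. The sketch below is an assumption layered on top of the answer, not part of it, and the toy documents and targets are hypothetical. It relies on FunctionTransformer.fit not calling the wrapped function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data, for illustration only.
old_docs = ["red apple", "green apple", "yellow banana"]
new_docs = ["red banana", "green banana", "yellow apple", "red apple"]
new_y = [1.0, 2.0, 3.0, 4.0]

# Fit the vectorizer once, outside the pipeline.
vectorizer = TfidfVectorizer().fit(old_docs)
vocabulary_before = dict(vectorizer.vocabulary_)

# The pipeline only ever calls vectorizer.transform, never vectorizer.fit,
# so fitting the pipeline on new data leaves the vectorizer untouched.
pipeline = make_pipeline(
    FunctionTransformer(vectorizer.transform),
    LinearRegression(),
)
pipeline.fit(new_docs, new_y)

predictions = pipeline.predict(["green apple"])
```

Because pipeline.fit works as usual here, the whole object can be fitted in one call; only LinearRegression learns anything from the new data.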
Ivan Reshetnikov