85

I got this from the sklearn webpage:

  • Pipeline: Pipeline of transforms with a final estimator

  • make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.

But I still do not understand when I have to use each one. Can anyone give me an example?

Aizzaac

2 Answers

139

The only difference is that make_pipeline generates names for steps automatically.

Step names are needed, e.g., if you want to use a pipeline with model selection utilities such as GridSearchCV. With grid search you need to specify parameters for the various steps of a pipeline:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)  # X, y: your training data

compare it with make_pipeline:

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)

So, with Pipeline:

  • names are explicit, you don't have to figure them out if you need them;
  • names don't change if you swap the estimator/transformer used in a step, e.g. if you replace LogisticRegression() with LinearSVC() you can still use clf__C (see the sketch right after this list).
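
For instance, a minimal sketch of an explicit name surviving an estimator swap (the 'vec'/'clf' names and the C value are just illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# same step name, different estimator: the key 'clf__C' keeps working
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LinearSVC())])
pipe.set_params(clf__C=10)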

make_pipeline:

  • shorter and arguably more readable notation;
  • names are auto-generated using a straightforward rule (the lowercased class name of the estimator), as shown below.
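
You can always inspect the generated names via pipe.steps; a minimal sketch:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
print([name for name, _ in pipe.steps])
# ['countvectorizer', 'logisticregression']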

When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.

Mikhail Korobov
  • Could you tell me where it is documented that the name of `LogisticRegression()`'s estimator is `logisticregression`? I had to set a grid search for `OneVsRestClassifier(LinearSVC())` but I don't know what name refers to it. – KubiK888 Sep 14 '19 at 18:00
  • @KubiK888 it is documented at https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html - "their names will be set to the lowercase of their types automatically" – Mikhail Korobov Sep 16 '19 at 13:38
  • But what about `OneVsRestClassifier(LinearSVC())`? I have tried all of the following: `'onevsrestclassifier_linearsvc__C'`, `'onevsrestclassifier_linearsvc_estimator__C'`, `'onevsrestclassifier__C'`, `'linearsvc__C'`, `'onevsrestclassifier__linearsvc__C'`, `'onevsrestclassifier-linearsvc__C'`, `'estimator__C'`; they all give me `Check the list of available parameters with "estimator.get_params().keys()"`. – KubiK888 Sep 16 '19 at 15:34
2

If we look at the source code, make_pipeline() creates a Pipeline object, so the two are equivalent. As mentioned by @Mikhail Korobov, the only difference is that make_pipeline() doesn't accept estimator names; they are set to the lowercase of their types. In other words, type(estimator).__name__.lower() is used to create estimator names (source). So it's really just a simpler way of building a pipeline.
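
As a quick sanity check of that naming rule (a minimal sketch; as far as I can tell the rule is implemented by the private _name_estimators helper in sklearn/pipeline.py):

from sklearn.linear_model import LogisticRegression

# the auto-generated step name is just the lowercased class name
print(type(LogisticRegression()).__name__.lower())  # logisticregression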

On a related note, you can use get_params() to retrieve the parameter names, which is useful when you want to know the keys to pass to GridSearchCV(). The parameter names are created by concatenating the estimator names with their kwargs, recursively. For example, max_iter of a LogisticRegression() is stored as 'logisticregression__max_iter', and the C parameter of the inner estimator in OneVsRestClassifier(LogisticRegression()) as 'onevsrestclassifier__estimator__C'; the latter because, written out with kwargs, it is OneVsRestClassifier(estimator=LogisticRegression()).
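
A minimal sketch of that nested case (only the key lookup here is mine; the key itself follows the rule above):

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

ovr_pipe = make_pipeline(OneVsRestClassifier(LogisticRegression()))
# the inner C is reachable through the 'estimator' kwarg of OneVsRestClassifier
print('onevsrestclassifier__estimator__C' in ovr_pipe.get_params())  # True

The full parameter list for a PCA + LogisticRegression pipeline looks like this: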

from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification()
pipe = make_pipeline(PCA(), LogisticRegression())

print(pipe.get_params())

# {'memory': None,
#  'steps': [('pca', PCA()), ('logisticregression', LogisticRegression())],
#  'verbose': False,
#  'pca': PCA(),
#  'logisticregression': LogisticRegression(),
#  'pca__copy': True,
#  'pca__iterated_power': 'auto',
#  'pca__n_components': None,
#  'pca__n_oversamples': 10,
#  'pca__power_iteration_normalizer': 'auto',
#  'pca__random_state': None,
#  'pca__svd_solver': 'auto',
#  'pca__tol': 0.0,
#  'pca__whiten': False,
#  'logisticregression__C': 1.0,
#  'logisticregression__class_weight': None,
#  'logisticregression__dual': False,
#  'logisticregression__fit_intercept': True,
#  'logisticregression__intercept_scaling': 1,
#  'logisticregression__l1_ratio': None,
#  'logisticregression__max_iter': 100,
#  'logisticregression__multi_class': 'auto',
#  'logisticregression__n_jobs': None,
#  'logisticregression__penalty': 'l2',
#  'logisticregression__random_state': None,
#  'logisticregression__solver': 'lbfgs',
#  'logisticregression__tol': 0.0001,
#  'logisticregression__verbose': 0,
#  'logisticregression__warm_start': False}

# use the params from above to construct param_grid
param_grid = {'pca__n_components': [2, None], 'logisticregression__C': [0.1, 1]}
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)

best_score = gs.score(X, y)

Circling back to Pipeline vs make_pipeline: Pipeline gives you more flexibility in naming parameters, but if you name each estimator using the lowercase of its type, then Pipeline and make_pipeline will both have the same params and steps attributes.

pca = PCA()
lr = LogisticRegression()
make_pipe = make_pipeline(pca, lr)
pipe = Pipeline([('pca', pca), ('logisticregression', lr)])

make_pipe.get_params() == pipe.get_params()   # True
make_pipe.steps == pipe.steps                 # True
cottontail