
I typically get PCA loadings like this:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_t = pca.fit(X).transform(X)
loadings = pca.components_

If I run PCA using a scikit-learn pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline(steps=[
    ('scaling', StandardScaler()),
    ('pca', PCA(n_components=2))
])
X_t = pipeline.fit_transform(X)

is it possible to get the loadings?

Simply trying loadings = pipeline.components_ fails:

AttributeError: 'Pipeline' object has no attribute 'components_'

(Also interested in extracting attributes like coef_ from pipelines.)

desertnaut
lmart999

2 Answers


Did you look at the documentation (http://scikit-learn.org/dev/modules/pipeline.html)? I feel it is pretty clear.

Update: in 0.21 you can use just square brackets:

pipeline['pca']

or indices

pipeline[1]

There are two ways to get to the steps in a pipeline, either using indices or using the string names you gave:

pipeline.named_steps['pca']
pipeline.steps[1][1]

This will give you the PCA object, on which you can get the components. With named_steps you can also use attribute access with a `.`, which allows autocompletion:

pipeline.named_steps.pca.<tab here gives autocomplete>
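Putting that together for the original question, a minimal sketch (using random example data purely for illustration) that pulls the loadings out of the fitted pipeline:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Example data (assumption): 100 samples, 5 features.
X = np.random.RandomState(0).rand(100, 5)

pipeline = Pipeline(steps=[
    ('scaling', StandardScaler()),
    ('pca', PCA(n_components=2))
])
X_t = pipeline.fit_transform(X)

# Reach into the fitted pipeline by step name to get the loadings:
loadings = pipeline.named_steps['pca'].components_
print(loadings.shape)  # (2, 5): one row per component
```

The same pattern works for other fitted attributes, e.g. if the final step were a linear model registered under the (hypothetical) name 'clf', you would use pipeline.named_steps['clf'].coef_.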

Andreas Mueller
  • Right, thanks. Didn't see that (use of `named_steps`) in the [doc here](http://scikit-learn.org/dev/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline). Appreciate that. – lmart999 Mar 03 '15 at 23:52
  • I would like to hijack this answer by adding that, if you have a `regr = TransformedTargetRegressor` over your pipeline then the syntax is not the same, instead you have to access the regressor using `regressor_` before you access the named steps i.e. `regr.regressor_.named_steps['pca'].components_`. – Ari Cooper-Davis Nov 11 '19 at 14:03
  • Weird that it isn't on the docs page, but it is present in the `user guide` section of those docs. – agent18 Jan 11 '21 at 13:17
  • @agent18 Where was it missing? Maybe open an issue (or better yet a PR) to sklearn to update the docs :) – Andreas Mueller Feb 20 '21 at 01:22

Using Neuraxle

Working with pipelines is simpler using Neuraxle. For instance, you can do this:

from neuraxle.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Create and fit the pipeline: 
pipeline = Pipeline([
    StandardScaler(),
    PCA(n_components=2)
])
pipeline, X_t = pipeline.fit_transform(X)

# Get the components: 
pca = pipeline[-1]
components = pca.components_

You can access your PCA in any of these three ways, as you wish:

  • pipeline['PCA']
  • pipeline[-1]
  • pipeline[1]

Neuraxle is a pipelining library built on top of scikit-learn to take pipelines to the next level. It makes it easy to manage spaces of hyperparameter distributions, nested pipelines, saving and reloading, REST API serving, and more. It is also designed to work with deep learning algorithms and to allow parallel computing.

Nested pipelines:

You could have pipelines within pipelines as below.

from neuraxle.base import Identity

# Create and fit the pipeline: 
pipeline = Pipeline([
    StandardScaler(),
    Identity(),
    Pipeline([
        Identity(),  # Note: an Identity step is a step that does nothing. 
        Identity(),  # We use it here for demonstration purposes. 
        Identity(),
        Pipeline([
            Identity(),
            PCA(n_components=2)
        ])
    ])
])
pipeline, X_t = pipeline.fit_transform(X)

Then you'd need to do this:

# Get the components: 
pca = pipeline["Pipeline"]["Pipeline"][-1]
components = pca.components_
Guillaume Chevalier