100

I'm trying to recover, from a PCA done with scikit-learn, which features are selected as relevant.

A classic example with the IRIS dataset:

import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)

This returns

In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452,  0.23030523])

How can I recover which two features account for these two explained variances in the dataset? Said differently, how can I get the indices of these features in iris.feature_names?

In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
starball
sereizam

5 Answers

114

This information is included in the pca attribute components_. As described in the documentation, pca.components_ is an array of shape [n_components, n_features], so to see how the components are linearly related to the different features you can do the following.

Note: each coefficient represents the correlation between a particular component and feature.

import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df),columns = df.columns) 

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump components relations with features:
print(pd.DataFrame(pca.components_,columns=data_scaled.columns,index = ['PC-1','PC-2']))

      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
PC-1           0.522372         -0.263355           0.581254          0.565611
PC-2          -0.372318         -0.925556          -0.021095         -0.065416

IMPORTANT: As a side comment, note that the sign of a PCA component does not affect its interpretation, since the sign does not affect the variance contained in the component. Only the relative signs of the features forming a PCA dimension matter. In fact, if you run the PCA code again, you might get the dimensions with the signs inverted. For an intuition about this, think about a vector and its negative in 3-D space: both represent the same direction. Check this post for further reference.
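As a minimal sketch of that sign point (reusing the data_scaled and pca objects fitted above), flipping a component's sign leaves the captured variance unchanged:

import numpy as np

# project the scaled data onto PC-1 and onto its negated counterpart
scores = data_scaled.values @ pca.components_[0]
scores_flipped = data_scaled.values @ (-pca.components_[0])

# the variance along the axis is identical; only the sign of the scores flips
print(np.isclose(scores.var(), scores_flipped.var()))  # True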

Kermit
Rafa
  • Components are actually combinations of features, so any particular feature is (at certain degree) correlated with different components.... – Rafa Dec 05 '16 at 12:09
  • 5
    So say you want to know which original feature was most important, should you just take the absolute values and sum them? What I mean is, starting from the last line from the answer: pd.DataFrame(pca.components_,columns=data_scaled.columns,index = ['PC-1','PC-2']).abs().sum(axis=0), which results in these values: 0.894690 1.188911 0.602349 0.631027. Could we hereby say that sepal width was most important, followed by sepal length? – Guido Feb 23 '17 at 15:17
  • 15
    To understand which features are important you need to pay attention to the correlations. For example, sepal width and PC-2 are strongly (inversely) correlated, since the correlation coefficient is -0.92. On the other hand, petal length and PC-2 are not correlated at all, since the correlation coefficient is -0.02. So PC-2 grows as sepal width decreases, and PC-2 is independent of changes in petal length. That is, for PC-2 sepal width is important while petal length is not. You can conduct the same analysis for the other variables, bearing in mind that the correlation coefficient lies in the interval [-1, 1]. – Rafa Feb 24 '17 at 06:49
  • Useful answer! For my purposes I wanted the dataframe pivoted so that the components are columns. I used `pd.DataFrame(pca.components_.T, index=data_scaled.columns)` – Laura Aug 22 '18 at 19:03
  • If you want to get the `single most important feature name` on a specific PC (or on all PCs) see my answer at the end of this page. – seralouk Jun 23 '19 at 10:18
  • I read that exact part of the documentation earlier today and had no idea that's what it could be used for. – Kermit Sep 26 '20 at 22:51
  • Keep in mind, all PCs are not equal! If you're looking for the 'most important' feature, make sure to also take into account the explained variance of each PC (i.e. you may find it beneficial to weigh each PC's contribution to *actual* dataset variance by `pca.explained_variance_ratio_`) – ntjess Nov 18 '21 at 02:37
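A minimal sketch of the weighting idea from the comments above, reusing the fitted pca and data_scaled from the answer (the aggregation itself is only a heuristic, not something prescribed by scikit-learn):

import numpy as np
import pandas as pd

# absolute loadings, shape (n_components, n_features)
loadings = np.abs(pca.components_)

# weight each component's loadings by the variance it explains, then sum per feature
weighted_importance = pca.explained_variance_ratio_ @ loadings

print(pd.Series(weighted_importance, index=data_scaled.columns).sort_values(ascending=False))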
61

Edit: as others have commented, you may get the same values from the .components_ attribute.


Each principal component is a linear combination of the original variables:

PC_j = beta_1j * X_1 + beta_2j * X_2 + ... + beta_pj * X_p

where the X_i are the original variables and the beta_ij are the corresponding weights, the so-called coefficients.

To obtain the weights, you may simply pass an identity matrix to the transform method:

>>> import numpy as np
>>> i = np.identity(df.shape[1])  # identity matrix
>>> i
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

>>> coef = pca.transform(i)
>>> coef
array([[ 0.5224, -0.3723],
       [-0.2634, -0.9256],
       [ 0.5813, -0.0211],
       [ 0.5656, -0.0654]])

Each column of the coef matrix above shows the weights of the linear combination that produces the corresponding principal component:

>>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
                    PC-1   PC-2
sepal length (cm)  0.522 -0.372
sepal width (cm)  -0.263 -0.926
petal length (cm)  0.581 -0.021
petal width (cm)   0.566 -0.065

[4 rows x 2 columns]

For example, the above shows that the second principal component (PC-2) is mostly aligned with sepal width, which has the highest weight, 0.926 in absolute value.

You can confirm that each coefficient vector has norm 1.0 (the principal axes returned by PCA are unit vectors):

>>> np.linalg.norm(coef,axis=0)
array([ 1.,  1.])

One may also confirm that the principal components can be calculated as the dot product of the normalized data and the above coefficients:

>>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))
True

Note that we need to use numpy.allclose instead of the regular equality operator because of floating-point precision error.
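As a quick sanity check (a small sketch reusing the objects above), this coef matrix is simply the transpose of pca.components_: transform only subtracts the fitted mean before applying the rotation, and the normalized data has mean (approximately) zero.

>>> np.allclose(coef, pca.components_.T)
True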

behzad.nouri
  • 3
    Awesome and exhaustive answer, thank you very much ! – sereizam Apr 10 '14 at 11:56
  • 5
    There's no need for that identity matrix: your `coef` is the same as `pca.components_.T`. scikit-learn estimators always put their learned parameters in public attributes. – Fred Foo Apr 11 '14 at 09:21
  • 5
    Why not directly use `pca.components_`? – exAres Jun 02 '16 at 11:24
  • 2
    Using the identity matrix doesn't work as the inverse transform function adds the empirical mean of each feature. The result gives equal weight (coefficients) to all original variables. (See this [answer](http://stackoverflow.com/a/22141821/3919475)). By using `pca.components_`, you get the right answer. – Rahul Murmuria Jun 17 '16 at 03:21
49

The way this question is phrased reminds me of a misunderstanding of Principal Component Analysis when I was first trying to figure it out. I'd like to go through it here in the hope that others won't spend as much time on a road to nowhere as I did before the penny finally dropped.

The notion of “recovering” feature names suggests that PCA identifies those features that are most important in a dataset. That’s not strictly true.

PCA, as I understand it, identifies the directions of greatest variance in a dataset, and can then use this property to create a smaller dataset with minimal loss of descriptive power. The advantage of a smaller dataset is that it requires less processing power and should contain less noise. But the directions of greatest variance are not the "best" or "most important" features of a dataset, insofar as such concepts can be said to exist at all.

To bring that theory into the practicalities of @Rafa’s sample code above:

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df),columns = df.columns) 

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

consider the following:

post_pca_array = pca.fit_transform(data_scaled)

print(data_scaled.shape)
(150, 4)

print(post_pca_array.shape)
(150, 2)

In this case, post_pca_array has the same 150 rows of data as data_scaled, but its four columns have been reduced to two.

The critical point here is that the two columns – or components, to be terminologically consistent – of post_pca_array are not the two “best” columns of data_scaled. They are two new columns, determined by the algorithm behind sklearn.decomposition’s PCA module. The second column, PC-2 in @Rafa’s example, is informed by sepal_width more than any other column, but the values in PC-2 and data_scaled['sepal_width'] are not the same.

As such, while it's interesting to find out how much each column of the original data contributed to the components of a post-PCA dataset, the notion of "recovering" column names is a little misleading, and it certainly misled me for a long time. The only situation in which there would be a match between post-PCA and original columns would be if the number of principal components were set equal to the number of columns in the original. However, there would be no point in using the same number of columns, because the data would not have changed. You would only have gone there to come back again, as it were.
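A short sketch of that last point, reusing data_scaled from above (pca_full is just an illustrative name): keeping as many components as there are original columns is only a change of basis, and inverse_transform recovers the scaled data exactly.

from sklearn.decomposition import PCA
import numpy as np

# keep all four components: no reduction, just a rotation of the feature space
pca_full = PCA(n_components=4)
scores_full = pca_full.fit_transform(data_scaled)

# rotating back recovers the original (scaled) data, up to floating-point error
print(np.allclose(pca_full.inverse_transform(scores_full), data_scaled))  # True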

amunnelly
21

The important features are the ones that influence the components the most and thus have a large absolute value (coefficient/loading) on the component.

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs= model.components_.shape[0]

# get the index of the most important feature on EACH component i.e. largest absolute value
# using LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']

# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# using LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i+1): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(sorted(dic.items()))
print(df)

This prints:

     0  1
 0  PC1  e
 1  PC2  d

Conclusion/Explanation:

So on PC1 the feature named e is the most important, and on PC2 it is d.
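If you want more than the single top feature per component, a small variation of the same idea ranks all features by absolute loading (sketch only; k is an arbitrary choice here):

# rank the features on each component by absolute loading, largest first
k = 2  # number of top features to report per component (arbitrary)
for i in range(n_pcs):
    top_idx = np.abs(model.components_[i]).argsort()[::-1][:k]
    print('PC{}: {}'.format(i + 1, [initial_feature_names[j] for j in top_idx]))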

seralouk
6

Given your fitted estimator pca, the components are to be found in pca.components_, which represent the directions of highest variance in the dataset.

eickenberg