I am doing PCA and I am interested in which original features were most important. Let me illustrate this with an example:

import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])
print(X)

Which outputs:

[[ 1.  -1.  -1.  -1. ]
 [ 1.  -2.  -1.  -1. ]
 [ 1.  -3.  -2.  -1. ]
 [ 1.   1.   1.  -1. ]
 [ 1.   2.   1.  -1. ]
 [ 1.   3.   2.  -0.5]]
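As a quick sanity check on the raw data, the per-feature variances can be computed directly (a small sketch using the same `X` as above):

```python
import numpy as np

X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])

# Population variance of each column: columns 1 and 4 barely vary.
print(np.var(X, axis=0))  # approximately [0.  4.667  2.  0.035]
```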

Intuitively, one could already say that features 1 and 4 are not very important due to their low variance. Let's apply PCA to this set:

pca = PCA(n_components=2)
pca.fit_transform(X)
comps = pca.components_

Output:

array([[ 0.        ,  0.8376103 ,  0.54436943,  0.04550712],
       [-0.        ,  0.54564656, -0.8297757 , -0.11722679]])

This output represents the importance of each original feature for each of the two principal components (see this for reference). In other words, for the first principal component, feature 2 is most important, then feature 3. For the second principal component, feature 3 looks most important.

The question is: which feature is most important, which is second most important, and so on? Can I use the components_ attribute for this? Or am I wrong, and is PCA not the correct method for such an analysis (should I use a feature selection method instead)?

RemiDav
Guido

1 Answer


The components_ attribute is not the right spot to look for feature importance on its own. The loadings in the two arrays (i.e. the two components PC1 and PC2) tell you how your original matrix is transformed into the component space (taken together, they form a rotation matrix). But they don't tell you how much each component contributes to describing the transformed feature space, so you don't know yet how to compare the loadings across the two components.
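To see that the rows of components_ are just directions (and hence not directly comparable across components without extra information), one can check that they are orthonormal. A minimal sketch, using the same X as in the question:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])
comps = PCA(n_components=2).fit(X).components_

# Each row is a unit-length direction, and the rows are mutually
# orthogonal, so comps @ comps.T is the 2x2 identity matrix.
print(np.allclose(comps @ comps.T, np.eye(2)))  # True
```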

However, the answer that you linked actually tells you what to use instead: the explained_variance_ratio_ attribute. This attribute tells you how much of the variance in your feature space is explained by each principal component:

In [5]: pca.explained_variance_ratio_
Out[5]: array([ 0.98934303,  0.00757996])

This means that the first principal component explains almost 99 percent of the variance. You know from components_ that PC1 has the highest loading for the second feature. It follows that feature 2 is the most important feature in your data space. Feature 3 is the next most important, as it has the second-highest loading in PC1.

In PC2, the absolute loadings are nearly swapped between feature 2 and feature 3. But as PC2 explains next to nothing of the overall variance, this can be neglected.
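One way to turn this reasoning into a single per-feature score is the weighting scheme floated in the comments below: multiply each absolute loading by the explained variance ratio of its component and sum over components. Note this is a heuristic rather than an established method — a sketch:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])
pca = PCA(n_components=2).fit(X)

# Weight each feature's absolute loading by the variance ratio of the
# component it appears in, then sum over the components.
importance = np.abs(pca.components_).T @ pca.explained_variance_ratio_
order = np.argsort(importance)[::-1]
print(order + 1)  # 1-based feature indices, most important first
```

On this data the score ranks feature 2 first and feature 3 second, matching the reading of the loadings above.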

Schmuddi
  • Could we give a measure for this feature importance of feature 2? Something like 0.9893 * 0.8376? – Guido Feb 28 '17 at 08:02
  • I've never seen anyone use the explained variance and the loadings for that in the way you describe it. What you're doing is basically weighing the loadings by the component's contributions. This is unusual, but it should work. – Schmuddi Feb 28 '17 at 17:41
  • Since you say it is unusual, I'm highly interested in other people's views on this issue – Guido Mar 08 '17 at 13:10
  • 2
    As this question doesn't appear to receive much attention here on SO, you might want to ask about this at https://stats.stackexchange.com (something along the lines of "Can you multiply the factor loadings of a PC by the explained variance of the PC to assess the importance of features in a PCA?"). I'd be interested to see what the knowledgeable people over there have to say about this. – Schmuddi Mar 08 '17 at 13:20
  • Thanks for the advice, the question is now also posted on https://stats.stackexchange.com/questions/266190/most-important-original-features-of-principle-component-analysis – Guido Mar 08 '17 at 13:39
  • @Guido can we please trouble you to update us with your findings, if any – shivam13juna Jan 24 '21 at 07:20