How can I use PCA/SVD in Python for feature selection AND identification?

Question

I'm following "Principal component analysis in Python" to use PCA in Python, but I'm struggling to determine which features to choose (i.e., which of my columns/features have the highest variance).

When I use scipy.linalg.svd, it automatically sorts my Singular Values, so I can't tell which column they belong to.

Example code:

import numpy as np
from scipy.linalg import svd
M = [
     [1, 1, 1, 1, 1, 1],
     [3, 3, 3, 3, 3, 3],
     [2, 2, 2, 2, 2, 2],
     [9, 9, 9, 9, 9, 9]
]
M = np.transpose(np.array(M))
U,s,Vt = svd(M, full_matrices=False)
print(s)

Is there a different way to go about this without the Singular Values being sorted?

Update: It looks like this might not be possible, at least according to this post on the Matlab forums: http://www.mathworks.com/matlabcentral/newsreader/view_thread/241607. If anyone knows otherwise, let me know :)

Not sure I understand the question. *M = U S V^T*. Therefore, the largest singular value, `s[0]`, corresponds to the left singular vector `U[:,0]` and the right singular vector `Vt[0,:]`. — Steve Tjoa, Jan 08 '13 at 23:18
@SteveTjoa - I want to know which s[i] value maps to which M[j] vector, assuming that there is a 1-1 mapping. My goal is to do feature selection, but I also want to know which features I'm selecting. — Dolan Antenucci, Jan 08 '13 at 23:32
I now realize that there is no 1-1 mapping between the input and output of PCA. I've clarified this in my answer below. — Dolan Antenucci, Jan 11 '13 at 20:29

Dolan Antenucci · Accepted Answer · 2013-01-11T20:45:55.530

I was under the wrong impression that PCA does feature selection; it actually does feature extraction.

PCA creates a new set of features, each of which is a linear combination of the input features.
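To make that concrete, here is a minimal sketch using NumPy's SVD on made-up data (the array `X` and its values are purely illustrative, not taken from the question). Assuming the columns are mean-centered first, each new feature is a weighted sum of all the input columns, with the weights given by a row of `Vt`:

```python
import numpy as np

# Hypothetical data: 6 samples, 3 input features (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

# PCA works on mean-centered columns.
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions; s comes back sorted in descending order.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the data onto the principal directions to get the new features.
scores = Xc @ Vt.T

# The first new feature is a linear combination of ALL input columns,
# weighted by the first row of Vt.
first_component = sum(Vt[0, j] * Xc[:, j] for j in range(Xc.shape[1]))
```

So a singular value `s[i]` belongs to the new feature `scores[:, i]`, not to any single input column, which is why the sorting does not hide a column mapping.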

If you really wanted to do feature selection with PCA, you could look at the weights of the input features on the PCA-created features. For instance, the matplotlib.mlab.PCA class provides the weights in its Wt property (more on library):

from matplotlib.mlab import PCA
res = PCA(data)
print("weights of input vectors: %s" % res.Wt)
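If `matplotlib.mlab.PCA` is not available in your matplotlib version (as far as I know it was dropped from newer releases), the same kind of weights can be read off a plain NumPy SVD. The following is only a sketch under assumptions: the `data` array is synthetic, and ranking features by the absolute weight on the first component is one heuristic I'm choosing for illustration, not a standard feature-selection method.

```python
import numpy as np

# Hypothetical data: 50 samples, 4 features; column 2 is given a much
# larger spread on purpose so it dominates the first component.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4)) * np.array([1.0, 0.5, 10.0, 2.0])

# Center each column, then decompose.
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of total variance captured by each principal component.
explained = s**2 / np.sum(s**2)

# Row i of Vt holds the weights of the input features on component i.
weights_pc1 = Vt[0]

# Input feature indices ordered by how strongly they load on the first component.
ranking = np.argsort(-np.abs(weights_pc1))

print("explained variance ratio:", explained)
print("features ranked by |weight| on PC1:", ranking)
```

Note that this sketch only centers the data. If your features have different units, standardizing each column before the SVD is a common choice, and it will change the weights.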

It sounds like feature extraction is the intended way to use PCA, though.
