
I need to use PCA to identify the dimensions with the highest variance in a certain set of data. I'm using scikit-learn's PCA to do it, but I can't tell from the output of the PCA method which components of my data have the highest variance. Keep in mind that I don't want to eliminate those dimensions, only identify them.

My data is organized as a matrix with 150 rows, each with 4 dimensions. I'm doing it as follows:

import sklearn.decomposition

pca = sklearn.decomposition.PCA()  # default: keep all 4 components
pca.fit(data_matrix)

When I print pca.explained_variance_ratio_, it outputs an array of variance ratios ordered from highest to lowest, but it doesn't tell me which dimensions of the data they correspond to (I've tried changing the order of the columns in my matrix, and the resulting variance ratio array was the same).

Printing pca.components_ gives me a 4x4 matrix (I kept the original number of components as the argument to PCA) with values whose meaning I can't work out. According to scikit-learn's documentation, they should be the components with the maximum variance (the eigenvectors, perhaps?), but there is no sign of which dimensions those values refer to.

Transforming the data doesn't help either, because the dimensions are changed in such a way that I can't tell which ones they were originally.

Is there any way I can get this information with scikit-learn's PCA? Thanks

Alberto A
  • The first row of ``components_`` is the direction of maximum variance, as the documentation states. I am not entirely sure what is unclear about that. The entries in ``explained_variance_ratio_`` correspond to the rows of ``components_`` (see the sketch after these comments). How do you mean "no sign of which dimension those values refer to"? – Andreas Mueller Mar 13 '13 at 11:01
  • Well, my problem is this: given that I have 4 dimensions in my data and I want to keep only the 2 dimensions with the highest variance, how do I know which dimensions of my data would have been kept if I apply PCA with n_components=2? For example, suppose the second and fourth dimensions of my data have the highest variance, but I don't know this. I want to apply PCA and have some way to get this information from the results. Again, I don't need to transform the data! – Alberto A Mar 13 '13 at 16:27
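
For concreteness, here is a minimal sketch of the correspondence described in the first comment above. It assumes the 150x4 matrix is something like scikit-learn's built-in iris data, which is an assumption and not part of the original question:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data_matrix = load_iris().data  # 150 x 4, a stand-in for the data described above

pca = PCA()
pca.fit(data_matrix)

# Row i of components_ is the i-th principal direction, expressed as a unit
# vector in the original 4-dimensional feature space; the matching entry of
# explained_variance_ratio_ is the fraction of total variance along it.
for ratio, component in zip(pca.explained_variance_ratio_, pca.components_):
    print(ratio, component)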

1 Answer


The values returned in pca.explained_variance_ratio_ are the fractions of the total variance explained by each principal component. You can use them to decide how many dimensions (components) to keep when transforming your data with PCA. You can use a threshold for that (e.g., count how many ratios are greater than 0.5). After that, you can transform the data with PCA using a number of components equal to the number of principal components above the threshold. Keep in mind that the reduced data lives on new axes: its dimensions are not the same as the dimensions of the original data.

You can check the example code at this link:

http://scikit-learn.org/dev/tutorial/statistical_inference/unsupervised_learning.html#principal-component-analysis-pca
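
For illustration, here is a rough sketch of the thresholding idea described above; the 0.5 cutoff and the use of the built-in iris data as a stand-in for the 150x4 matrix are assumptions, not part of the original answer:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data_matrix = load_iris().data  # 150 x 4, a stand-in for the data described above

pca = PCA()
pca.fit(data_matrix)

# Keep only the components whose explained-variance ratio clears the cutoff.
threshold = 0.5
n_components = int(np.sum(pca.explained_variance_ratio_ > threshold))

# Project the data onto that many principal components. The result has new
# axes; its columns are not columns of the original data.
reduced = PCA(n_components=n_components).fit_transform(data_matrix)
print(reduced.shape)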

mad
  • Helps, but it doesn't solve my problem. I need to know which dimensions of my original data are going to be eliminated when I transform my data with PCA and choose, for example, n_components=2. In this case, 2 dimensions are going to be eliminated, but knowing which ones is my problem. – Alberto A Mar 13 '13 at 16:30
  • PCA doesn't eliminate some dimensions and keep others from the original data. It transforms your data into a number of new dimensions whose values are completely different from the original ones. – mad Mar 13 '13 at 17:15
  • Yeah, you're right. I've been reading about PCA again, and what I want doesn't make sense because of what you said. Well, I'm accepting your answer! Thanks. – Alberto A Mar 13 '13 at 17:21
  • The 1st PC points in the direction of greatest variance. The index of its largest (absolute) entry is the dimension of greatest variance (see the sketch after these comments). – Ulf Aslak Mar 27 '16 at 19:58
  • @mad thanks a lot for your comment. I realized how PCA works from that. Another question: if I do want to remove features like the OP has asked, what method should I use? – gokul_uf Apr 21 '16 at 00:55
  • @gokul_uf something that might help you is feature selection: http://scikit-learn.org/stable/modules/feature_selection.html. – Alberto A Apr 27 '16 at 21:46
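
As a follow-up to Ulf Aslak's comment, here is a small sketch of that idea: inspect the absolute weights of the first principal component to see which original column contributes most to it (again using the built-in iris data as a stand-in for the data in the question):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data_matrix = load_iris().data  # 150 x 4, a stand-in for the data described above

pca = PCA()
pca.fit(data_matrix)

first_pc = pca.components_[0]  # direction of greatest variance
dominant_dim = int(np.argmax(np.abs(first_pc)))
print("original column with the largest weight in the first PC:", dominant_dim)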