
I am using PCA to find out which variables in my dataset are redundant because they are highly correlated with other variables. I am using the princomp MATLAB function on the data, previously normalized using zscore:

[coeff, PC, eigenvalues] = princomp(zscore(x))

I know that the eigenvalues tell me how much of the dataset's variation each principal component covers, and that coeff tells me how much of the i-th original variable is in the j-th principal component (where i indexes rows and j indexes columns).

So I assumed that to find out which variables from the original dataset are the most important and which are the least important, I should multiply the coeff matrix by the eigenvalues: the coeff values represent how much of each variable every component has, and the eigenvalues tell how important each component is. So this is my full code:

[coeff, PC, eigenvalues] = princomp(zscore(x));
e = eigenvalues./sum(eigenvalues);
abs(coeff)/e

But this does not really show anything. I tried it on the following set, where variable 1 is fully correlated with variable 2 (v2 = v1 + 2):

     v1    v2    v3
     1     3     4
     2     4    -1
     4     6     9
     3     5    -2

but the results of my calculations were the following:

v1 0.5525
v2 0.5525
v3 0.5264

and this does not really show anything. I would expect the result for variable 2 to show that it is far less important than v1 or v3. Which of my assumptions is wrong?

agnieszka

1 Answer


EDIT I have completely reworked the answer now that I understand which assumptions were wrong.

Before explaining what doesn't work in the OP, let me make sure we're using the same terminology. In principal component analysis, the goal is to obtain a coordinate transformation that separates the observations well, and that may make it easy to describe the data, i.e. the different multi-dimensional observations, in a lower-dimensional space. Observations are multidimensional when they are made up of multiple measurements. If there are fewer linearly independent observations than there are measurements, we expect at least one of the eigenvalues to be zero, because e.g. two linearly independent observation vectors in a 3D space can be described by a 2D plane.

If we have an array

x = [    1     3     4
         2     4    -1
         4     6     9
         3     5    -2];

that consists of four observations with three measurements each, princomp(x) will find the lower-dimensional space spanned by the four observations. Since two of the measurements are co-dependent, one of the eigenvalues will be near zero: the space of measurements is really only 2D, not 3D, which is probably the result you wanted to find. Indeed, if you inspect the eigenvectors (coeff), you can see right away that the first two measurements are collinear: their loadings on the first two components are identical, and the third component, the one with the near-zero eigenvalue, is essentially their difference.

coeff = princomp(x)
coeff =
      0.10124      0.69982      0.70711
      0.10124      0.69982     -0.70711
       0.9897     -0.14317   1.1102e-16

Since v1 and v2 enter that third, near-zero-variance component with opposite signs (0.70711 and -0.70711), the component only responds to the difference v1 - v2. On its own it cannot distinguish [1 1 25] from [1000 1000 25], and it certainly cannot tell you which of the two variables you should keep.
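
To make that concrete, here is a minimal sketch that projects those two hypothetical observations onto the third eigenvector (it assumes the coeff shown above, i.e. that the near-zero eigenvalue comes third; w3 and mu are names introduced just for the sketch):

x = [1 3 4; 2 4 -1; 4 6 9; 3 5 -2];
[coeff, PC, eigenvalues] = princomp(x);
w3 = coeff(:, 3);              % eigenvector whose eigenvalue is ~0
mu = mean(x, 1);               % princomp centers the data by the column means
([1    1    25] - mu) * w3     % projection of the first hypothetical observation
([1000 1000 25] - mu) * w3     % identical up to round-off: only v1 - v2 matters here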

Now suppose we want to find out whether any measurements are linearly dependent, and we really want to use principal components for this (because in real life measurements may not be perfectly collinear, and we are interested in finding good descriptor vectors for, say, a machine-learning application). Then it makes a lot more sense to treat the three measurements as "observations", and to run princomp(x'). Since there are now only three "observations" but four "measurements", the fourth eigenvalue will necessarily be zero. And because two of those observations are linearly dependent, we are left with only two non-zero eigenvalues:
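
(For reference, the output below can be reproduced with a call along these lines; the output-argument names on the left are placeholders of my choosing:)

[coeffT, PCT, eigenvalues] = princomp(x');   % rows of x' are the three measurements
eigenvalues                                  % displays the four values below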

eigenvalues =
       24.263
       3.7368
            0
            0

To find out which of the measurements are so highly correlated (not actually necessary if you use the eigenvector-transformed measurements as input for e.g. machine learning), the best way would be to look at the correlation between the measurements:

corr(x)
  ans =
        1            1      0.35675
        1            1      0.35675
  0.35675      0.35675            1

Unsurprisingly, each measurement is perfectly correlated with itself, and v1 is perfectly correlated with v2.
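
If you want to turn that check into code, a minimal sketch could look like the following (the 0.99 cutoff is an arbitrary threshold chosen for illustration, not something prescribed by corr):

R = corr(x);                            % pairwise correlations between the measurements
[i, j] = find(triu(abs(R), 1) > 0.99);  % strictly upper triangle, so each pair appears once
redundantPairs = [i j]                  % for this x: [1 2], i.e. v1 and v2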

EDIT2

but the eigenvalues tell us which vectors in the new space are most important (cover the most of variation) and also coefficients tell us how much of each variable is in each component. so I assume we can use this data to find out which of the original variables hold the most of variance and thus are most important (and get rid of those that represent small amount)

This works if your observations show very little variance in one measurement variable (e.g. where x = [1 2 3; 1 4 22; 1 25 -25; 1 11 100]; here the first variable contributes nothing to the variance). However, with collinear measurements, both vectors hold equivalent information and contribute equally to the variance. Thus, the eigenvectors (coefficients) are likely to be similar to one another.
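
As a sketch of that case (reusing the x above and the question's weight-by-eigenvalues idea; zscore is skipped here because the first column has zero standard deviation and would produce NaNs):

x2 = [1 2 3; 1 4 22; 1 25 -25; 1 11 100];
[coeff2, PC2, ev2] = princomp(x2);
e2 = ev2 ./ sum(ev2);
abs(coeff2) * e2     % first entry is ~0, so v1 is correctly flagged as unimportant;
                     % with collinear v1 and v2, as in the question, the weights come out equal instead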


In order for @agnieszka's comments to keep making sense, I have left the original points 1-4 of my answer below. Note that #3 was in response to the division of the eigenvectors by the eigenvalues, which to me didn't make a lot of sense.

  1. the vectors should be in rows, not columns (each vector is an observation).
  2. coeff returns the basis vectors of the principal components, and its order has little to do with the original input
  3. To see the importance of the principal components, you use eigenvalues/sum(eigenvalues)
  4. If you have two collinear vectors, you can't say that the first is important and the second isn't. How do you know that it shouldn't be the other way around? If you want to test for collinearity, you should check the rank of the array instead, or call unique on normalized (i.e. norm equal to 1) vectors (see the sketch after this list).
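
A minimal sketch of that rank check, under my reading of it (centering first so that affine relations such as v2 = v1 + 2 are caught as well):

xc = bsxfun(@minus, x, mean(x, 1));   % remove each measurement's mean
if rank(xc) < size(x, 2)
    disp('at least one measurement is a linear combination of the others')
end
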
Jonas
  • 2. wrong; coeff columns that correspond to principal components are in decreasing order BUT the rows correspond to variables in right order and this is the only assumption that I make (coefficients for PC 1 are in column 1 and row 1 corresponds to v1, row 2 to v2 and so on) – agnieszka Sep 28 '11 at 21:57
  • what is wrong with 3? eigenvalues tell us about principal components variance covering, making it eigenvalues./sum(eigenvalues) only calculates the percentage – agnieszka Sep 28 '11 at 21:59
  • 4. if I have two collinear vectors for example height of a person in cm and height of a person in inches I do NOT need both this informations because they actually represent the same feature. they do not add any value for differentiating objects (observations) and i would like to find these variables that after removing from dataset will not significantly impact variation – agnieszka Sep 28 '11 at 22:01
  • 1. vectors ARE in rows (v1, v2, v3 are variables, not vectors) – agnieszka Sep 28 '11 at 22:10
  • @agnieszka PCA does not "remove" variables, as mentioned, if you are interested in finding linear combinations, have a look at the rank. PCA finds a new set of perpendicular axis (formed by the eigenvectors of the cov matrix of your data) to which your data is mapped, combining variables in this process into new ones. Have a look at this [tutorial](http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf). – Maurits Sep 28 '11 at 22:23
  • @agnieszka: I think we have misunderstood one another. Please have a look at my edit. – Jonas Sep 29 '11 at 01:22
  • @Mauritus - i know it does not remove variables and i know it finds new vectors than represent the same data in new space. but the eigenvalues tell us which vectors in the new space are most important (cover the most of variation) and also coefficients tell us how much of each variable is in each component. so I assume we can use this data to find out which of the original variables hold the most of variance and thus are most important (and get rid of those that represent small amount) – agnieszka Sep 29 '11 at 06:51
  • @agnieszka: Yes, but in your case, `v1` and `v2` contribute equally to the variance in that dimension. Please see my edit #2 for a longer explanation. – Jonas Sep 29 '11 at 11:41
  • ok, thanks Jonas, I think I now get what you are saying. After analysing your response I came with some other idea - isn't it this way that when i-th and j-th variable are correlated then i-th and j-th coeff rows will have similar values? because when they are perfectly correlated, the values are identical. also I observed that when the values are highly correlated the coeff values were very similar. I am not sure it this observation is true – agnieszka Oct 02 '11 at 14:30