
I have to write a classifier (a Gaussian mixture model) that I use for human action recognition. I have 4 datasets of video. I choose 3 of them as the training set and 1 of them as the testing set. Before I apply the GM model to the training set, I run PCA on it.

pca_coeff = princomp(training_data);                          % principal component coefficients from the training data
score = training_data * pca_coeff;                            % project the training data onto the components
training_data = score(:,1:min(size(score,2),numDimension));   % keep at most numDimension components

During the testing step, what should I do? Should I run a new princomp on the testing data,

new_pca_coeff=princomp(testing_data);
score = testing_data * new_pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));

or should I use the pca_coeff that I computed from the training data?

score = testing_data * pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
Mario Lepore

1 Answer


The classifier is being trained on data in the space defined by the principal components of the training data. It doesn't make sense to evaluate it in a different space; therefore, you should apply the same transformation to the testing data as you did to the training data, and not compute a different pca_coeff.
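
For concreteness, here's a minimal sketch of that idea (assuming your data matrices are observations-by-features and that numDimension is the number of components you keep; note that princomp computes its scores from mean-centered data, so subtracting the training mean from both sets keeps the two projections consistent):

pca_coeff = princomp(training_data);        % fit PCA on the training data only

% princomp centers the data internally, so subtract the *training* mean
% from both sets before projecting with the training coefficients.
training_mean = mean(training_data, 1);
train_proj = bsxfun(@minus, training_data, training_mean) * pca_coeff;
test_proj  = bsxfun(@minus, testing_data,  training_mean) * pca_coeff;

% Keep the same leading components in both sets.
k = min(size(train_proj, 2), numDimension);
train_reduced = train_proj(:, 1:k);
test_reduced  = test_proj(:, 1:k);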

Incidentally, if your testing data is drawn independently from the same distribution as the training data, then for large enough training and test sets, the principal components should be approximately the same.

One method for choosing how many principal components to use involves examining the eigenvalues from the PCA decomposition. You can get these from the princomp function like this:

[pca_coeff, score, eigenvalues] = princomp(data);

The eigenvalues variable will then be an array where each element describes the amount of variance accounted for by the corresponding principal component. If you do:

plot(eigenvalues);

you should see that the first eigenvalue is the largest, and that they decrease rapidly (this is called a "scree plot", and should look like this: http://www.ats.ucla.edu/stat/SPSS/output/spss_output_pca_5.gif, though yours may have up to 800 points instead of 12).

Principal components with small corresponding eigenvalues are unlikely to be useful, since the variance of the data in those dimensions is so small. Many people choose a threshold value and then select all principal components whose eigenvalue is above that threshold. An informal way of picking the threshold is to look at the scree plot and choose the threshold to be just after the line 'levels out'; in the image I linked earlier, a good value might be ~0.8, selecting 3 or 4 principal components.
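
For example, a threshold-based selection might look something like this (the 0.8 value is only an illustration read off a scree plot, not a recommendation):

threshold = 0.8;                              % example value read off the scree plot
numDimension = sum(eigenvalues > threshold);  % number of components above the threshold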

IIRC, you could do something like:

proportion_of_variance = sum(eigenvalues(1:k)) ./ sum(eigenvalues);   % fraction of total variance captured by the first k components

to calculate "the proportion of variance described by the low dimensional data".
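
A related recipe (sketched here with a 95% target chosen purely as an example) is to keep the smallest number of components whose cumulative proportion of variance reaches some target:

explained = cumsum(eigenvalues) ./ sum(eigenvalues);   % cumulative fraction of variance explained
numDimension = find(explained >= 0.95, 1, 'first');    % smallest k reaching the 95% target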

However, since you are using the principal components for a classification task, you can't really be sure that any particular number of PCs is optimal; the variance of a feature doesn't necessarily tell you anything about how useful it will be for classification. An alternative to choosing PCs with the scree plot is simply to try classification with various numbers of principal components and see empirically which number works best.
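
A rough sketch of that empirical approach is below; it reuses the projected train_proj/test_proj matrices from the earlier sketch, and train_labels, test_labels and train_and_evaluate are hypothetical placeholders for your own labels and for whatever routine trains the GMM classifier and returns its test accuracy:

% Try several candidate dimensionalities and keep whichever classifies best.
% train_and_evaluate is a hypothetical helper: it should train the GMM
% classifier on the first k projected dimensions and return the accuracy.
candidates = [5 10 20 50 100 200];
accuracy = zeros(size(candidates));
for i = 1:numel(candidates)
    k = candidates(i);
    accuracy(i) = train_and_evaluate(train_proj(:, 1:k), train_labels, ...
                                     test_proj(:, 1:k),  test_labels);
end
[~, best] = max(accuracy);
numDimension = candidates(best);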

Richante
  • Thanks Richante, your answer is clear and useful. I have another doubt: how many components do I have to use? For each observation I compute 800 features, and these are the dimensions of the original data. What is the best choice for numDimension? Is there a formula that I can use, or should I choose it by experimental results? – Mario Lepore May 30 '12 at 16:08
  • I've added some information to my original answer to describe how to choose the number of principal components. The short answer is: there isn't really a good formula; choosing by experiment is probably fine. – Richante May 31 '12 at 13:19
  • Regarding your last line of code `proportion_of_variance = ...`, the Matlab docs calculate this as `proportion_of_variance = cumsum(eigenvalues)./sum(eigenvalues)`, which removes the need for that `k` variable; instead you get a vector and can use `find` to locate where the threshold is reached. – Unapiedra Oct 28 '13 at 15:53