
I have a dataset of size 200x119, i.e. 200 samples and 119 variables/features. I want to use PCA to optimize my feature set by selecting only those features that contribute significantly to classification.

I understand the concept of PCA but am unable to implement it. I have obtained the coeff and score matrices of my data using the pca function:

[coeff, score] = pca(data);

The coeff matrix is now of size 119x119.

But what do I do with this information? My goal is to obtain a reduced dataset that can be fed into a classifier. I have gone through the documentation for pcares and looked at similar questions about this issue, but I cannot see how [residuals, reconstructed] = pcares(data, ndim) helps me "reduce" the size of my dataset. How do I go about choosing the ndim parameter?
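
For reference, this is the reconstruction from those similar questions, adapted to my variable names (a sketch only; the value of ndim here is an arbitrary example, since how to choose it is exactly what I am asking):

[coeff, score] = pca(data);    %// coeff: 119 x 119, score: 200 x 119
n = size(data, 1);             %// n = 200 samples
ndim = 40;                     %// arbitrary example value -- this choice is what I am asking about
reconstructed = repmat(mean(data,1), n, 1) + score(:,1:ndim)*coeff(:,1:ndim)';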

EDIT

I used the following code to reduce the dataset:

B = data;
sigma = cov(B);    %// 119 x 119 covariance matrix

%// Find eigenvalues and eigenvectors of the covariance matrix
[A, D] = eig(sigma);
vals = diag(D);

%// Sort the eigenvalues in descending order
[~, ind] = sort(abs(vals), 'descend');

%// Rearrange the eigenvectors to match
Asort = A(:, ind);

%// Mean-subtract the data
Bm = bsxfun(@minus, B, mean(B, 1));

%// Project the data onto the principal components
Bproject = Bm * Asort;

However, my Bproject is still of size 200x119.

I do not understand this. Please explain.

Jennifer
  • [This post](https://stackoverflow.com/questions/30353432/dimensionality-reduction-in-matlab/30378275#30378275) is useful to understand what `pcares` does. – m7913d Jun 22 '17 at 13:59
  • I already checked this post. There they chose ndim=1. My question is: how do I choose the ndim for my data? Can it be anything less than 119? Also, when I tried the code given in the above link, I got an error that says matrix dimensions must agree for this line: `reconstructed = repmat(mean(X,1),n,1) + score(:,1:ndim)*coeff(:,1:ndim)';` – Jennifer Jun 22 '17 at 14:08
  • `ndim` for `pcares` in this case is the number of principal components to take. `reconstructed` from `pcares` is what you would feed into the classifier. Typically you choose the number of principal components such that the percentage of the variance explained in your data exceeds a certain amount. That is a parameter that is chosen by you but typical values range from 90% to 99%. Please check the duplicate. – rayryeng Jun 22 '17 at 14:16
  • I still don't get it. How can I decide the 90-99% from the coeff matrix? I need a scalar value, right? I checked those duplicate answers and I am still confused. I gave ndim as 50 and I got an error that says too many input arguments. How do I choose the optimum ndim parameter? – Jennifer Jun 22 '17 at 14:33
  • You don't do it from the coefficient matrix, you do it from the **singular values / eigenvalues** from the PCA. Once you extract the singular values, you find the cumulative sum of the elements, then divide by the total sum. Find the point where the cumulative sum exceeds 90% or 99% or whatever, and that point determines the number of components you need. [This post by Amro](https://stackoverflow.com/questions/6691289/how-to-check-whether-the-image-is-compressed-or-not-after-applying-svd-on-that-i/6712532#6712532) will help - look at the beginning where the variances are being plotted (see also the sketch after this comment thread). – rayryeng Jun 22 '17 at 14:39
  • I added a method based on the relative error [here](https://stackoverflow.com/a/44702639/7621674). – m7913d Jun 22 '17 at 14:44
  • Please check the edit I made to the question and guide me. – Jennifer Jun 22 '17 at 14:57
  • @m7913d, when I use the code you provided in the link, I get an error that says matrix dimensions don't agree. My data is 200x119, but my coeff and score matrices are both 119x119. So this line of code `reconstructed = repmat(mean(data,1),n,1) + score(:,1:ndim)*coeff(:,1:ndim)';` gives an error that matrix dimensions don't agree. My n is 200 and my p is 119. How do I overcome this? – Jennifer Jun 22 '17 at 15:06
  • `score` should be `200x119`, i.e. `n x p`. – m7913d Jun 22 '17 at 16:05
  • Okay I fixed this. But my reconstructed matrix is of the same size as the original matrix. Both are 200x119. Is that right? Shouldn't my size reduce? @m7913d – Jennifer Jun 22 '17 at 16:19
  • No, because your `reconstructed` matrix is a reconstruction of your original data using the first `ndim` principal components, which enables you to calculate the relative error. The reduced matrices are `coeff(:,1:ndim)` (the principal components used) and `score(:,1:ndim)` (the measured values projected onto the principal components; equivalently, the coordinates of your measurements in the principal-component coordinate system, i.e. your reduced data). – m7913d Jun 22 '17 at 16:21
  • Okay, after doing score(:, 1:ndim), my reduced matrix is of size 119x119. I am not sure if it's right, as the examples reduced from 200 to 119, but my features are still 119. Is that correct? @m7913d – Jennifer Jun 22 '17 at 16:31
  • You can choose your own `ndim` based on the generated graph – m7913d Jun 22 '17 at 16:34
  • How do I generate that graph? I tried `plot(ndims, relativeError(ndim))` within the loop but it didn't plot anything. I arbitrarily chose ndim as 40 without plotting. @m7913d – Jennifer Jun 22 '17 at 16:40
  • I had another doubt. My coeff matrix is 119x119 and my reduced score matrix is 119x40. But why are my samples decreasing from 200 to 119? Only my features should decrease, right? I am mapping my 200 examples from a 119-dimensional feature space to a 40-dimensional space, so my number of samples should still be 200 even in the 40-dimensional space, right? @m7913d – Jennifer Jun 22 '17 at 16:54
  • Yes, `size(score(:,1:40))` should be `200x40`, which is the case if I run my example. – m7913d Jun 23 '17 at 07:11
  • If you want to perform data reduction, you should only use the `ndim` most relevant eigenvectors, as in the sketch below. – m7913d Jun 24 '17 at 12:25
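
Putting the advice from this thread together, here is a minimal sketch of the whole pipeline (a summary, not code from the linked posts; `reducedData` and `Breduced` are names introduced here, and the 95% cut-off is just one example from the 90-99% range suggested above):

%// Choose ndim from the cumulative explained variance, then take the
%// projected scores as the reduced dataset.
[coeff, score, latent] = pca(data);        %// latent holds the variance of each component

explained = cumsum(latent) / sum(latent);  %// cumulative fraction of variance explained
ndim = find(explained >= 0.95, 1);         %// smallest ndim explaining at least 95%

reducedData = score(:, 1:ndim);            %// 200 x ndim -- feed this to the classifier

%// Equivalently, with the eig-based code from the question, keep only
%// the ndim most relevant eigenvectors:
%// Breduced = Bm * Asort(:, 1:ndim);

plot(explained);                           %// the graph mentioned in the comments
xlabel('Number of principal components');
ylabel('Cumulative variance explained');

The number of samples stays at 200 throughout; only the feature dimension shrinks from 119 to ndim.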
