
I am finding it hard to link the theory with the implementation. I would appreciate help in knowing where my understanding is wrong.

Notation: matrices are written as bold capital letters and vectors as bold lowercase letters.

**X** is a dataset of <code>n</code> observations, each with <code>d</code> variables. Given these observed <code>d</code>-dimensional data vectors, the principal axes are **w**<sub>j</sub>, for j in {1,...,m}, where <code>m</code> is the target dimension.

The <code>m</code> principal components of the observed data matrix would be **Y = X W** where matrix **Y** in ℝ<sup> n x m</sup>, matrix **X** in ℝ<sup> n x d</sup>, and matrix **W** in ℝ<sup> d x m</sup>.

The columns of **W** form an orthonormal basis for the <code>m</code>-dimensional subspace, and the output **y**<sub>n</sub> is the principal-component projection of observation **x**<sub>n</sub> that minimizes the squared reconstruction error:

∑<sub>n</sub> ||**x**<sub>n</sub> - **x̂**<sub>n</sub>||<sup>2</sup>
and the optimal reconstruction of **x**<sub>n</sub> is given by **x̂**<sub>n</sub> = **W** **y**<sub>n</sub>.
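
For concreteness, here is a minimal MATLAB sketch of this projection and reconstruction (hypothetical variables, not part of the code below; it assumes **X** is stored as an n-by-d matrix `X` with observations in rows and `m` is the target dimension):

    Xc   = X - repmat(mean(X,1), size(X,1), 1);  % center the data
    [W, ~] = eigs(cov(Xc), m);                   % d-by-m matrix of principal axes
    Y    = Xc * W;                               % n-by-m principal components
    Xhat = Y * W';                               % reconstruction in the centered space
    err  = sum(sum((Xc - Xhat).^2));             % squared reconstruction error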

The data model is

    X(i,j) = A(i,:)*S(:,j) + noise

where PCA is applied to X and the output should recover S, i.e. S should be equal to Y.

Problem 1: The reduced data Y is not equal to the S used in the model. Where is my understanding wrong?

Problem 2: How do I reconstruct the data so that the reconstruction error is minimal?

Please help. Thank you.

    clear all
    clc
    n1 = 5;      % d, the observed dimension
    n2 = 500;    % number of examples

    ncomp = 2;   % target reduced dimension
    % Generating data according to the model
    %   X(i,j) = A(i,:)*S(:,j) + noise
    Ar = orth(randn(n1,ncomp))*diag(ncomp:-1:1);
    T = 1:n2;
    % Generating synthetic data from a dynamical model
    S = [ exp(-T/150).*cos( 2*pi*T/50 )
          exp(-T/150).*sin( 2*pi*T/50 ) ];
    % Normalizing to zero mean and unit variance
    S = ( S - repmat( mean(S,2), 1, n2 ) );
    S = S ./ repmat( sqrt( mean( S.^2, 2 ) ), 1, n2 );
    Xr = Ar * S;
    Xrnoise = Xr + 0.2 * randn(n1,n2);

    h1 = tsplot(S);   % tsplot: plotting helper (not a MATLAB built-in)

    X = Xrnoise;

    XX = X';
    [pc, ~] = eigs(cov(XX), ncomp);
    Y = XX*pc;

UPDATE [10 Aug]

Based on the Answer, here is the full code that includes the reconstruction step:

    clear all
    clc
    n1 = 5;      % d, the observed dimension
    n2 = 500;    % number of examples

    ncomp = 2;   % target reduced dimension
    % Generating data according to the model
    %   X(i,j) = A(i,:)*S(:,j) + noise
    Ar = orth(randn(n1,ncomp))*diag(ncomp:-1:1);
    T = 1:n2;
    % Generating synthetic data from a dynamical model
    S = [ exp(-T/150).*cos( 2*pi*T/50 )
          exp(-T/150).*sin( 2*pi*T/50 ) ];
    % Normalizing to zero mean and unit variance
    S = ( S - repmat( mean(S,2), 1, n2 ) );
    S = S ./ repmat( sqrt( mean( S.^2, 2 ) ), 1, n2 );
    Xr = Ar * S;
    Xrnoise = Xr + 0.2 * randn(n1,n2);

    X = Xrnoise;

    XX = X';
    [pc, ~] = eigs(cov(XX), ncomp);
    Y = XX*pc; % Y are the principal components of X'
    % What you call pc is misleading, these are not the principal components.
    % These Y columns are orthogonal, and should span the same space
    % as S approximately indeed (not exactly, since you introduced noise).

    % If you want to reconstruct,
    % the original data can be retrieved by projecting
    % the principal components back onto the original space like this:
    Xrnoise_reconstructed = Y*pc';

    % Then, you still need to project it through
    % to the S space, if you want to reconstruct S
    S_reconstruct = Ar'*Xrnoise_reconstructed';

    plot(1:length(S_reconstruct), S_reconstruct, 'r')
    hold on
    plot(1:length(S), S)

The plot I get is very different from the one shown in the Answer. Only one component of S closely matches its counterpart in S_reconstruct. Shouldn't the entire original 2-dimensional space of the source input S be reconstructed? Even if I remove the noise, still only one component of S is reconstructed exactly.

SKM
  • I guess this question is more suited for the signal processing stackexchange http://dsp.stackexchange.com/ , since it deals more with the theory than with the code implementation – Noel Segura Meraz Aug 04 '16 at 06:43
  • What is `Sr` in this line `S = S ./ repmat( sqrt( mean( Sr.^2, 2 ) ), 1, n2 );` ? – Sardar Usama Aug 04 '16 at 08:38

4 Answers


I see nobody answered your question, so here goes:

What you computed in Y are the principal components of X' (what you call pc is misleading; these are not the principal components). These Y columns are orthogonal, and should indeed approximately span the same space as S (not exactly, since you introduced noise).
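
One quick way to check that claim numerically (a small sketch reusing `S` and `Y` from your code; `subspace` is the MATLAB built-in that returns the angle between two subspaces):

    theta = subspace(S', Y);   % angle between the 2-D subspaces spanned by S' and Y
    disp(theta)                % close to zero, up to the effect of the added noise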

If you want to reconstruct Xrnoise, you must look at the theory (e.g. here) and apply it correctly: the original data can be retrieved by projecting the principal components back onto the original space like this:

    Xrnoise_reconstructed = Y*pc';

Then, you still need to transform it through `pinv(Ar)*Xrnoise_reconstructed'`, if you want to reconstruct S.
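
In code, that second step would be (a small sketch reusing `Ar` and `Xrnoise_reconstructed` from above):

    S_reconstruct = pinv(Ar) * Xrnoise_reconstructed';   % 2-by-n2, directly comparable to S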

Matches nicely for me: [plot comparing Xrnoise and its reconstruction]

answer to UPDATE [10 Aug]: (EDITED 12 Aug)

Your Ar matrix does not define an orthonormal basis, and as such, the transpose Ar' is not the reverse transformation. The earlier answer I provided was thus wrong. The answer has been corrected above.
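
You can see the difference directly with a short check (reusing `Ar` from the question's code):

    disp(Ar' * Ar)        % not the identity, because of the diag(ncomp:-1:1) scaling
    disp(pinv(Ar) * Ar)   % the 2-by-2 identity: pinv(Ar) is a proper left inverse

Since `Ar'*Ar` equals `diag(ncomp:-1:1).^2`, i.e. `diag([4 1])` here, applying `Ar'` scales one recovered component by 4 while leaving the other untouched, which is why only one component of S appeared to match exactly in your plot.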

reverse_engineer
  • Thank you for your reply. I have one doubt: Is the statement `Y = XX*pc;` incorrect? – SKM Aug 10 '16 at 16:55
  • `Y = XX*pc;` is not a statement, it is an equation that establishes the definition of `Y`. So it is not inherently correct or incorrect. But it effectively computes the principal components, so `Y` represents the first two principal components of `X`. – reverse_engineer Aug 10 '16 at 19:30
  • Could you please have a look at the Update? It contains the full implementation, which includes your answer. My plot looks quite different from yours. Only one component of S exactly matches its counterpart in S_reconstruct. (Question 1) Shouldn't the entire original 2-dimensional space of the source input S be reconstructed? Are the implementation and my understanding not correct? (Question 2) Would PCA work on binary data? Can I apply the same program for binary PCA? Thank you for your time and effort. – SKM Aug 11 '16 at 04:40
  • @SKM see my updated answer. On your question of binary data, PCA is a technique of linear algebra designed for real or complex matrices. A binary number is a subset of the real or complex number set, so yes it applies, but there might be more adapted techniques for binary data. – reverse_engineer Aug 11 '16 at 07:46
  • Thank you for the updated answer, but it would really help if you could kindly put up the code that explains the proposed strategy, since I cannot understand how to get AR_norm (the last point). This part is unclear, could you please elaborate? – SKM Aug 12 '16 at 05:38
  • @SKM sorry was mistakenly assuming that `Ar` was an orthonormal basis, but I saw you multiplied it with `diag(ncomp:-1:1)`, which removes that property. Please see my updated answer. – reverse_engineer Aug 12 '16 at 12:54

Your understanding is quite right. One of the reasons to use PCA is to reduce the dimensionality of the data. The first principal component has the largest sample variance among all normalized linear combinations of the columns of X. The second principal component has maximum variance subject to being orthogonal to the first, and so on.

One might then do PCA on a dataset and decide to cut off the last principal component, or several of the last principal components, of the data. This is done to reduce the effect of the curse of dimensionality. The curse of dimensionality is a term used to point out the fact that any group of vectors is sparse in a relatively high-dimensional space. Conversely, this means that you would need an absurd amount of data to form any model on a fairly high-dimensional dataset, such as a word histogram of a text document with possibly tens of thousands of dimensions.

In effect, dimensionality reduction by PCA merges variables that are strongly correlated and discards the low-variance components. For example, let's take a look at a picture:

[Cameraman test image]

As you can see, most of the values are almost the same, strongly correlated. You could merge some of these correlated pixels by removing the last principal components. This would reduce the dimensionality of the image and compress it, at the cost of removing some of the information in the image.
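
As an illustration, here is a minimal sketch of that idea in MATLAB (it assumes the `cameraman.tif` demo image and `imshow` from the Image Processing Toolbox, and treats each image row as one observation):

    I  = double(imread('cameraman.tif'));                     % 256-by-256 grayscale image
    mu = mean(I, 1);                                          % mean of each column
    Ic = I - repmat(mu, size(I,1), 1);                        % center the columns
    [coeff, ~] = eigs(cov(Ic), 20);                           % keep only the 20 leading principal axes
    I_approx = (Ic*coeff)*coeff' + repmat(mu, size(I,1), 1);  % project and reconstruct
    figure, imshow(uint8(I_approx))                           % compressed approximation of the image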

There is no magic way that I'm aware of to determine the best number of principal components or the best reconstruction.

Tapio

Forgive me if I am not mathematically rigorous. If we look at the equation X = A*S, we can say that we are taking some two-dimensional data and mapping it to a 2-dimensional subspace of a 5-dimensional space, where A is a basis for that 2-dimensional subspace.

When we solve the PCA problem for X and look at PC (the principal axes), we see that the two big eigenvectors (which correspond to the two largest eigenvalues) span the same subspace that A did. (Multiply A'*PC and see that for the three eigenvectors with small eigenvalues we get ~0, which means those vectors are orthogonal to A, and only for the two largest do we get values different from 0.)
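
A quick sketch of that check, reusing `Ar` and `Xrnoise` from the question's code:

    [PC, ~] = eigs(cov(Xrnoise'), 5);   % all 5 eigenvectors, largest eigenvalues first
    disp(Ar' * PC)                      % last 3 columns are approximately 0: those directions are orthogonal to A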

So what I think is that the reason we get a different basis for this two-dimensional space is that X = A*S can be the product of some A1 and S1 and also of some other A2 and S2, and we will still get X = A1*S1 = A2*S2. What PCA gives us is one particular basis, the one that maximizes the variance along each dimension.

So how do you solve your problem? I can see that you chose as the test data an exponential times sine and cosine, so I think you are dealing with a specific kind of data. I am not an expert in signal processing, but have a look at the MUSIC algorithm.

Amitay Nachmani
  • Thank you for your reply. My testing data S can be generated from any system; it can also be random. X = A*S + noise (eq1) is the data model, and dimensionality reduction on X is done using the equation S = A*X, which gives the principal components, i.e. the data in reduced dimensionality. I have 2 questions, could you please help? – SKM Aug 04 '16 at 16:54
  • S denotes the reduced-dimensionality data; the actual observed high-dimensional data is X, whose dimensionality needs to be reduced. It can also be random. X = A*S + noise (eq1) is the data model, and dimensionality reduction on X is done using the equation S = A*X, which gives the principal components, i.e. the data in reduced dimensionality. I have 2 questions, could you please help? – SKM Aug 04 '16 at 17:00
  • (1) What is unclear is the last line of the implementation, where I did `Y = XX*pc`. Would this statement give Y as the reduced-dimensional data that is equal to the S used in eq1? I ask because I need to apply the least-squares estimator to estimate A and S from the model in eq1. After estimation, I should be able to calculate the mean square error between S and its estimate. Where is my understanding wrong? – SKM Aug 04 '16 at 17:01
  • (2) Can PCA be done on observed high-dimensional data that has binary values? – SKM Aug 04 '16 at 17:02
  • I don't understand why you think that S=Y. Let's go over the equations and, for simplicity, neglect the noise. We have X=A*S (eq2), and taking the transpose of both sides gives X'=S'*A'. We also have Y=X'*PC (eq3). Plugging eq2 into eq3 we get Y=S'*A'*PC, therefore if you claim that Y = S' then A'*PC = I and PC = A. BUT who said that finding the PC will give you A and not some other orthogonal basis for this subspace? Maybe I am also missing something. If you have the paper that you are basing your algorithm on, I will be happy to take a look. – Amitay Nachmani Aug 04 '16 at 17:18
  • Regarding (2), I don't know, I need to think about it. – Amitay Nachmani Aug 04 '16 at 17:19
  • The paper can be downloaded here https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bishop-ppca-jrss.pdf I got this approach based on my earlier Question asked http://stats.stackexchange.com/questions/226224/formulating-a-maximum-likelihood-estimation-problem – SKM Aug 04 '16 at 19:22
  • The problem for which I had posted the Question was how to find a set of reduced data points of a high-dimensional dataset such that the information about the original data in the high-dimensional space is preserved. If you kindly look at the answer there, X_reduced is my S that I need to estimate. Before estimating S, I was implementing a check on whether the known S is equal to the output of PCA on X. – SKM Aug 04 '16 at 19:25

You could use the `pca` function from the Statistics Toolbox.

    coeff = pca(X)

From the documentation, each column of coeff contains the coefficients for one principal component, so you can project the observed data X onto those components by multiplying with coeff, e.g. X*coeff.
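
For completeness, a small sketch of how this might be applied to the data from the question (note that `pca` expects observations in rows, so `Xrnoise` must be transposed; the reconstruction line is an assumption based on the standard score/coeff relationship, not code from the question):

    [coeff, score, ~, ~, explained] = pca(Xrnoise', 'NumComponents', 2);
    Y_pca   = score;                                            % n2-by-2 principal components
    X_recon = score*coeff' + repmat(mean(Xrnoise',1), n2, 1);   % reconstruction in the original 5-D space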

Salman