
In MATLAB I perform PCA on handwritten digits. I have 30x30-pixel images, i.e. 900-dimensional vectors, and after PCA I keep the components that capture most of the variance, say the first 80 principal components (PCs) based on some threshold. These 80 PCs are themselves 900-dimensional, and when I reshape them and plot them with imshow, I get images that look roughly like a 0, a 6, a 3, a 5, etc. What is the interpretation of the first few of these PCs (out of the 80 I extracted)?
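Roughly, this is what I am doing (a minimal sketch; `X` is an N-by-900 matrix holding one flattened image per row, and `pca` is from the Statistics and Machine Learning Toolbox, so the variable names are just for illustration):

```matlab
% Minimal sketch of the workflow, assuming X is N-by-900 (one flattened
% 30x30 image per row); pca is from the Statistics and Machine Learning Toolbox.
[coeff, score, latent] = pca(X);

% Keep enough components to reach some variance threshold, e.g. 95%.
explained = cumsum(latent) / sum(latent);
k = find(explained >= 0.95, 1);
fprintf('keeping %d of %d components\n', k, numel(latent));

% Each column of coeff is a 900-dimensional vector; reshape and display
% the first few as 30x30 images.
for i = 1:4
    subplot(1, 4, i);
    imshow(reshape(coeff(:, i), 30, 30), []);
end
```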


2 Answers


First a bit of nomenclature: PCA finds the eigenvectors of the data covariance matrix and then transforms the data into the basis of those eigenvectors. The result of this operation has two parts: the transformed data and the eigenvectors used for the transformation. It is usually the former that is called the principal components (PCs). There is no established name for the latter, but I like the term principal modes (PMs). Your question is about the interpretation of the PMs.

The interpretation of PMs is generally hard, if not impossible. Of course they have a simple technical interpretation: PCA looks for directions in the data space along which a maximum amount of variation occurs, and the PMs are those directions. For the first component this maximization is unconstrained and captures the main direction along which variation occurs. Later components are constrained to be orthogonal to the ones before, which often leads to more and more complex, high-frequency patterns that are harder and harder to interpret.
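In MATLAB terms the two parts come out of pca as separate outputs. This is only a sketch, assuming your data sit in an N-by-900 matrix X (one image per row); the variable names are placeholders:

```matlab
% Sketch of the two parts, assuming X is N-by-900 (one image per row).
[PM, PC, latent] = pca(X);   % PM: eigenvectors ("principal modes"), one per column
                             % PC: transformed data ("principal components")
                             % latent: variance along each mode, sorted descending

% The transformation is just a change of basis of the mean-centered data:
Xc = X - repmat(mean(X, 1), size(X, 1), 1);
max(max(abs(Xc * PM - PC)))                    % numerically ~0

% The modes are orthonormal, which is the orthogonality constraint above:
max(max(abs(PM' * PM - eye(size(PM, 2)))))     % numerically ~0
```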

In the case of your data set, the situation may be different, because it probably has some kind of cluster structure in a very high-dimensional space, where the clusters correspond to the digits 0–9. It has been observed that in such a case there is a weak correspondence between PCA and k-means clustering, such that the first PMs tend to recover the space spanned by the cluster centroids. In this case the first PMs will be mixtures of cluster centroid patterns, possibly even approximately coinciding with these patterns. I think this is what explains your observation.


More information in response to the OP's comments:

The Wikipedia snippet cited above refers to Ding and He (2004), K-means Clustering via Principal Component Analysis (link). They write in the abstract: "Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering." To my understanding this means that the "component load", i.e. the value of the principal component for a given data point, is, or is at least related to, an indicator of whether that data point belongs to a cluster. They continue: "Equivalently, we show that the subspace spanned by the cluster centroids are given by spectral expansion of the data covariance matrix truncated at K − 1 terms." This means that the subspace of the data space (in your case 900-dimensional) that is spanned by the first K − 1 principal modes (eigenvectors) is, or is close to, the space spanned by the (differences between the) cluster centroids (the mean images of the digits). If this is the case, most of the between-cluster variance is captured by those first principal components.
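If you want to check this on your own data, here is a rough sketch of the idea; X and y are placeholder names for your image matrix and digit labels, not anything you have to use:

```matlab
% Rough check of the Ding & He idea, assuming X is N-by-900 and y holds
% the digit label (0..9) of each row.
[PM, ~] = pca(X);
K = 10;
V = PM(:, 1:K-1);                 % subspace spanned by the first K-1 modes

mu = mean(X, 1);
digits = 0:9;
C = zeros(K, size(X, 2));
for i = 1:K
    % difference between the cluster centroid (mean digit image) and the grand mean
    C(i, :) = mean(X(y == digits(i), :), 1) - mu;
end

% Fraction of each centroid difference lying inside span(V); values close
% to 1 mean the centroid subspace is (nearly) contained in the PM subspace.
captured = sum((C * V).^2, 2) ./ sum(C.^2, 2)
```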

Think of it this way: PCA allows you to reduce the 900 dimensions of the data to about 10 dimensions, by reconstructing all 30x30 images from a set of 10 "typical" 30x30 images. That means each image can be approximately encoded by 10 numbers instead of 900 numbers. For this to be possible, the "typical" images have to resemble what a "0", a "1", etc. look like on average. In the simplest case, the 10 images could just be the average "0", the average "1", etc. themselves. That is not normally the case, but it can be approximately the case. Does this help? That the "0" corresponds to the strongest PC is, I think, just coincidence.
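To see this concretely, you could reconstruct one of your images from only its first 10 components and compare it with the original. Again just a sketch, with X and the image index chosen arbitrarily:

```matlab
% Reconstruction sketch: encode one image by its first 10 components only.
[PM, PC] = pca(X);
mu  = mean(X, 1);
k   = 10;
idx = 1;                               % any image in the data set
codes  = PC(idx, 1:k);                 % the 10 numbers encoding this image
approx = mu + codes * PM(:, 1:k)';     % back-projection into pixel space

subplot(1, 2, 1); imshow(reshape(X(idx, :), 30, 30), []); title('original');
subplot(1, 2, 2); imshow(reshape(approx, 30, 30), []); title('10-component approximation');
```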

A. Donda
  • @A. Donda, first of all thanks for taking the time to answer my question. I am not entirely clear about what "there is a weak correspondence between PCA and k-means clustering, such that the first PMs tend to recover the space spanned by the cluster centroids" means. Could you elaborate or rephrase, please? (The short wiki statement is also unclear to me.) Also, by PMs recovering centroids, do you mean that the PMs are a basis for the cluster centroids? – user3676846 Apr 20 '15 at 10:13
  • @user3676846, I elaborated, I hope this helps. I'm afraid to be even more explicit one would really have to go into the details of the cited paper. – A. Donda Apr 20 '15 at 12:54
  • Thanks a lot for your detailed and wonderful answer. I wanted to know: if my first few PCs come out looking like the digits 0, 3, 5, 7, etc., how does that make sense with respect to the handwritten digits? It doesn't make sense to me that some combination of these digits (PCs) would give me my data points, and even if they did, why would 0 have the most weight? – user3676846 Apr 20 '15 at 18:19
  • @user3676846, ok, I tried to explain it less technically, see edit. Does this help? – A. Donda Apr 20 '15 at 18:51
  • This helps a lot! Thank you. What does "more complex, high-frequency patterns " mean? – user3676846 Apr 21 '15 at 17:57
  • @A. Donda, I had another question. Do you think k-means clustering on, say, the first 90 PMs could give higher classification accuracy than clustering in the original raw pixel space? – user3676846 Apr 22 '15 at 13:14
  • @user3676846, with "high-frequency" I refer to spatial frequencies. The PMs for smaller eigenvalues often have many small features that alternate quickly. Unless the underlying dataset is characterized by such small features, too, these PMs are harder to relate to the original images. – A. Donda Apr 22 '15 at 16:35
  • It is possible that classification in a dimensionality-reduced space is better. That basically depends on how much of the between-cluster variance is captured by those first PMs. But it is still possible that the boundaries between clusters have many small details that are not well captured by the first PMs. So, yes, I think it is likely that that is the case, but it is in no way guaranteed by your observation. – A. Donda Apr 22 '15 at 16:38
  • Regarding your question about why accuracy can get better by dimensionality reduction (on the other answer): Constructing clusters in a high-dimensional space is a harder task than in a low-dimensional space. If those dimensions that are thrown away are not actually needed to distinguish clusters, it is very possible that accuracy gets better, simply because the clustering step itself is easier. These kinds of problems with high-dimensional data are often called "the curse of dimensionality". See [Wikipedia](https://en.wikipedia.org/wiki/Clustering_high-dimensional_data) for more information. – A. Donda Apr 22 '15 at 16:50
  • There should be an outline of a checkmark below the number of upvotes. Click it, and it turns green. Thanks! – A. Donda Apr 25 '15 at 11:18

PCA extracts the most important information from the data set and compresses the data by keeping only that important information: the principal components.

The first principal component is constructed in such a way that it has the largest possible variance. The second component is computed under the constraints of being orthogonal to the first component and of having the largest possible variance.

In your case the data is a set of images. Let's say you have 1000 images and you compute the first five principal components (5 images constructed by the PCA algorithm). You may then represent any image either by its 900 pixel values (30x30 pixels) or by a combination of these 5 images with the corresponding multiplication coefficients.

The goal of the PCA algorithm is to construct these 5 images (principal components) in such a way that the images in your data set are represented as accurately as possible by a combination of the given number of principal components.
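In code this could look roughly as follows. It is only a sketch, assuming your images are the rows of a matrix X (the name is just a placeholder):

```matlab
% Sketch: encode one image by 5 coefficients instead of 900 pixel values.
[PM, PC] = pca(X);                       % X: N-by-900, one flattened image per row
mu = mean(X, 1);

coeffs5 = PC(1, 1:5)                     % the 5 numbers describing the first image
approx  = mu + coeffs5 * PM(:, 1:5)';    % weighted combination of the 5 PCA "images"

% Error of the 5-component approximation; it shrinks if more components are used.
err = norm(X(1, :) - approx)
```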

UPDATE:

Consider the image below (from the amazing book by Kevin Murphy). It shows how points in 2 dimensions (red dots) are represented in 1 dimension (green crosses) by projecting them onto a vector (purple line). That vector is the first principal component. The purpose of PCA is to build these vectors so as to minimize the reconstruction error. In your case these vectors can be represented as images.

[Figure from Murphy's book: 2-D data points (red dots) projected onto the first principal component (purple line), with the projections shown as green crosses.]
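If you want to reproduce this picture yourself, here is a tiny made-up example (all numbers and names are just for illustration):

```matlab
% Tiny made-up 2-D example of the figure: project correlated points onto
% their first principal direction.
rng(0);
R  = [cosd(30) -sind(30); sind(30) cosd(30)];        % rotation by 30 degrees
X2 = randn(200, 2) * diag([2, 0.3]) * R;             % elongated point cloud

[PM2, PC2] = pca(X2);
proj = repmat(mean(X2, 1), size(X2, 1), 1) + PC2(:, 1) * PM2(:, 1)';

plot(X2(:, 1), X2(:, 2), 'r.'); hold on;             % original points (red dots)
plot(proj(:, 1), proj(:, 2), 'gx');                  % 1-D reconstructions (green crosses)
axis equal;
```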

You may refer to this article for more details on using PCA for handwritten digit recognition.

Konstantin
    Excellent answer, except that I still don't know the interpretation of the PCs I get other than that their combination will generate the data in some form. Could you elaborate? – user3676846 Apr 20 '15 at 10:16
  • @user3676846, I added an example to my answer – Konstantin Apr 20 '15 at 17:17
  • Thanks once again! I created my PCA images, and these looked like the digits 0, 3, 5, 7, etc. Does this mean that a combination of these images is what gives me my actual digits? Also, since these are the ones which explain maximum variance, I can't understand why 0 would be the most important PC. Could you help? – user3676846 Apr 20 '15 at 18:12
  • Yes, the combination of PCs gives you the actual image, check page 4 [here](http://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/IDAPILecture15.pdf). The zero-looking component happens to be quite good at discriminating digits (8, 9 and 6 look like 0, but 1 and 7 don't). – Konstantin Apr 20 '15 at 20:38
  • I had another question. Do you think k-means clustering on, say, the first 90 PMs could give higher classification accuracy than clustering in the original raw pixel space? – user3676846 Apr 22 '15 at 13:15
  • I did try it, but it seems PCA gives a better result, which kind of seems wrong: if I use all 900 features in both the raw-pixel case and the PCA case, I should get the same accuracy, so if I reduce the number of features in PCA space my accuracy should decrease. But in my case the accuracy increases. – user3676846 Apr 22 '15 at 14:58
  • It means that PCA with 90 components generalizes your data quite well. I see no contradiction here. – Konstantin Apr 22 '15 at 15:16