Calculating new variables from Principal Components - PCA in R

Question

For purpose of learning PCA in R, I have run princomp() function (from MASS package) on iris dataset. I have followed following steps:

 library(MASS)
 irispca<-princomp(iris[-5])
 summary(irispca)
 irispca$loadings

In order to calculate principal components, I have used output of loadings in this way:

 iris_temp2 <- iris
 iris_temp2$Comp.1 <- with(iris_temp2,Sepal.Length*0.361+Petal.Length*0.857+Petal.Width*0.358)
 iris_temp2$Comp.2 <- with(iris_temp2,Sepal.Length*(-0.657)+Sepal.Width*(-0.73)+Petal.Length*0.173)
 iris_temp2$Comp.3 <- with(iris_temp2,Sepal.Length*(-0.582)+Sepal.Width*0.598+Petal.Width*0.546)
 iris_temp2$Comp.4 <- with(iris_temp2,Sepal.Length*0.315+Sepal.Width*(-0.32)+Petal.Length*(-0.48)+Petal.Width*0.754)
 iris_temp2 <- with(iris_temp2, iris_temp2[order(Comp.1,Comp.2,Comp.3,Comp.4),])

At last, I sorted the dataset. I have also come to know that scores gives out the same above thing i.e. scores are calculated by multiplying scaled data (on which you run PCA) with loadings. Hence, I thought of comparing output of scores and output of iris_temp2 (featuring four components).

 iris_temp1 <- as.data.frame(irispca$scores)
 iris_temp1 <- with(iris_temp1, iris_temp1[order(Comp.1,Comp.2,Comp.3,Comp.4),])

However, when I do head(iris_temp1) and head(iris_temp2[,6:9]), the outputs do not match.

I would request you people to point out the reason behind this observation. Is there anything that I have misunderstood? If you need any other input from my end, please do let me know.

Reference materials that I have used are: http://yatani.jp/teaching/doku.php?id=hcistats:pca and https://www.youtube.com/watch?v=I5GxNzKLIoU&spfreload=5.

Thanks Shankar

I don't understand why are you trying manually recalculate principal components that are already given to you by `princomp()`. — mtoto, Mar 27 '16 at 10:35
Thank you for replying. I think, similar names are creating confusion here. Here, Comp.1 is nothing but a new variable that is created in dataset iris_temp2 using loadings. In this case, irispca$loadings will give required loadings output i.e. Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Sepal.Length 0.361 -0.657 -0.582 0.315 Sepal.Width -0.730 0.598 -0.320 Petal.Length 0.857 0.173 -0.480 Petal.Width 0.358 0.546 0.754 — skumar, Mar 27 '16 at 14:58
As it would be clear now, both Comp.1 are different. Hope this helps. Please let me know, if you still have any question. — skumar, Mar 27 '16 at 15:02

score 1 · Accepted Answer · answered Mar 27 '16 at 16:18

1

princomp does not reorder the data, each row is transformed to the scores, so there is no need to reorder the data when comparing. The scores involve both a demeaning of the data and a change of basis by the matrix of eigenvalues.

What this means is that firstly you need to demean your data, i.e.

library(MASS)
irispca<-princomp(iris[-5])

iris2 <- as.matrix(iris[-5])
iris2 <- sweep(iris2, MARGIN=2, irispca$center, FUN="-")

Then it is important to realize that the print method for princomp objects rounds values for display purpose

irispca$loadings

Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.361 -0.657  0.582  0.315
Sepal.Width         -0.730 -0.598 -0.320
Petal.Length  0.857  0.173        -0.480
Petal.Width   0.358        -0.546  0.754

But when we actually inspect one of the components we see its full values

irispca$loadings[,1]

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
  0.36138659  -0.08452251   0.85667061   0.35828920

Taking this into account we have

is1 <- list()
is1$Comp.1 <- iris2 %*% irispca$loadings[,1]
is1$Comp.2 <- iris2 %*% irispca$loadings[,2]
is1$Comp.3 <- iris2 %*% irispca$loadings[,3]
is1$Comp.4 <- iris2 %*% irispca$loadings[,4]
score1 <- as.data.frame(is1)

which gives

head(score1, 2)

Comp.1     Comp.2     Comp.3      Comp.4
-2.684126 -0.3193972 0.02791483 0.002262437
 2.714142  0.1770012 0.21046427 0.099026550


 head(irispca$scores, 2)
         Comp.1     Comp.2     Comp.3      Comp.4
 [1,] -2.684126 -0.3193972 0.02791483 0.002262437
 [2,] -2.714142  0.1770012 0.21046427 0.099026550

A final thing to note, which wasn't asked but can often cause confusion, is that if v is a principle component than -1 * v is also a principle component. Many algorithms for determining them do not explicitly impose an orientation. From the docs

The signs of the columns of the loadings and scores are arbitrary, and so may differ between different programs for PCA, and even between different builds of R.

answered Mar 27 '16 at 16:18

mgilbert

3,495
4
22
39

Thank you so much for sharing this elaborate answer. It is really helpful. I think, the mistake that I was making was that I was not subtracting mean from the values as you have done above using sweep(). I would keep a note that princomp does not reorder the order and that print method for princomp objects rounds values for purpose of display. Thank you again. – skumar Mar 27 '16 at 19:00
I am just wondering - what does irispca$scale mean? It gives out 1 for all 4 variables. What would happen if scale values are different from 1? In that case, will be required to descale the data too? Please share your inputs on these points as well. – skumar Mar 27 '16 at 19:08
Looking at the source code of `stats:::princomp.default` it appears this is the standard deviations used for scaling if `cor=TRUE`, otherwise this is 1. If the above answer resolved your question please consider accepting it. – mgilbert Mar 27 '16 at 19:32
Thank you again. Your inputs make sense. I am not sure - what do you mean when you say "If the above answer resolved your question please consider accepting it. ". I have tried to click on up arrow sign that signifies "this answer is useful". – skumar Mar 28 '16 at 09:37
Here is an overview of how to do this http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – mgilbert Mar 28 '16 at 10:24
Thank you again. There is a point on the shared link saying "You can upvote if you have gained the Vote Up privilege, awarded at 15 reputation.". Since I am not having 15 rep yet, appropriate result is not getting reflected. Hence, when I click on Vote Up, I get a message saying "Thanks for the feedback! Once you earn a total of 15 reputation, your votes will change the publicly displayed post score.". I think we have to vote before I get 15 :-(. – skumar Mar 28 '16 at 11:44
This is to upvote a question/answer that you think is good. However accepting an answer should be a privileged that requires no reputation. This is under the section in the linked page of "Accepting Answers: How does it work?" And involves clicking the check mark. In the link it is the portion with a picture demonstrating this – mgilbert Mar 28 '16 at 12:05
Thank you for patiently explaining. I think I have understood it. I have accepted the answer, finally :-). – skumar Mar 28 '16 at 12:48
Glad it was helpful. Welcome to SO. This is also a very good read for reference to future R questions http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – mgilbert Mar 28 '16 at 12:50

Calculating new variables from Principal Components - PCA in R

1 Answers1