2

I'm trying to create score plots of the first two principal components. I begin by splitting the data into three data frames based on class. I then transform the data and perform PCA.

My data is as follows:

14      1   82.0 12.80   7.60   1070   105   400
14      1   82.0 11.00   9.00    830   145   402
14      1  223.6 17.90  10.35   2200   135   500
15      1  164.0 14.50   9.80   1946   138   500
15      1  119.0 12.90   7.90   1190   140   400
15      1   74.5  7.50   6.30    653   177   350
15      1   74.5 11.13   8.28    930   113   402
16      1  279.5 14.30   9.40   1575   230   700
16      1   82.0  7.80   6.70    676   175   525
16      1   67.0 11.00   8.30    920   106   300
16      2  112.0 11.70   8.00   1353   140   560
16      2  149.0 12.80   8.70   1550   170   550
16      2  119.0  8.50   7.40    888   175   250
16      2  119.0 13.30   9.60   1275   157   450
16      2  238.5 14.90   8.90   1537   183   700
16      2  205.0 12.00   7.90   1292   201   600
16      2   82.0  9.40   6.20    611   209   175
16      2  119.0 15.95  10.25   1350   145   450
16      2  194.0 16.74  10.77   1700   120   450
17      2  336.0 22.20  10.90   3312   135   450
17      3  558.9 23.40  12.60   4920   152   600
17      3  287.0 14.30   9.40   1510   176   800
17      3  388.0 23.72  11.86   3625   140   500
17      3  164.0 11.90   9.80    900   190   600
17      3  194.0 14.40   9.20   1665   175   600
17      3  194.0 14.40   8.90   1640   175   600
17      3  186.3  9.70   8.00   1081   205   600
17      3  119.0  8.00   6.50    625   196   400
17      3  119.0  9.40   6.95    932   165   250
17      3   89.4 14.55   9.83   1378   146   400

Column 1: type, Column 2: class, Column 3: v1, Column 4: v2, Column 5: v3, Column 6: v4, Column 7: v5, Column 8: v6

My code is as follows:

data <- read.csv("data.csv")
result <- split(data, data$class);

data1 <- result[[1]][,3:8];
data1Logged <- log10(data1)
pca.data1Logged = prcomp( ~ v1 + 
                         v2 + 
                         v3 + 
                         v4 + 
                         v5 + 
                         v6, 
                       data = data1Logged, scale. = FALSE );

data2 <- result[[2]][,3:8];
data2Logged <- log10(data2)
pca.data2Logged = prcomp( ~ v1 + 
                         v2 + 
                         v3 + 
                         v4 + 
                         v5 + 
                         v6, 
                       data = data2Logged, scale. = FALSE );

data3 <- result[[3]][,3:8];
data3Logged <- log10(data3)
pca.data3Logged = prcomp( ~ v1 + 
                         v2 + 
                         v3 + 
                         v4 + 
                         v5 + 
                         v6, 
                       data = data3Logged, scale. = FALSE );

For each of the three class, I want to have a score plot for PC1 and PC2:

pca.data1Logged$x[,1:2]
pca.data2Logged$x[,1:2]
pca.data3Logged$x[,1:2]

This is the best I could figure out:

opar <- par(mfrow = c(1,3))
plot(pca.data1Logged$x[,1:2])
plot(pca.data2Logged$x[,1:2])
plot(pca.data3Logged$x[,1:2])
par(opar)

But I would like this plot to be scaled, coloured, superimposed, etc. I have started reading about ggplot, but I don't have the experience to do this. I would like something like the following:

enter image description here

https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html

The problem with the above is that I have broken the data into 3 separate data frames, so there are no headings for "class1", "class2, "class3".

UseR10085
  • 7,120
  • 3
  • 24
  • 54
Wyuw
  • 123
  • 5
  • For us to be able to help you, you should provide your data, help: [how-to-make-a-great-r-reproducible-example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). – jay.sf Aug 18 '20 at 05:29
  • 1
    @jay.sf One moment, I will try. The original data is too big, so I will have to make it shorter. – Wyuw Aug 18 '20 at 05:30
  • @jay.sf Is this ok? – Wyuw Aug 18 '20 at 05:39
  • the figure you linked is just `pr <- prcomp(iris[1:4], scale. = TRUE); plot(pr$x[, 1:2], col = iris$Species)`, so you can do something similar with yours: `pr <- prcomp(log10(data[, -(1:2)])); plot(pr$x[, 1:2], col = data$class)` – rawr Aug 18 '20 at 05:48
  • or by kmeans `km <- kmeans(pr$x[, 1:2], centers = 3); points(pr$x[, 1:2], col = km$cluster)` – rawr Aug 18 '20 at 05:51
  • @rawr Ahh! But how do we label the coloured dots? – Wyuw Aug 18 '20 at 05:53
  • the same way you color them. you didn't mention anything about labels `text(pr$x[, 1:2], labels = ifelse(iris$Sepal.Width > 3.5, rownames(iris), ''))` – rawr Aug 18 '20 at 05:56
  • @rawr it is in the example image. We need labels; otherwise, how do we know what is what? – Wyuw Aug 18 '20 at 05:57
  • oh you mean a `?legend` – rawr Aug 18 '20 at 05:58
  • @Wyuw I used your data in the way you provided it this time. However, for the future please really consider to use the methods in the link I sent you in my last comment, especially `dput`. The reason why this is important is, that we have no idea how the `str`ucture of your data might look like, and `dput` already includes it. – jay.sf Aug 18 '20 at 05:59
  • @rawr I just meant the labels for `class1`, `class2`, `class3` – Wyuw Aug 18 '20 at 06:04
  • 1
    @jay.sf Ok, thank you. I will remember. – Wyuw Aug 18 '20 at 06:04

2 Answers2

3

You can use factoextra and FactoMineR like

library("factoextra")
library("FactoMineR")

#PCA analysis
df.pca <- PCA(df[,-c(1,2)], graph = T)
# Visualize
# Use habillage to specify groups for coloring
fviz_pca_ind(df.pca,
             label = "none", # hide individual labels
             habillage = as.factor(df$class), # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE # Concentration ellipses, legend.title = "Class")

enter image description here

You can change Dim1 and 2 to PC1 and 2 manually. For that, you can note the value of "Dim1 (63.9%)" and "Dim2 (23.3%)" from this plot and use the following code to change the Dim1 and 2 to PC1 and 2 like

fviz_pca_ind(df.pca,
             label = "none", # hide individual labels
             habillage = as.factor(df$class), # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             xlab = "PC1 (63.9%)", ylab = "PC2 (23.3%)", legend.title = "Class")

enter image description here

If you want to log transform the data, then you can use

df[,3:8] <- log10(df[,3:8]) 

df.pca <- PCA(df, graph = T)

fviz_pca_ind(df.pca,
             label = "none", # hide individual labels
             habillage = as.factor(df$class), # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
legend.title = "Class")

To change Dim1 and 2 to PC1 and 2 manually, you can use the following code

fviz_pca_ind(df.pca,
             label = "none", # hide individual labels
             habillage = as.factor(df$class), # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             xlab = "PC1 (64.9%)", ylab = "PC2 (22.6%)", legend.title = "Class")

Data

df =
structure(list(Type = c(14L, 14L, 14L, 15L, 15L, 15L, 15L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 17L, 17L, 
17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L), class = c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), v1 = c(82, 82, 
223.6, 164, 119, 74.5, 74.5, 279.5, 82, 67, 112, 149, 119, 119, 
238.5, 205, 82, 119, 194, 336, 558.9, 287, 388, 164, 194, 194, 
186.3, 119, 119, 89.4), v2 = c(12.8, 11, 17.9, 14.5, 12.9, 7.5, 
11.13, 14.3, 7.8, 11, 11.7, 12.8, 8.5, 13.3, 14.9, 12, 9.4, 15.95, 
16.74, 22.2, 23.4, 14.3, 23.72, 11.9, 14.4, 14.4, 9.7, 8, 9.4, 
14.55), v3 = c(7.6, 9, 10.35, 9.8, 7.9, 6.3, 8.28, 9.4, 6.7, 
8.3, 8, 8.7, 7.4, 9.6, 8.9, 7.9, 6.2, 10.25, 10.77, 10.9, 12.6, 
9.4, 11.86, 9.8, 9.2, 8.9, 8, 6.5, 6.95, 9.83), v4 = c(1070L, 
830L, 2200L, 1946L, 1190L, 653L, 930L, 1575L, 676L, 920L, 1353L, 
1550L, 888L, 1275L, 1537L, 1292L, 611L, 1350L, 1700L, 3312L, 
4920L, 1510L, 3625L, 900L, 1665L, 1640L, 1081L, 625L, 932L, 1378L
), v5 = c(105L, 145L, 135L, 138L, 140L, 177L, 113L, 230L, 175L, 
106L, 140L, 170L, 175L, 157L, 183L, 201L, 209L, 145L, 120L, 135L, 
152L, 176L, 140L, 190L, 175L, 175L, 205L, 196L, 165L, 146L), 
    v6 = c(400L, 402L, 500L, 500L, 400L, 350L, 402L, 700L, 525L, 
    300L, 560L, 550L, 250L, 450L, 700L, 600L, 175L, 450L, 450L, 
    450L, 600L, 800L, 500L, 600L, 600L, 600L, 600L, 400L, 250L, 
    400L)), class = "data.frame", row.names = c(NA, -30L))
UseR10085
  • 7,120
  • 3
  • 24
  • 54
  • Wow! That looks very attractive! Can we change it from Dim1 and Dim2 to PC1 and PC2? – Wyuw Aug 18 '20 at 05:56
  • Thanks! How do I log the data in your example? I needs columns 3:8 log10. – Wyuw Aug 18 '20 at 06:13
  • So I did this to log10 the data: `data[,3:8] <- log10(data[,3:8])` And I get this: `> data.pca <- PCA(data[,-c(1,2)], graph = T)` `the condition has length > 1 and only the first element will be used` Is this ok, or did something go wrong? – Wyuw Aug 18 '20 at 06:17
  • But that eliminates `type` and `class` columns – Wyuw Aug 18 '20 at 06:25
  • You were correct. Fixed the problem and updated the answer. Restart your R session and rerun the code. Rerunning the code without restarting R session replaces original data with transformed data. – UseR10085 Aug 18 '20 at 06:32
  • I ony want `class`. But my point was that `aircraft_log <- log10(aircraft[,3:8])` eliminates the column with this information. – Wyuw Aug 18 '20 at 06:35
  • See my answer now, I have used what you have suggested. But the problem with your approach is that this `data[,3:8] <- log10(data[,3:8])` code replaces the original data with log-transformed data. So, rerunning the code without restarting R session may give you error. – UseR10085 Aug 18 '20 at 06:39
  • I see. So is it ok to use `PCA()` instead of `prcomp()`, as you did here? Also, is it possible to change where it says "Groups" to `class`? – Wyuw Aug 18 '20 at 06:40
  • There is no problem in using `PCA()` instead of `prcomp()`. Basically, they give same result. If you want to use `prcomp()`, then just replace `df.pca <- PCA(df[,3:8], graph = T)` with `df.pca <- prcomp(df[, 3:8], scale. = T)`. The `PCA()` comment does the scaling by default. – UseR10085 Aug 18 '20 at 06:53
  • For changing the "Groups" to `class`, use `, legend.title = "Class"`. See my updated answer. Don't forget to accept it as [answer](https://stackoverflow.com/help/someone-answers). – UseR10085 Aug 18 '20 at 07:13
  • Thanks for that. I get this error: `Error in scale.default(x, center = center, scale = scale.) : length of 'scale' must equal the number of columns of 'x'` – Wyuw Aug 18 '20 at 12:07
  • I can't reproduce your error with the provided dataset. So, it is very difficult to help you out. – UseR10085 Aug 18 '20 at 18:26
2

You could rbind the separate results and add a color column that you use in plot.

rb <- rbind(cbind(pca.data1Logged$x[,1:2], d=2),
            cbind(pca.data2Logged$x[,1:2], d=3),
            cbind(pca.data3Logged$x[,1:2], d=4))

plot(rb, col=rb[,"d"], pch=20, main="PCA Plot")
legend("bottomleft", paste("data", 1:3), col=2:4, pch=20)

enter image description here


Data:

data <- read.table(header=F, text="14      1   82.0 12.80   7.60   1070   105   400
14      1   82.0 11.00   9.00    830   145   402
14      1  223.6 17.90  10.35   2200   135   500
15      1  164.0 14.50   9.80   1946   138   500
15      1  119.0 12.90   7.90   1190   140   400
15      1   74.5  7.50   6.30    653   177   350
15      1   74.5 11.13   8.28    930   113   402
16      1  279.5 14.30   9.40   1575   230   700
16      1   82.0  7.80   6.70    676   175   525
16      1   67.0 11.00   8.30    920   106   300
16      2  112.0 11.70   8.00   1353   140   560
16      2  149.0 12.80   8.70   1550   170   550
16      2  119.0  8.50   7.40    888   175   250
16      2  119.0 13.30   9.60   1275   157   450
16      2  238.5 14.90   8.90   1537   183   700
16      2  205.0 12.00   7.90   1292   201   600
16      2   82.0  9.40   6.20    611   209   175
16      2  119.0 15.95  10.25   1350   145   450
16      2  194.0 16.74  10.77   1700   120   450
17      2  336.0 22.20  10.90   3312   135   450
17      3  558.9 23.40  12.60   4920   152   600
17      3  287.0 14.30   9.40   1510   176   800
17      3  388.0 23.72  11.86   3625   140   500
17      3  164.0 11.90   9.80    900   190   600
17      3  194.0 14.40   9.20   1665   175   600
17      3  194.0 14.40   8.90   1640   175   600
17      3  186.3  9.70   8.00   1081   205   600
17      3  119.0  8.00   6.50    625   196   400
17      3  119.0  9.40   6.95    932   165   250
17      3   89.4 14.55   9.83   1378   146   400")

names(data) <- c("sth", "class", paste0("v", 1:6))
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • Thank you! I get this error for `legend`: `Error in strwidth(legend, units = "user", cex = cex, font = text.font) : plot.new has not been called yet` – Wyuw Aug 18 '20 at 06:00
  • @Wyuw Probably your plotting device is just occupied from your last plot, just close it. Or run a fresh R session. – jay.sf Aug 18 '20 at 06:04