
I have some biological data that looks like this, with 2 different types of clusters (A and B):

                    Cluster_ID       A1       A2       A3       B1        B2       B3
5  chr5:100947454..100947489,+  3.31322  7.52365  3.67255 21.15730  8.732710 17.42640
12 chr5:101227760..101227782,+  1.48223  3.76182  5.11534 15.71680  4.426170 13.43560
29 chr5:102236093..102236457,+ 15.60700 10.38260 12.46040  6.85094 15.551400  7.18341

I clean up the data:

CAGE<-read.table("CAGE_expression_matrix.txt", header=T)
CAGE_data <- as.data.frame(CAGE)

#Remove clusters with 0 expression for all 6 samples
CAGE_filter <- CAGE[rowSums(abs(CAGE[,2:7]))>0,]

#Filter whole file to keep only clusters with at least 5 TPM in at least 3 files
CAGE_filter_more <- CAGE_filter[apply(CAGE_filter[,2:7] >= 5,1,sum) >= 3,]
CAGE_data <- as.data.frame(CAGE_filter_more)

The data size is reduced from 6981 clusters to 599 after this.

I then go on to apply PCA:

#Get data dimensions

dim(CAGE_data)
PCA.CAGE<-prcomp(CAGE_data[,2:7], scale.=TRUE) 
summary(PCA.CAGE)

I want to create a PCA plot of the data that marks each sample and colors it by type (A or B). The plot should therefore use two colors, with a text label for each sample.

This is what I have tried, without success:

qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))

ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()

qplot(PCA.CAGE[1:3], PCA.CAGE[4:6], label=colnames(PC1, PC2, PC3), geom=c("point", "text"))

These are the errors I get:

  > qplot(PCA.CAGE$x[,1:3],PCA.CAGE$x[4:6,], xlab="Data 1", ylab="Data 2")

  Error: Aesthetics must either be length one, or the same length as the dataProblems:PCA.CAGE$x[4:6, ]

  > qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data,    data=as.data.frame(PCA.CAGE$x))

  Don't know how to automatically pick scale for object of type data.frame.   Defaulting to continuous
  Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
  Error: Aesthetics must either be length one, or the same length as the dataProblems:CAGE_data, CAGE_data

 > ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more,      label=CAGE_filter_more)) + geom_point() + geom_text()

 Error: ggplot2 doesn't know how to deal with data of class 
Paul
espop23
  • What error are you getting? – Señor O Aug 27 '15 at 22:02
  • Edited above to show you! – espop23 Aug 27 '15 at 22:10
  • I never use qplot, but it seems pretty clear that the error you're getting from the last function is that PCA.CAGE is not a data.frame – Señor O Aug 27 '15 at 22:25
  • I set it to be a data frame at the beginning... do you have another suggestion for making the PCA plot in R? – espop23 Aug 27 '15 at 22:27
  • You did not set PCA.CAGE to data.frame at any point – Señor O Aug 27 '15 at 22:29
  • I did for the data used to make the PCA. I have now tried `PCA_data <- as.data.frame(PCA.CAGE)` and it tells me: Error in as.data.frame.default(PCA.CAGE) : cannot coerce class "prcomp" to a data.frame – espop23 Aug 27 '15 at 22:36
  • "I did for the data used to make the PCA" - Why would that matter if it's not what you're plotting? – Señor O Aug 27 '15 at 22:39
  • You need to get the data you want to plot into a data.frame format. Examine that data structure and read the documentation. Post a separate question if you have trouble doing that. – Señor O Aug 27 '15 at 22:40
  • `ggfortify` : https://github.com/sinhrks/ggfortify/ (github only install) has a `fortify` for PCA objects BUT you can also `autoplot(PCA.CAGE)` once you load that package and it does some ggplot magic for you. – hrbrmstr Aug 28 '15 at 01:13
  • I tried autoplot as you suggested, and it gave me the 'Error in UseMethod("autoplot") : no applicable method for 'autoplot' applied to an object of class "prcomp"' Do you know why this is? – espop23 Aug 28 '15 at 07:46
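
The coercion error quoted in the comments arises because a "prcomp" object is a list rather than a table; the values to plot sit in its `x` element. A minimal sketch of pulling them into a data frame, using the built-in iris data as a stand-in for the CAGE matrix (an assumption, since the original file is not available here):

```r
# Built-in iris data stands in for the CAGE matrix (assumption for illustration)
pca <- prcomp(iris[, 1:4], scale. = TRUE)

class(pca)   # "prcomp": a list, which as.data.frame() cannot coerce directly
names(pca)   # the list elements; the scores are stored in "x"

# pca$x is an ordinary matrix: one row per observation, one column per PC
scores <- as.data.frame(pca$x)
dim(scores)
head(scores[, c("PC1", "PC2")])
```

A data frame like `scores` is the kind of object that `ggplot(data = ...)` or `qplot(..., data = ...)` expects.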

1 Answer


Your question doesn't make sense (to me at least). You seem to have two groups of 3 variables (the A group and the B group). When you run PCA on these 6 variables, you'll get 6 principal components, each of which is a (different) linear combination of all 6 variables. Clustering is based on the cases (rows). If you want to cluster the data based on the first two PCs (a common approach), then you need to do that explicitly. Here's an example using the built-in iris dataset.

pca   <- prcomp(iris[,1:4], scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=3)$cluster
library(ggbiplot)
ggbiplot(pca, groups=factor(clust)) + xlim(-3,3)

So here we run PCA on the first 4 columns of iris. Then, pca$x is a matrix containing the principal components in the columns. Next we run k-means clustering on the first 2 PCs and extract the cluster numbers into clust. Finally we use ggbiplot(...) to make the plot.
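
If ggbiplot is not installed, roughly the same scatter of the first two PCs can be drawn with plain ggplot2. This is only a sketch of an alternative, not the method used above: it omits the variable arrows that a biplot draws, and the `scores` data frame and the fixed seed are additions for illustration.

```r
library(ggplot2)

set.seed(1)  # k-means starts from random centres; fix the seed for repeatability
pca   <- prcomp(iris[, 1:4], scale. = TRUE)
clust <- kmeans(pca$x[, 1:2], centers = 3)$cluster

# Collect the first two PCs and the cluster labels in one data frame
scores <- data.frame(pca$x[, 1:2], cluster = factor(clust))

# One point per row of iris, coloured by its k-means cluster
p <- ggplot(scores, aes(x = PC1, y = PC2, colour = cluster)) +
  geom_point()
print(p)
```

Mapping `colour` to a factor column of the plotted data frame is what produces discrete colours; passing a whole data frame as the colour, as in the question, triggers the "Don't know how to automatically pick scale" message.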

jlhoward
  • How did you get the pca$x matrix? – espop23 Aug 28 '15 at 07:34
  • I don't understand. `prcomp(...)` returns a "prcomp" object, which is a named list. One of the elements, `x`, is a matrix containing the principal components. Type `str(pca)`. – jlhoward Aug 28 '15 at 16:27
  • Thanks, I have come up with a plot using: 'PCA.CAGE<-prcomp(CAGE[,2:7], scale.=TRUE) summary(PCA.CAGE) qplot(PC1, PC2, data=as.data.frame(PCA.CAGE$x, geom = c("text")))' However, I am struggling with colouring it and getting the labels to show. When I add colors, the plot remains black. Do you know why this may be? – espop23 Aug 28 '15 at 16:45
  • This code should not run: the "text" geometry in ggplot plots labels instead of points, so you have to specify what to use for the labels. I suggest you read a tutorial on ggplot, perhaps [this one](http://blog.echen.me/2012/01/17/quick-introduction-to-ggplot2/). – jlhoward Aug 28 '15 at 17:04
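
For the follow-up about colours and labels, one possible reading is that the plot should show one point per sample (A1 to B3) rather than one per cluster. `prcomp` treats rows as observations, so under that assumption the matrix has to be transposed first. The sketch below uses only the three example rows shown in the question; the names `expr` and `scores` are made up for illustration.

```r
library(ggplot2)

# The three example rows from the question; columns are the samples
expr <- rbind(
  c(3.31322,  7.52365,  3.67255,  21.15730, 8.732710,  17.42640),
  c(1.48223,  3.76182,  5.11534,  15.71680, 4.426170,  13.43560),
  c(15.60700, 10.38260, 12.46040, 6.85094,  15.551400, 7.18341)
)
colnames(expr) <- c("A1", "A2", "A3", "B1", "B2", "B3")

# Transpose so that each sample becomes a row (observation)
pca <- prcomp(t(expr), scale. = TRUE)

scores <- data.frame(
  pca$x[, 1:2],
  sample = rownames(pca$x),
  type   = substr(rownames(pca$x), 1, 1)  # "A" or "B", taken from the sample name
)

# geom_text needs a label aesthetic; colour is mapped to the sample type
p <- ggplot(scores, aes(x = PC1, y = PC2, colour = type, label = sample)) +
  geom_point() +
  geom_text(vjust = -0.8, show.legend = FALSE)
print(p)
```

With the full data, `t(CAGE_data[, 2:7])` would take the place of `t(expr)`. The key points are that `label` and `colour` each refer to a column of the data frame being plotted, with one value per point.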