1

I am asking a question to a similar post posted up 2 years ago, with no full answer to it (subset of prcomp object in R). P.S. sorry for commenting on it for an answer..

Basically, my question is the same. I have generated a PCA table using prcomp that has 10000+ genes, and 1700+ cells, made up of 7 timepoints. Plotting all of them in a single file makes it difficult to see.

I would like to plot each timepoint separately, using the same PCA results table (ie without re-running prcomp).

Thanks Dean for giving me tips on posting. To think of a way to describe my dataset without actually loading it here, will take me a week I believe. I also tried the

dput(droplevels(head(object,2))) 

option, but it was just too much info since I have such a large dataset. In short, it is a large matrix of single-cell dataset where people can commonly see on packages such as Seurat (https://satijalab.org/seurat/pbmc3k_tutorial_1_4.html). EDIT: I have posted a screenshot of a subset of my matrix here (enter image description here).

Sorry I don't know how to re-create this or even export a text format.. But this is what I can provide: My TPM matrix has 16541 rows (defining genes), and 1798 columns (defining cells).

In it, I have "re-labelled" my columns based on timepoints, using codes such as:

D0<-c(colnames(TPM[,grep("20180419-24837-1-*", colnames(TPM))])) #D0: 286 cells

D7<-c(colnames(TPM[,grep("20180419-24837-2-*", colnames(TPM))])) #D7: 237 cells

D10<-c(colnames(TPM[,grep("20180419-24947-5-*", colnames(TPM))])) #D10: 304 cells

...... and I continued to label each timepoint.

Each timepoint was also given a specific colour.

rc<-rep("white", ncol(TPM))

rc<-[,grep("20180419-24837-1-*", colnames(TPM))]= "magenta"

...... and I continued to give colour to each timepoint.

I performed a PCA using this code:

pcaRes<-prcomp(t(log(TPM+1)), center= TRUE, scale. = TRUE)

Then I proceeded to plot a PCA plot using:

 plot(pcaRes$x[,1], pcaRes$x[,2], xlab="PC1", ylab="PC2",
 cex=1.0, col= rc, pch=16, main="")

Then I when I wanted to plot a PCA plot only with D0, using the same PCA output (pcaRes).. This is where I am stuck.

P.S. If anyone else has an easier way of advising how to input an example data here from my large matrix, I welcome any help. Thanks so much! Sorry I am very new in bioinformatics.

zx8754
  • 52,746
  • 12
  • 114
  • 209
MeganS92
  • 13
  • 3
  • Hi @MeganS92, thanks for the question. Rather than describing how your matricies are structured, are you able to present a small, self-contained example? Either using a built in R data set, of copy paste some code to create one? This will help others copy/paste and present back a solution. Also check out this link for more info: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Dean Sep 03 '18 at 05:41
  • Hey @MeganS92, if the thing you are trying to do doesn't depend on your specific bioinformatics data. You can type `?princomp` or `?prcomp` and look in the examples section of the help guide. Often they have an internal R datasets like `USArrests` or `stackloss` to showcase the functions. Can you reproduce/fudge these datasets? i.e. make up some fake timepoints. – Dean Sep 03 '18 at 06:27
  • Thanks Dean for this. I have looked up both of the datasets that you suggested, but I don't think either fully describes my dataset. I think the easiest would be to look at some fake single-cell dataset like the one from the Seurat tutorial. However, in saying that, it could just be that my level of bioinformatics is too low and so don't fully know how to provide this dataset. I would love to reproduce this dataset, which is what I was trying to do, but honestly, it would take me another day to figure out how to even do that easily. Sorry. I hope the screenshot of my data gives some hint. – MeganS92 Sep 03 '18 at 06:39
  • Out of curiosity, did you read the data in this this: pbmc.data <- Read10X(data.dir = "hg19/") pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, project = "10X_PBMC") – tcratius Sep 03 '18 at 08:05
  • Hi @ConradThiele, I needed to first download the data via a download link, and untar it since it's a zip file. After doing that, I loaded it pretty much the way that you did. Could I just clarify, if this question is related to my original question regarding prcomp, this is NOT the ACTUAL dataset that I am using. But, it's close enough in some ways... – MeganS92 Sep 03 '18 at 23:17
  • Oh course it's related to the original problem, I come on here to help solve problems, cause I'm weird, like the rest of the people that do this. In order to help I have to learn the code you use and vaguely understand the techniques behind what your are doing. Are you able to plot this: VlnPlot(object = your.date, features.plot = your.date$D0, nCol = 3). It might be just D0 not your.date$D0 but try both. As you saw, this is not a common question on stackoverflow, hence most have not encountered it, yet someone must know. Worst case scenario is to reach out to Seurat himself. – tcratius Sep 04 '18 at 01:03
  • Sorry Conrad. I did not mean to be rude or ignorant. I just wanted to be sure. Could i clarify that when I am plotting my data on Seurat, I have no issues in terms of trying to sub-plot PCA plots. i simply use codes such as : object_subset<-SubsetData(object,cells.use=rownames(object@meta.data[object@meta.data$timepoint %in% c("D0", "D7"),]))) And then I just plot the pCA plot using the object subset that I just created. Bearing in mind that I did not re-run my original PCA, then this would allow me to just plot the subset of data. – MeganS92 Sep 04 '18 at 02:45
  • However, my question is when I am trying to do this OUTSIDE of Seurat. Just using prcomp to perform pCA, and then plotting it using either the "plot" function or "ggbiplot", i am unable to do so. I have reached out to Seurat to ask for other Seurat-related questions, but I do not think this is one. You may ask why am i doing this outside of Seurat? Because my PCA looks totally different with Seurat and using prcomp, and I would like to properly compare. Thanks so much! – MeganS92 Sep 04 '18 at 02:50
  • Ah...sorry for the late reply, neighbours kept me up most of the weekend. And no bother, you are fine. The dataset is transformed into a specific data structure which is of class S4. Where as, most data used for plotting statistics for say a business, surveys etc, are in an class S3 either data.table or data.frame similar to what you would find in an Excel spread sheet, 2 dimensional, points x and y. Conversely, data like Spectral and Seurat's is layered differently which make it hard to use anything but the specific functions made for accessing these layers. I try to do eg later :) – tcratius Sep 04 '18 at 08:10

2 Answers2

0

Stack Exchange for Bioinformatics is where you you will need to go to ask question(s) or learn about the package(s) and function(s) you need to deal with you area of specialty. Stack Exchange for Bioinformatics is linked with Stackoverflow so you will just need to join, you'll have the same login.

Classes S3, S4 and Base.

This Very basic over view of Classes in R. Think of a Class as the parent you inherit all of their skills or abilities from and as a result you are able to achieve certain tasks better than others and some cases, you will not be able to do the task at all.

In R and all programming, to save re-inventing the wheel, parent classes are created so that the average person does not have to repeatedly write a function to do something simple like plot() a graph. This stuff is hidden, to access it, you inherit from the parent. The child reads the traits off the parent(s), and then it either performs the task or gives you a cryptic error message.

Base and S3 classes work well together, they are like the working class people of the R world. S4 is a specialized class made for specific fields of study to be able to provide specific functionality needed in their industry. This mean you can only use certain Base and S3 functions with Class S4 functions, most are just not compatible. So it's nothing you've done wrong, plot() and ggplot() just have the wrong parent(s) to work with your dataset.

Typical Base and S3 Class dataframe: Box like structure. Along the left hand side is all the column names, nice and neatly stacked on top of each other.

Typical Base and S3 Class data frame

Seurat S4 Class dataframe: Tree like structure, formatted to be read by a specific function(s).

enter image description here

Well hope that helps and I wish you well in your career. Cheers Conrad

Ps if this helps, then click the arrow up. :)

tcratius
  • 525
  • 6
  • 15
0

thanks @ConradThiele for your suggestion, I will check out that site.

I had a chat with other bioinformatics around the institute. My query has little to do with the object being an S4 class, since I am performing prcomp outside of the package. I have extracted my matrix out of the object and then ran prcomp on it.

Solution is simple: run prcomp with full dataset, transform the prcomp output into a dataframe, input additional columns to input additional details like "timepoint", create new dataframe(s) only with the "timepoint"/ "variable" of interest from the prcomp result, make multiple sub-dataframe and then plotting these using "plot" or whatever function you use.

This was not my solution but from a bioinformatition I went for help to in my institute. Hope this helps others! Thanks again for your time.

P.S. If I have the time, I will post a copy of the code I suggested soon.

MeganS92
  • 13
  • 3