I have a single-cell RNA-seq dataset. It is a matrix with 30,561 genes x 21,748 cells.
I need to subset the cells into clusters ranging 3500 to 100 cells (10 of the clusters have >1000 cells), and then calculate pairwise correlations between the cells in the cluster.
My current code is as follows:
## function definition
getPairwisePearsonCorrelation <- function(celltype) {
print(paste("Working on", celltype))
## subset cells of a cluster from original matrix
tmp <- subset(sc_matrix, idents = celltype)
expr <- tmp[["RNA"]]@counts
## remove genes with no expression in the cluster
zeros <- which(Matrix::rowSums(expr) == 0)
if (length(zeros) > 0) {
expr <- data.matrix(expr[-zeros, ])
}
## split the cluster into cells from young and old samples
old_r <-rownames(tmp@meta.data)[which(tmp@meta.data$age_3_bin == "old")]
young_r <-rownames(tmp@meta.data)[which(tmp@meta.data$age_3_bin == "young")]
## correlation of young and old separately
corr_y <- do.call(rbind,lapply(1:(length(young_r)-1), function(x) {
do.call(rbind,lapply( (x+1):length(young_r), function(y) {
cc<-cor( expr[,young_r[x]], expr[,young_r[y]],method="pearson" )
cbind.data.frame(
age="young",
celltype=celltype,
cell1=young_r[x],
cell2=young_r[y],
cor=cc
)
} ) )
} ))
corr_o <- do.call(rbind,lapply(1:(length(old_r)-1), function(x) {
do.call(rbind,lapply( (x+1):length(old_r), function(y) {
cc<-cor( expr[,old_r[x]], expr[,old_r[y]],method="pearson" )
cbind.data.frame(
age="old",
celltype=celltype,
cell1=young_r[x],
cell2=young_r[y],
cor=cc
)
} ) )
} ))
## return output
rbind.data.frame(
corr_y,
corr_o
)
}
## select only the first 3 clusters for test run
ctlist=unique(sc_matrix$cell_ontology_class)[1:3]
## run
pearson_pairwise_df_tmp <- do.call(
rbind, lapply(ctlist, function(x) getPairwisePearsonCorrelation(x)))
Running this code only for the first 3 cell clusters (with sizes after splitting: 3501, 2498, 1007, 536, 384, 346) took ~12 hours and generated a dataframe with 10,000,000 rows. Yes, this includes the two largest clusters.
I need to run this using all the clusters for 2 different cell cluster methods. And repeat the whole process in yet another matrix of similar size (28,846 genes x 13,594 cells). And probably repeat and re-test multiple other datasets.
Is there a more efficient way to do this?
Thanks in advance.