Most efficient way in base R to do pairwise correlations between thousands of columns in a matrix

Question

I have a single-cell RNA-seq dataset. It is a matrix with 30,561 genes x 21,748 cells.

I need to subset the cells into clusters ranging 3500 to 100 cells (10 of the clusters have >1000 cells), and then calculate pairwise correlations between the cells in the cluster.

My current code is as follows:

## function definition
getPairwisePearsonCorrelation <- function(celltype) {
    print(paste("Working on", celltype))
    ## subset cells of a cluster from original matrix
    tmp <- subset(sc_matrix, idents = celltype)
    expr <- tmp[["RNA"]]@counts
    ## remove genes with no expression in the cluster
    zeros <- which(Matrix::rowSums(expr) == 0)
    if (length(zeros) > 0) {
      expr <- data.matrix(expr[-zeros, ])
    }
    ## split the cluster into cells from young and old samples
    old_r <-rownames(tmp@meta.data)[which(tmp@meta.data$age_3_bin == "old")]
    young_r <-rownames(tmp@meta.data)[which(tmp@meta.data$age_3_bin == "young")]
    
    ## correlation of young and old separately
    corr_y <- do.call(rbind,lapply(1:(length(young_r)-1), function(x) {  
                   do.call(rbind,lapply( (x+1):length(young_r), function(y) { 
                          cc<-cor( expr[,young_r[x]], expr[,young_r[y]],method="pearson" )
                          cbind.data.frame(
                            age="young",
                            celltype=celltype,
                            cell1=young_r[x],
                            cell2=young_r[y],
                            cor=cc
                          )
                      }   ) )
                  } ))
    corr_o <- do.call(rbind,lapply(1:(length(old_r)-1), function(x) {  
      do.call(rbind,lapply( (x+1):length(old_r), function(y) { 
        cc<-cor( expr[,old_r[x]], expr[,old_r[y]],method="pearson" )
        cbind.data.frame(
          age="old",
          celltype=celltype,
          cell1=young_r[x],
          cell2=young_r[y],
          cor=cc
        )
      }   ) )
    } ))
    
    ## return output
    rbind.data.frame(
      corr_y,
      corr_o
    )
  }

## select only the first 3 clusters for test run
ctlist=unique(sc_matrix$cell_ontology_class)[1:3]

## run
pearson_pairwise_df_tmp <- do.call(
                      rbind,  lapply(ctlist, function(x) getPairwisePearsonCorrelation(x)))

Running this code only for the first 3 cell clusters (with sizes after splitting: 3501, 2498, 1007, 536, 384, 346) took ~12 hours and generated a dataframe with 10,000,000 rows. Yes, this includes the two largest clusters.

I need to run this using all the clusters for 2 different cell cluster methods. And repeat the whole process in yet another matrix of similar size (28,846 genes x 13,594 cells). And probably repeat and re-test multiple other datasets.

Is there a more efficient way to do this?

Thanks in advance.

Cor is already vectorized e.g. `cor(dplyr::select(iris, -Species), method = c("pearson", "kendall", "spearman"))` gives you all pairs. Do `mcapply::mclapply` of this to each cluster. Instead of a nested lapply, use can use just combn, because cor matrices are symmetric. You do not need to do cor(b,a) if you already did cor(a,b) — danlooo, Feb 11 '22 at 12:36
Thanks. I don't get the part of how to use 'combn'. I tried to run it with: ```m_corr_y <- cor( expr[,young_r], method="pearson" ) above <- row(m_corr_y) <= col(m_corr_y) m_corr_y[above] <- NA df_y <- na.exclude(reshape2::melt(m_corr_y)) colnames(df_y) <- c("cell1","cell2","cor") df_y$age <- "young" df_y$celltype <- celltype``` and it seems it works fine. However, I don't know how efficient all this will be when running it in 1000+ cell clusters. — dkysh, Feb 11 '22 at 14:29
Nested lapply does this many comparisons: `expand.grid(letters[1:3], letters[1:3])`. But you only need this many: `t(combn(letters[1:3], m = 2))` — danlooo, Feb 11 '22 at 14:30
Thanks! Right now my code from my comment above sets the upper triangle of the matrix to NA, and uses reshape2::melt to convert it to a data.frame (after passing through na.exclude). combn seems pretty useful, specially when I'm comparing 3500x3500 cells, as it won't need to process the empty half of the gigantic matrix. However, when I try to use it ```m_corr_o[ t(combn(old_r, m = 2)) ]``` it returns just an array with no column and row names, and I cannot use that to generate my output dataframe. Do you have any insight on that? — dkysh, Feb 11 '22 at 23:36

Most efficient way in base R to do pairwise correlations between thousands of columns in a matrix

0 Answers0