I'm analyzing gene expression data from a large experiment (12400 single cells and 23800 genes) and I'm running into an efficiency problem. I'll include a reproducible example below, but my problem is the following:
I converted the mouse genes in my dataset to their human counterparts so that I can compare with previously published data. In some cases there are multiple matches (one human gene maps to more than one mouse gene). In those cases I'd like to average the expression values of the multiple mouse genes and end up with a single expression value for the human counterpart. I can achieve this by converting my expression data to matrix format (which allows duplicate row names) and applying the aggregate()
function, but it takes a VERY long time on the large dataset. It's difficult to reproduce the exact situation here, but my mock analysis pipeline is below:
data <- as.matrix(data.frame(cell1 = c(1, 1, 1, 1, 3, 3),
                             cell2 = c(1, 2, 4, 10, 5, 10),
                             cell3 = c(0, 0, 0, 1, 10, 20),
                             cell4 = c(1, 3, 4, 4, 20, 20)))
# Adding gene names as rownames
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
# Mock gene expression matrix
# Columns indicate expression values from individual cells
# Rows indicate genes
data
#>      cell1 cell2 cell3 cell4
#> ABC1     1     1     0     1
#> ABC2     1     2     0     3
#> ABC2     1     4     0     4
#> ABC4     1    10     1     4
#> ABC5     3     5    10    20
#> ABC5     3    10    20    20
# Averaging gene expression values where there are multiple measurements for the same gene
aggr_data <- aggregate(data, by = list(rownames(data)), FUN = mean)
# End result I'm trying to achieve
aggr_data
#>   Group.1 cell1 cell2 cell3 cell4
#> 1    ABC1     1   1.0     0   1.0
#> 2    ABC2     1   3.0     0   3.5
#> 3    ABC4     1  10.0     1   4.0
#> 4    ABC5     3   7.5    15  20.0
Is there a more efficient way of doing this?
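One direction I've been wondering about is base rowsum(), which computes column sums within groups of rows (and, as far as I know, does so in compiled code); dividing those sums by the per-group counts should give the same means. This is just a sketch of what I have in mind, assuming that rowsum() with the default reorder = TRUE and table() both return the groups in the same sorted order so the counts line up with the rows:
# Group-wise column sums, then divide each row by the number of rows per gene
counts <- table(rownames(data))
aggr_data2 <- rowsum(data, group = rownames(data)) / as.vector(counts)
# aggr_data2 should match aggr_data above, just without the Group.1 column
I haven't benchmarked this on the full dataset, so I'm not sure whether it's the right direction for a matrix of this size, and any other suggestions are welcome.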
Thanks for your answers!