
I'm analyzing gene expression data from a large experiment (12,400 single cells and 23,800 genes) and I'm running into an efficiency problem. I'll give a reproducible example below, but my problem is the following:

I converted the mouse genes in my dataset to their human counterparts so I can compare with previously published data. In some cases there are multiple matches (one human gene maps to more than one mouse gene). In these cases, I'd like to average the expression values from the multiple genes and end up with a single expression value for the human counterpart. I'm able to achieve this by converting my expression data to matrix format (which allows duplicate row names) and applying the aggregate() function, but it takes a VERY long time to go through the large dataset. It is difficult to reproduce the exact situation here, but my mock analysis pipeline is below:

data <- as.matrix(data.frame(cell1 = c(1, 1, 1, 1, 3, 3),
                             cell2 = c(1, 2, 4, 10, 5, 10),
                             cell3 = c(0, 0, 0, 1, 10, 20),
                             cell4 = c(1, 3, 4, 4, 20, 20)))

# Adding gene names as rownames
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")



# Mock gene expression matrix
# Columns indicate expression values from individual cells
# Rows indicate genes 
data
#>      cell1 cell2 cell3 cell4
#> ABC1     1     1     0     1
#> ABC2     1     2     0     3
#> ABC2     1     4     0     4
#> ABC4     1    10     1     4
#> ABC5     3     5    10    20
#> ABC5     3    10    20    20



# Averaging gene expression values where there are multiple measurements for the same gene
aggr_data <- aggregate(data, by=list(rownames(data)), mean)

# End result I'm trying to achieve
aggr_data
#>   Group.1 cell1 cell2 cell3 cell4
#> 1    ABC1     1   1.0     0   1.0
#> 2    ABC2     1   3.0     0   3.5
#> 3    ABC4     1  10.0     1   4.0
#> 4    ABC5     3   7.5    15  20.0

Is there a more efficient way of doing this?

Thanks for your answers!

Atakan

2 Answers


You can try dplyr: summarise_all() with the mean() function computes the average of every column for each group.

library(tidyverse) # including dplyr
(df <-
  tibble(
    cell1 = c(1,1,1,1,3,3),
    cell2 = c(1, 2 ,4 ,10,5,10),
    cell3 = c(0,0,0,1,10,20),
    cell4 = c(1,3,4,4,20,20),
    gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
  ))
#> # A tibble: 6 x 5
#>   cell1 cell2 cell3 cell4 gene_name
#>   <dbl> <dbl> <dbl> <dbl> <chr>    
#> 1     1     1     0     1 ABC1     
#> 2     1     2     0     3 ABC2     
#> 3     1     4     0     4 ABC2     
#> 4     1    10     1     4 ABC4     
#> 5     3     5    10    20 ABC5     
#> 6     3    10    20    20 ABC5

I just added the gene names as an additional column. Now you can use group_by() for the grouped operation:

df %>%
  group_by(gene_name) %>% # for each group
  summarise_all(mean) # calculate mean for all columns
#> # A tibble: 4 x 5
#>   gene_name cell1 cell2 cell3 cell4
#>   <chr>     <dbl> <dbl> <dbl> <dbl>
#> 1 ABC1          1   1       0   1  
#> 2 ABC2          1   3       0   3.5
#> 3 ABC4          1  10       1   4  
#> 4 ABC5          3   7.5    15  20
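
(Side note: in dplyr 1.0 and later, across() supersedes summarise_all(); the same operation can be written as:)

# equivalent in dplyr >= 1.0, using across() in place of summarise_all()
df %>%
  group_by(gene_name) %>%
  summarise(across(everything(), mean))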

In general, for a data set as large as yours, the data.table package would be appropriate; the code looks like this:

library(data.table)
setDT(df)[, lapply(.SD, mean), by = gene_name]
#>    gene_name cell1 cell2 cell3 cell4
#> 1:      ABC1     1   1.0     0   1.0
#> 2:      ABC2     1   3.0     0   3.5
#> 3:      ABC4     1  10.0     1   4.0
#> 4:      ABC5     3   7.5    15  20.0

setDT() just converts df to a data.table object in place (by reference).
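
(Unlike as.data.table(), which copies the data, the in-place conversion leaves df itself as a data.table:)

setDT(df)   # no copy made; df is modified by reference
class(df)
#> [1] "data.table" "data.frame"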

dplyr vs data.table

If you bind copies of your data set into a benchmark set of roughly your scale, df_bench:
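
(A minimal sketch of one way to build such a set; the tiling factors and the randomized gene labels are illustrative, not the original construction:)

# tile the 6 x 4 toy values out to 18,000 rows x 10,000 cells
vals <- as.matrix(df[, c("cell1", "cell2", "cell3", "cell4")])
mat  <- do.call(cbind, rep(list(vals), 2500))   # 10,000 cell columns
mat  <- do.call(rbind, rep(list(mat), 3000))    # 18,000 gene rows
colnames(mat) <- paste0("cell", seq_len(ncol(mat)))
df_bench <- as_tibble(mat) %>%
  mutate(gene_name = paste0("ABC", sample(1000, n(), replace = TRUE)),
         .before = 1)   # .before = 1 puts gene_name first (dplyr >= 1.0)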

df_bench
#> # A tibble: 18,000 x 10,001
#>    gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 ABC308        1     1     0     1     1     1     0
#>  2 ABC258        1     2     0     3     1     2     0
#>  3 ABC553        1     4     0     4     1     4     0
#>  4 ABC57         1    10     1     4     1    10     1
#>  5 ABC469        3     5    10    20     3     5    10
#>  6 ABC484        3    10    20    20     3    10    20
#>  7 ABC813        1     1     0     1     1     1     0
#>  8 ABC371        1     2     0     3     1     2     0
#>  9 ABC547        1     4     0     4     1     4     0
#> 10 ABC171        1    10     1     4     1    10     1
#> # ... with 17,990 more rows, and 9,993 more variables:
#> #   cell8 <dbl>, cell9 <dbl>, cell10 <dbl>, …

Using this set for the benchmark:

microbenchmark::microbenchmark(
  DPLYR = {
    df_bench %>%
      group_by(gene_name) %>%
      summarise_all(mean)
  },
  DATATABLE = {
    setDT(df_bench)[, lapply(.SD, mean), by = gene_name]
  },
  times = 50
)
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval
#>      DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549    50
#>  DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257    50

data.table seems faster than dplyr here.
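
One caveat: setDT() converts df_bench by reference on its first run, so later dplyr iterations also receive a data.table. A sketch of a stricter setup (df_tbl and dt are names introduced here for illustration):

# convert once, outside the timings, so each method gets its own input
df_tbl <- as_tibble(df_bench)
dt     <- as.data.table(df_bench)
microbenchmark::microbenchmark(
  DPLYR     = df_tbl %>% group_by(gene_name) %>% summarise_all(mean),
  DATATABLE = dt[, lapply(.SD, mean), by = gene_name],
  times = 50
)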

younggeun

Using data.table should work pretty well:

library(data.table)
# group rows by the (duplicated) matrix rownames, averaging each column
as.data.table(data)[, lapply(.SD, mean), by = .(rownames(data))]
#   rownames cell1 cell2 cell3 cell4
#1:     ABC1     1   1.0     0   1.0
#2:     ABC2     1   3.0     0   3.5
#3:     ABC4     1  10.0     1   4.0
#4:     ABC5     3   7.5    15  20.0
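
If you'd rather have a properly named grouping column, recent data.table versions also let you keep the matrix rownames at conversion time:

# keep.rownames = "gene" stores the rownames in a column named "gene"
as.data.table(data, keep.rownames = "gene")[, lapply(.SD, mean), by = gene]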

A quick SO search dug up a link to speed comparisons for group-by operations (data.table is the fastest for large data):

Calculate the mean by group

Mike H.
  • Thanks for your answer. Somehow, I missed the link you shared during my search. Very good info there! – Atakan Nov 12 '18 at 17:13