
I have a data frame of about 60 million observations. What would be the most efficient way of counting the number of observations by group in R? I tried group_by() %>% summarize(n = n()) and count(). Both are taking too long on my Windows 10 PC (i9-9900k, 64 GB). I would appreciate your tips. Thank you.

Paul

1 Answer


data.table might be a bit more efficient.
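For a data frame that already exists (as in your case), a minimal sketch of the data.table approach: setDT() converts the data frame by reference (no copy, which matters at 60 million rows), and the special symbol .N counts rows per group. The column names here (group) are assumptions standing in for your actual grouping column.

```r
library(data.table)

# toy stand-in for the real 60M-row data frame
df <- data.frame(group = sample(LETTERS[1:3], 100, replace = TRUE),
                 stringsAsFactors = FALSE)

setDT(df)                        # convert to data.table in place, no copy
counts <- df[, .N, by = group]   # one row per group; column N holds the count
counts
```

Using setDT() rather than data.table(df) avoids duplicating a large object in memory before counting.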

Edit: benchmark

Edit #2: Expanded benchmark; used data.frame instead of data.table

library(microbenchmark)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
set.seed(123)
df <- data.frame(group=sample(LETTERS[1:26], 1e7, replace=TRUE),
                 row1=rnorm(1e7), stringsAsFactors = FALSE)
ti <- tibble(df) # for dplyr
DT <- data.table(df) # for data.table

microbenchmark(data.table=DT[, .N, by=group],
    dplyr=ti %>% group_by(group) %>% tally(),
    tabulate=tabulate(factor(df$group)),
    table=table(df$group),
    times=10L)
#> Unit: milliseconds
#>        expr       min        lq     mean    median       uq      max neval  cld
#>  data.table  70.89869  83.70017 102.9951  85.22042 129.7559 165.3728    10 a   
#>       dplyr 163.26062 166.69943 178.0351 171.58726 173.7652 239.7959    10  b  
#>    tabulate 278.72801 289.18787 296.3020 294.53976 301.6141 323.6547    10   c 
#>       table 466.70126 499.04382 518.5858 509.88502 517.6628 586.3363    10    d

Created on 2020-07-18 by the reprex package (v0.3.0)

user12728748