I have a data frame of about 60 million observations. What would be the most efficient way of counting the number of observations by group in R? I tried group_by() %>% summarize(n = n()) and count(). Both take too long on my Windows 10 PC (i9-9900K, 64 GB RAM). I would appreciate any tips. Thank you.
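For reference, the two approaches I tried look like this (a minimal sketch; `g` stands in for my actual grouping column):

library(dplyr)

# both return one row per group with the group size
df %>% group_by(g) %>% summarize(n = n())
df %>% count(g)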
Maybe data.table might be a bit more efficient.
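The core idiom is a one-liner (a minimal sketch, assuming your data is in a data.frame `df` with a grouping column named `group`):

library(data.table)

DT <- as.data.table(df)  # or setDT(df) to convert in place without copying
DT[, .N, by = group]     # .N is the number of rows in each group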
Edit: added a benchmark.
Edit #2: expanded the benchmark; used a data.frame instead of a data.table as the source object.
library(microbenchmark)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
set.seed(123)
df <- data.frame(group=sample(LETTERS[1:26], 1e7, replace=TRUE),
row1=rnorm(1e7), stringsAsFactors = FALSE)
ti <- tibble(df) # for dplyr
DT <- data.table(df) # for data.table
microbenchmark(data.table=DT[, .N, by=group],
dplyr=ti %>% group_by(group) %>% tally(),
tabulate=tabulate(factor(df$group)),
table=table(df$group),
times=10L)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> data.table 70.89869 83.70017 102.9951 85.22042 129.7559 165.3728 10 a
#> dplyr 163.26062 166.69943 178.0351 171.58726 173.7652 239.7959 10 b
#> tabulate 278.72801 289.18787 296.3020 294.53976 301.6141 323.6547 10 c
#> table 466.70126 499.04382 518.5858 509.88502 517.6628 586.3363 10 d
Created on 2020-07-18 by the reprex package (v0.3.0)

– user12728748
Upvoted. You could include a timing, since speed is the question's motivation. Mine is 1s. – Rui Barradas Jul 18 '20 at 22:03
For the sake of comparison, you could include `table(DT$group)` from base R; in tests on my system it is the slowest of the three. Or, to be more representative of base R, use a `data.frame` as the input as well. – Daniel O Jul 19 '20 at 00:09