I have a data frame of about 60 million observations. What would be the most efficient way of counting the number of observations by group in R? I tried group_by() %>% summarize(n = n()) and count(). Both take too long on my Windows 10 PC (i9-9900K, 64 GB RAM). I would appreciate any tips. Thank you.
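For reference, the two approaches I tried look like this (a minimal sketch; `g` stands in for my actual grouping column):

library(dplyr)

# both return one row per group with the group size
df %>% group_by(g) %>% summarize(n = n())
df %>% count(g)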
Maybe data.table might be a bit more efficient.
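The core idiom is a one-liner (a minimal sketch, assuming your data is in a data.frame `df` with a grouping column named `group`):

library(data.table)

DT <- as.data.table(df)  # or setDT(df) to convert in place without copying
DT[, .N, by = group]     # .N is the number of rows in each group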
Edit: added a benchmark.
Edit #2: expanded the benchmark; used a data.frame instead of a data.table as the source object.
library(microbenchmark)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
set.seed(123)
df <- data.frame(group=sample(LETTERS[1:26], 1e7, replace=TRUE),
row1=rnorm(1e7), stringsAsFactors = FALSE)
ti <- tibble(df) # for dplyr
DT <- data.table(df) # for data.table
microbenchmark(data.table=DT[, .N, by=group],
dplyr=ti %>% group_by(group) %>% tally(),
tabulate=tabulate(factor(df$group)),
table=table(df$group),
times=10L)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> data.table 70.89869 83.70017 102.9951 85.22042 129.7559 165.3728 10 a
#> dplyr 163.26062 166.69943 178.0351 171.58726 173.7652 239.7959 10 b
#> tabulate 278.72801 289.18787 296.3020 294.53976 301.6141 323.6547 10 c
#> table 466.70126 499.04382 518.5858 509.88502 517.6628 586.3363 10 d
Created on 2020-07-18 by the reprex package (v0.3.0)

– user12728748
Upvoted. You could include a timing, since speed is the question's motivation. Mine is 1s. – Rui Barradas Jul 18 '20 at 22:03
For the sake of comparison, you could include `table(DT$group)` from base R; in tests on my system it is the slowest of the three. Or, to be more representative of base R, use a `data.frame` as the input as well. – Daniel O Jul 19 '20 at 00:09