Obviously, one of the amazing things about data.table is its blazing speed. However, I just ran into a case where dplyr outperformed data.table, which surprised me. My first instinct is that my data.table code as written is not optimal. Any ideas on a better way to write it?
library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
microbenchmark::microbenchmark(
## dplyr
starwars %>%
group_by(eye_color) %>%
filter(mass == min(mass)),
## data.table
as.data.table(starwars)[, .SD[mass == min(mass)], by = eye_color],
times = 100L
)
#> Unit: microseconds
#>                                                                expr      min        lq      mean   median       uq       max neval
#>     starwars %>% group_by(eye_color) %>% filter(mass == min(mass))  677.401  747.9005  963.1229  849.151  925.550  8820.301   100
#>  as.data.table(starwars)[, .SD[mass == min(mass)], by = eye_color] 4927.201 5087.7510 6029.7660 5485.051 6065.901 19713.301   100
Created on 2019-07-23 by the reprex package (v0.3.0)
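For what it's worth, one idiom I've seen recommended for this kind of per-group subset is to compute row indices with .I inside j and then subset the table once, rather than materialising .SD for every group. A minimal sketch under that assumption (dt and idx are just illustrative names; the grouping and filter match the benchmark above):

library(dplyr)       # only for the starwars data set
library(data.table)

dt <- as.data.table(starwars)

## Build a vector of row numbers in one grouped pass, then subset once.
## which() drops the NA comparisons that arise when min(mass) is NA,
## matching how data.table treats NA in a logical subset of .SD.
idx <- dt[, .I[which(mass == min(mass))], by = eye_color]$V1
dt[idx]

The per-group construction of .SD is usually what dominates in expressions like .SD[mass == min(mass)], so the index-based version tends to be considerably faster, though I haven't re-benchmarked it here.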