One of the main attractions of data.table is its speed. However, I just encountered a case where dplyr outperformed data.table, which surprised me. My first instinct is that my data.table as written is not optimal. Any ideas on a better way to write this data.table code?

library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
microbenchmark::microbenchmark(
  ## dplyr
  starwars %>% 
    group_by(eye_color) %>% 
    filter(mass == min(mass)),
  ## data.table
  as.data.table(starwars)[, .SD[mass == min(mass)], by = eye_color],
  times = 100L
)
#> Unit: microseconds
#>                                                               expr
#>     starwars %>% group_by(eye_color) %>% filter(mass == min(mass))
#>  as.data.table(starwars)[, .SD[mass == min(mass)], by = eye_color]
#>       min        lq      mean   median       uq       max neval
#>   677.401  747.9005  963.1229  849.151  925.550  8820.301   100
#>  4927.201 5087.7510 6029.7660 5485.051 6065.901 19713.301   100

Created on 2019-07-23 by the reprex package (v0.3.0)

boshek
  • This seems to have been discussed already in [Extract row corresponding to minimum value of a variable by group](https://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group). – FENG QI Jul 23 '19 at 18:17
  • "My first instinct is that my data.table as written is not optimal" -- yeah, `as.data.table` takes time and really should not be considered part of the timing (since if you're using a data.table, you have one already); and `.SD[cond]` is known to be slower than the `.I` trick or a "self join", both found in the links. Btw, starwars is probably too small a table to compare timings that will matter (ie, involve real time waiting for them to complete). – Frank Jul 23 '19 at 19:01
  • @Frank I'd respectfully disagree that `as.data.table` should be excluded from the timing. There are many instances where one isn't provisioned an object with a `data.table` class. Turning it into a `data.table` is often an inescapable part of the workflow. – boshek Jul 23 '19 at 20:28
  • Ok. FYI, with data arriving in a non-exotic data.frame format, you can use `setDT(dat)` to convert it to data.table in-place almost instantly. (This doesn't work on starwars, since it is a built-in dataset that blocks in-place modification.) – Frank Jul 23 '19 at 20:36
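
The `.I` approach and the `setDT()` conversion mentioned in the comments can be sketched roughly as follows. This is a minimal sketch, not a benchmark: `dat` and its values are made up to stand in for a data.frame handed over from elsewhere, and the `which()` wrapper is an added guard (not taken from the comments) so that a group whose `mass` contains `NA` contributes no rows, as with `filter()`, rather than `NA` row indices.

```r
library(data.table)

# Hypothetical toy data standing in for a plain data.frame received from elsewhere
dat <- data.frame(
  name      = c("a", "b", "c", "d", "e", "f"),
  eye_color = c("blue", "blue", "brown", "brown", "brown", "red"),
  mass      = c(77, 32, 120, 80, 80, NA)
)

# Convert in place (no copy), outside of any timed expression
setDT(dat)

# .I holds row numbers in the full table; collect the rows matching each
# group's minimum. which() drops NA comparisons (the "red" group here).
idx <- dat[, .I[which(mass == min(mass))], by = eye_color]$V1

# Subset once using the collected row numbers
res <- dat[idx]
res$name  # "b" "d" "e" (ties within a group are all kept)
```

For a comparison against the dplyr version, it is probably fairer to time only the last two steps, and on a table much larger than starwars, as suggested in the comments above.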
