I started migrating some of my code from dplyr to dtplyr today, and as I did it dawned on me that it would be relatively simple to copy and paste the generated data.table code out of the "dtplyr_step_mutate" / "dtplyr_step" object that lazy_dt() produces, so long as I don't call as_tibble(). From there I could comment out the dtplyr code, preserving the readability of dplyr while getting the speed of data.table.
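For example, something like this (a minimal sketch on a toy table; show_query() prints the translated data.table call, and printing the lazy step itself shows the same call):

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

dt <- data.table(g = c("a", "a", "b"), v = 1:3)

# build the lazy pipeline, then print the data.table code dtplyr generated
dt %>%
  lazy_dt() %>%
  group_by(g) %>%
  summarise(v = mean(v)) %>%
  show_query()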
However, as I did this, I noticed that the data.table code dtplyr generated was sometimes slower than the dtplyr pipeline that produced it. This seems to go against the intuition that dtplyr should always be slower than plain data.table, because of the translation overhead. Any ideas why this might be the case? The documentation notes that the translation of dplyr::mutate() to data.table terminology is less than ideal. Could that be the reason?
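For what it's worth, my (possibly wrong) reading of that caveat is that it is about copying: lazy_dt() treats its input as immutable by default, so a translated mutate() has to copy() the whole data.table before assigning by reference. A sketch of the difference, using the documented immutable argument of lazy_dt():

dt <- data.table(a = 1:5)

# default: the input is treated as immutable, so the translation
# includes a full copy() before the := assignment
# (prints something like: copy(`_DT1`)[, `:=`(b = a * 2)])
dt %>% lazy_dt() %>% mutate(b = a * 2) %>% show_query()

# immutable = FALSE lets mutate() update dt by reference, skipping the copy
dt %>% lazy_dt(immutable = FALSE) %>% mutate(b = a * 2) %>% show_query()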
reprex:
library(tibble)
library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
library(stringr)
library(magrittr)
library(microbenchmark)
test <- data.table(id = 1:2600000, a = rnorm(2600000), b = letters %>% rep(100000))
test2 <- data.table(id = 1:2600000, c = rnorm(2600000))
# copy the data.table call from the printed lazy step and paste it into test4
test3 <- test %>% lazy_dt() %>%
  filter(str_detect(b, '[a-d]')) %>% distinct(b, .keep_all = TRUE) %>%
  inner_join(test2, by = 'id') %>% mutate(., x = rnorm(nrow(.))) #%>% as_tibble()
# change the temporary table names in the pasted code to the actual table names
test4 <- merge(unique(test[str_detect(b, "[a-d]")], by = "b"), test2,
  all = FALSE, by.x = "id", by.y = "id",
  allow.cartesian = TRUE)[, `:=`(x = rnorm(nrow(.SD)))]
microbenchmark(
  dtplyr = test %>% lazy_dt() %>%
    filter(str_detect(b, '[a-d]')) %>% distinct(b, .keep_all = TRUE) %>%
    inner_join(test2, by = 'id') %>% mutate(., x = rnorm(nrow(.))) %>% as_tibble(),
  data.table = merge(unique(test[str_detect(b, "[a-d]")], by = "b"), test2,
    all = FALSE, by.x = "id", by.y = "id",
    allow.cartesian = TRUE)[, `:=`(x = rnorm(nrow(.SD)))]
)
Unit: milliseconds
expr min lq mean median uq max neval
dtplyr 465.2541 482.8849 502.2006 491.0723 503.6764 784.2746 100
data.table 463.0258 480.6790 504.4545 494.1416 520.1842 588.5371 100
Interestingly, dtplyr was not consistently slower than data.table here, and I am not sure why. One difference I can see is that the dtplyr branch also pays for the as_tibble() conversion, which the raw data.table branch does not; a more like-for-like comparison is sketched below.
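To rule out the conversion cost (my assumption, not something the docs claim is the cause), both branches can be made to return a data.table, e.g. by collecting the dtplyr pipeline with as.data.table() instead of as_tibble():

microbenchmark(
  dtplyr = test %>% lazy_dt() %>%
    filter(str_detect(b, '[a-d]')) %>% distinct(b, .keep_all = TRUE) %>%
    inner_join(test2, by = 'id') %>% mutate(., x = rnorm(nrow(.))) %>%
    as.data.table(),  # collect to a data.table rather than a tibble
  data.table = merge(unique(test[str_detect(b, "[a-d]")], by = "b"), test2,
    all = FALSE, by.x = "id", by.y = "id",
    allow.cartesian = TRUE)[, `:=`(x = rnorm(nrow(.SD)))]
)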
EDIT: Rephrasing the question to be clearer. My thought process is that if I never have to show my code to anyone, then I should always use data.table, since it is faster than dtplyr. Based on my benchmarking, however, there seem to be instances where dtplyr is faster than data.table. Why? Intuition suggests this should never happen. Am I misunderstanding what dtplyr does?