
I started migrating some of my code from dplyr to dtplyr today, and as I did it dawned on me that it would be relatively simple to copy and paste the data.table code from the "dtplyr_step_mutate"/"dtplyr_step" object produced by calling lazy_dt() without calling as_tibble(). From there I could comment out the dtplyr code, preserving the readability of dplyr while getting the speed of data.table.
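
In other words, the workflow is roughly this (a minimal sketch on a toy table; show_query() is one way to print the generated data.table call that I then copy):

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

toy <- data.table(id = 1:3, v = c(10, 20, 30))   # toy table just for illustration

toy %>%
  lazy_dt() %>%               # builds a "dtplyr_step" object; nothing is executed yet
  filter(v > 10) %>%
  mutate(w = v * 2) %>%
  show_query()                # prints the generated data.table call, ready to copy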

However, as I did this, I noticed that the data.table code dtplyr generated was sometimes slower than the dtplyr pipeline that produced it. This seems to go against the intuition that dtplyr should always be slower than data.table because of translation costs. Any ideas why this might be the case? The documentation notes that translating dplyr::mutate() to data.table terminology is less than ideal. Is this perhaps the reason?

reprex:

library(tibble)
library(data.table)
library(dplyr,warn.conflicts = F)
library(dtplyr)
library(stringr)
library(magrittr)
library(microbenchmark)

test <- data.table(id = 1:2600000, a = rnorm(2600000), b = letters %>% rep(100000))
test2 <- data.table(id = 1:2600000, c = rnorm(2600000))


# copy the data.table call that test3 prints and paste it into test4 below
test3 <- test %>% lazy_dt() %>%
  filter(str_detect(b, '[a-d]')) %>% distinct(b, .keep_all = T) %>%
  inner_join(test2, by = 'id') %>% mutate(., x = rnorm(nrow(.))) #%>% as_tibble()

# pasted data.table code, with dtplyr's temporary table names replaced by the actual table names
test4 <- merge(unique(test[str_detect(b, "[a-d]")], by = "b"), test2, 
               all = FALSE, by.x = "id", by.y = "id", allow.cartesian = TRUE)[, `:=`(x = rnorm(nrow(.SD)))]

microbenchmark(dtplyr = test %>% lazy_dt() %>%
                 filter(str_detect(b, '[a-d]')) %>% distinct(b, .keep_all = T) %>%
                 inner_join(test2, by = 'id') %>% mutate(., x = rnorm(nrow(.))) %>% as_tibble(),
               data.table = merge(unique(test[str_detect(b, "[a-d]")], by = "b"), test2,
                                  all = FALSE, by.x = "id", by.y = "id", allow.cartesian = TRUE)[, `:=`(x = rnorm(nrow(.SD)))])


Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
     dtplyr 465.2541 482.8849 502.2006 491.0723 503.6764 784.2746   100
 data.table 463.0258 480.6790 504.4545 494.1416 520.1842 588.5371   100

It's interesting that dtplyr did not always turn out slower than data.table here, and I am not really sure why that is.

EDIT: Rephrasing the question to be clearer. My thought process is that if I never have to show my code to anyone, then I should always use data.table, since it is faster than dtplyr. Based on my benchmarking, however, it seems there are instances where dtplyr is faster than data.table. Why? Intuition suggests this should never happen. Am I misunderstanding what dtplyr does?
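
(To get a rough sense of how small the dtplyr layer itself can be, the sketch below times only building the lazy query from the reprex above, without as_tibble(), so nothing is collected or executed. It captures only the step-building part of the overhead, but it gives a feel for its size.)

# rough sketch: time only constructing the lazy query (no collection, so no data.table work runs)
microbenchmark(
  build_query_only = test %>% lazy_dt() %>%
    filter(str_detect(b, '[a-d]')) %>% distinct(b, .keep_all = T) %>%
    inner_join(test2, by = 'id') %>% mutate(., x = rnorm(nrow(.)))
)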

  • Are you really worried about a difference of 2 milliseconds? I'm not sure exactly what you are asking. The overhead of translating the calls would be tiny when compared to the work required to transform such a large amount of data. – MrFlick Feb 13 '20 at 17:23
  • @MrFlick I was more interested in why this was happening in the first place. I was under the impression dtplyr would always be slower than data.table. If this is not the case perhaps I am misunderstanding what dtplyr is doing in the first place? – mooboo Feb 13 '20 at 17:27
  • If you profile your code you will see that the `str_detect(b,'[a-d]')` part takes the majority of the time. If you really care about performance, use grep if you need partial matching, or, much faster: `merge(unique(test[b %chin% letters[1:4]], by = "b"), test2, all = FALSE, by = "id", allow.cartesian = TRUE)[, `:=`(x = rnorm(.N))]`. On my machine 30 ms vs 360... – user12728748 Feb 13 '20 at 18:06
  • @user12728748 neat. Thanks for the advice on speeding up the code. I'll have to look more into %chin%. – mooboo Feb 13 '20 at 18:20
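
(A minimal sketch of the kind of profiling user12728748 describes in the comment above, using base R's Rprof() on the reprex objects; prof_file and prof_res are just illustrative names:)

# rough profiling sketch: sample where the time goes in the dtplyr pipeline
prof_file <- tempfile()
Rprof(prof_file)
prof_res <- test %>% lazy_dt() %>%
  filter(str_detect(b, '[a-d]')) %>% distinct(b, .keep_all = T) %>%
  inner_join(test2, by = 'id') %>% mutate(., x = rnorm(nrow(.))) %>% as_tibble()
Rprof(NULL)
head(summaryRprof(prof_file)$by.total)  # str_detect is expected to dominate the samples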

0 Answers