As I understand it, data.table is usually more efficient and faster than dplyr, but today at work I ran into the opposite situation. I created a small simulation to reproduce it.
library(data.table)
library(dplyr)
library(microbenchmark)
# simulated data
set.seed(1)  # make the simulation reproducible
dt = data.table(A = sample(1:4247, 10000, replace = TRUE),
                B = sample(1:119,  10000, replace = TRUE),
                C = sample(1:6,    10000, replace = TRUE),
                D = sample(1:30,   10000, replace = TRUE))
# build a character ID from A, D and C
dt[, ID := paste(A, ":::", D, ":::", C)]
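For scale (a quick check, not timed): the three grouping keys can form about 4247 * 119 * 6 combinations, far more than the 10000 rows, so almost every (A, B, C) group here should contain only one or two rows:

# number of distinct (A, B, C) groups vs. number of rows
uniqueN(dt, by = c("A", "B", "C"))  # should be close to 10000
nrow(dt)                            # 10000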
# execution time
microbenchmark(
  DATA_TABLE = dt[, .(count = uniqueN(ID)), by = c("A", "B", "C")],
  DPLYR = dt %>%
    group_by(A, B, C) %>%
    summarise(count = n_distinct(ID)),
  times = 10
)
Results:

Unit: milliseconds
       expr         min          lq        mean      median          uq         max neval
 DATA_TABLE 14241.57361 14305.67026 15585.80472 14651.16402 16244.22477 21367.56866    10
      DPLYR    35.95123    37.63894    47.62637    48.56598    53.59919    62.63978    10
You can see the big difference! Does anyone know the reason? And do you have any advice on when to use dplyr versus data.table?
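In case it helps narrow this down, here is the comparison I was planning to run next (just a sketch; the label DATA_TABLE_BASE is my own, and I have not confirmed that uniqueN() itself is the bottleneck). It swaps uniqueN() for base R's length(unique()):

# same data.table aggregation, but counting with base length(unique())
# instead of uniqueN(), to check whether uniqueN() is the slow part
microbenchmark(
  DATA_TABLE_BASE = dt[, .(count = length(unique(ID))), by = c("A", "B", "C")],
  times = 10
)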
My full code is currently written in data.table syntax, and now I don't know whether I need to translate some chunks to dplyr because of this.
Thanks in advance.