
As I understand it, data.table is usually more efficient and faster than dplyr, but today at work I ran into the opposite situation. I created a simulation to illustrate it.

library(data.table)
library(dplyr)

library(microbenchmark)

# simulated data
dt = data.table(A = sample(1:4247, 10000, replace = TRUE),
                B = sample(1:119,  10000, replace = TRUE),
                C = sample(1:6,    10000, replace = TRUE),
                D = sample(1:30,   10000, replace = TRUE))

dt[, ID := paste(A, ":::", D, ":::", C)]
# execution time

microbenchmark(
  DATA_TABLE = dt[, .(count=uniqueN(ID)), 
                  by=c("A","B","C")
                  ],
  DPLYR      = dt %>% 
               group_by(A,B,C)  %>% 
               summarise(count = n_distinct(ID)),
  times = 10
              )

Results

Unit: milliseconds
       expr         min          lq        mean      median          uq         max neval
 DATA_TABLE 14241.57361 14305.67026 15585.80472 14651.16402 16244.22477 21367.56866    10
      DPLYR    35.95123    37.63894    47.62637    48.56598    53.59919    62.63978    10

You can see the big difference! Does anyone know the reason? Do you have any advice about when to use dplyr versus data.table?

My full code is written in data.table syntax, and now I don't know whether I need to translate some chunks of it to dplyr because of this situation.

Thanks in advance.

  • This thread probably has everything you need to know: https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly – Matt Mar 10 '20 at 17:53
  • @Matt Thanks a lot for sharing! I will read it – william giraldo cuadros Mar 10 '20 at 18:06
  • I ran your code and data.table was about 30 times *faster* than dplyr on my machine. – eipi10 Mar 10 '20 at 18:07
  • I can (somewhat) reproduce `data.table`-slower on my machine, R-3.5.3, dplyr-0.8.3, data.table-1.12.8. – r2evans Mar 10 '20 at 18:11
  • @eipi10 maybe a versions thing? I got dplyr ~75x faster with R 3.5.2, dplyr-0.8.3, data.table-1.12.6 – Gregor Thomas Mar 10 '20 at 18:13
  • @eipi10 to be honest, I'm really confused because last week I had the opposite situation with the same code (data.table faster than dplyr), but this week I ran into this issue (dplyr faster). So I thought I was doing something wrong. I will check the other packages in my full R project to see whether a new one I added masks some data.table functionality. Thanks a lot! – william giraldo cuadros Mar 10 '20 at 18:14
  • @GregorThomas, maybe so. I'm running R 3.6.2, dplyr-0.8.99.9000, data.table-1.12.8. – eipi10 Mar 10 '20 at 18:15
  • The difference is almost entirely `uniqueN` vs `n_distinct`. – Gregor Thomas Mar 10 '20 at 18:19
  • Even replacing `uniqueN` with `length(unique())` makes the methods fairly comparable (see the sketch after these comments). So I think the real question is *why is `uniqueN(ID)` so slow here, even compared to `length(unique(ID))`?* – Gregor Thomas Mar 10 '20 at 18:26
  • https://github.com/Rdatatable/data.table/issues/3739, *"uniqueN() is very slow compared to length(unique())"* – r2evans Mar 10 '20 at 18:40
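
For reference, here is a minimal sketch of the swap discussed in the comments, reusing the question's data and column names: it only replaces uniqueN() with length(unique()) and leaves the grouping unchanged, so the two timings are directly comparable.

# same grouping as in the question; only the distinct-count function differs
microbenchmark(
  DT_uniqueN = dt[, .(count = uniqueN(ID)),        by = c("A","B","C")],
  DT_unique  = dt[, .(count = length(unique(ID))), by = c("A","B","C")],
  times = 10
)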

1 Answer


Here is another option: order the data once by the grouping columns and derive the distinct-ID counts from run-length ids at the group boundaries, avoiding a per-group call to uniqueN():

cols <- c("A", "B", "C")   # grouping columns
dt[order(A, B, C), {
        # run-length id over (A, B, C, ID): increments whenever the combination changes
        uniqn <- rleidv(c(.SD, .(ID)))
        # last row of each (A, B, C) group: where the within-group row counter resets, plus .N
        lastidx <- c(which(diff(rowidv(.SD)) < 1L), .N)
        # distinct-ID count per group = difference of run-length ids at the group boundaries
        c(.SD[lastidx], .(count=c(uniqn[lastidx[1L]], diff(uniqn[lastidx]))))
    }, .SDcols=cols]

timing code:

cols <- c("A","B","C")
microbenchmark(times=1L,

  DATA_TABLE = a00 <- dt[, .(count=uniqueN(ID)), cols],

  DATA_TABLE1 = a01 <- dt[, .(count=length(unique(ID))), cols],

  DPLYR      = a_dplyr <- dt %>%
    group_by(A,B,C)  %>%
    summarise(count = n_distinct(ID)),

  #https://github.com/Rdatatable/data.table/issues/1120#issuecomment-463584656
  mtd0 = a10 <- unique(dt, by=c(cols, "ID"))[, .(count=.N), cols],

  #https://github.com/Rdatatable/data.table/issues/1120#issuecomment-463597107
  mtd1 = a11 <- dt[, .N, c(cols, "ID")][, .(count=.N), cols],

  mtd2 = a2 <- dt[order(A, B, C), {
    uniqn <- rleidv(c(.SD, .(ID)))
    lastidx <- c(which(diff(rowidv(.SD))<1L), .N)
    c(.SD[lastidx], .(count=c(uniqn[lastidx[1L]], diff(uniqn[lastidx]))))
  }, .SDcols=cols]
)

checks:

> fsetequal(a00, a01)
[1] TRUE

> fsetequal(a00, setDT(a_dplyr))
[1] TRUE

> fsetequal(a00, a10)
[1] TRUE

> fsetequal(a00, a11)
[1] TRUE

> fsetequal(a00, a2)
[1] TRUE

timings for the particular dataset below:

Unit: milliseconds
        expr         min          lq        mean      median          uq         max neval
  DATA_TABLE 147478.1089 147478.1089 147478.1089 147478.1089 147478.1089 147478.1089     1
 DATA_TABLE1   4998.8236   4998.8236   4998.8236   4998.8236   4998.8236   4998.8236     1
       DPLYR 244081.6925 244081.6925 244081.6925 244081.6925 244081.6925 244081.6925     1
        mtd0   4519.4046   4519.4046   4519.4046   4519.4046   4519.4046   4519.4046     1
        mtd1   2866.5808   2866.5808   2866.5808   2866.5808   2866.5808   2866.5808     1
        mtd2    809.7442    809.7442    809.7442    809.7442    809.7442    809.7442     1

data with 1 million rows:

#R-3.6.1 64bit Win10
library(data.table)  #data.table_1.12.8 getDTthreads()==4
library(dplyr)  #dplyr_1.0.0
library(microbenchmark)

#  data simulated
set.seed(0L)
nr <- 1e6
dt = data.table(A = sample(1:424700,nr, replace = T),
  B = sample(1:11900,nr, replace = T),
  C = sample(1:600, nr, replace = T),
  D = sample(1:3000, nr, replace = T))
dt[,ID:=paste(A,":::",D,":::",C)]
chinsoon12
  • This should be used by default in data.table instead of uniqueN. – skan Jun 19 '20 at 00:23
  • @skan it depends on the dataset. Applying unique before uniqueN might be faster. Sometimes applying count twice might also be faster (see the sketch below). – chinsoon12 Jun 19 '20 at 00:33
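
For illustration, a minimal sketch of the two alternatives mentioned in the comment above, written against the question's column names; they correspond to mtd0 and mtd1 in the answer's benchmark.

# "unique before": dedupe on the grouping columns plus ID, then count rows per group (mtd0)
unique(dt, by = c("A", "B", "C", "ID"))[, .(count = .N), by = c("A", "B", "C")]

# "count twice": first per (A, B, C, ID), then per (A, B, C) (mtd1)
dt[, .N, by = c("A", "B", "C", "ID")][, .(count = .N), by = c("A", "B", "C")]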