Getting a performance gain out of setkey

Question

I've read the documentation vignette("datatable-intro") and a couple of online resources on the subject such as this.

But I'm still struggling to see a performance gain with setkey using data.table (although I am finding it much faster than base [ and dplyr::filter even without the key set).

Any ideas why? Is my example too small/simple to see a performance gain? Or am I doing something wrong?

I was expecting dtk, using the key, to be faster.

library(data.table)
library(microbenchmark)

df = data.frame(l = letters, n = 1:26)
df = do.call(rbind, replicate(1e4, df, FALSE))
dfdt = dfdtk = data.table(df)
setkey(dfdtk, l)

mb = microbenchmark(times = 10, unit = "s",
  base = df[df$l == "a",],
  dt = dfdt[l == "a"],
  dtl = dfdt[list("a")],
  dtk = dfdtk[list("a")]
)
plot(mb)

jangorecki · Accepted Answer · 2016-02-16T14:26:26.537

I've found 3 places to fix or improve.

1. The = operator does not copy a data.table, so dfdt and dfdtk refer to the same object and the setkey that follows sets the key on both. An explicit copy() is necessary.
2. An ad-hoc join without a key (using on=) cannot yet use an index, but that is planned.
3. dfdt[l == "a"] builds an index on the first call and reuses it afterwards, so it is worth adding a benchmark with auto-indexing switched off.
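
The first point is easy to check on a toy table. This is a minimal sketch, assuming a reasonably recent data.table; the names a, b and d are made up for illustration and are not part of the benchmark.

```r
library(data.table)

a = data.table(l = c("b", "a", "c"), n = 1:3)
b = a          # plain assignment: b is the same data.table as a, not a copy
d = copy(a)    # copy() gives an independent data.table

setkey(b, l)   # sorts and keys by reference, so a is affected too

key(a)         # "l"  (keyed through b)
key(d)         # NULL (the copy is untouched)
a$l            # "a" "b" "c" (a was reordered as well)
d$l            # "b" "a" "c"
```

This is why the benchmark in the question compared two keyed tables rather than a keyed and an unkeyed one.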

library(data.table)
library(microbenchmark)
op = options("datatable.auto.index" = TRUE) # default!

df = data.frame(l = letters, n = 1:26)
df = do.call(rbind, replicate(1e4, df, FALSE))
dfdtk = as.data.table(df)
dfdt = copy(dfdtk) # fix #1
setkeyv(dfdtk, "l")
stopifnot(
    is.null(key(dfdt)),
    key(dfdtk) == "l"
)

mb = microbenchmark(times = 10, unit = "s",
                    base = df[df$l == "a",],
                    dt = dfdt[l == "a"],
                    dtl = dfdt[list("a"), on = c("l" = "V1")], # fix #2
                    dtk = dfdtk[list("a")])
print(mb)
#Unit: seconds
# expr         min          lq         mean       median          uq         max neval
# base 0.016255351 0.017294076 0.0177331871 0.0178513590 0.018269365 0.019392296    10
#   dt 0.000792324 0.000819030 0.0011565028 0.0009645955 0.001056976 0.002278742    10
#  dtl 0.001625836 0.001657269 0.0018865184 0.0019408475 0.002009650 0.002196615    10
#  dtk 0.000566798 0.000598538 0.0007664731 0.0007530190 0.000897327 0.001008621    10

options("datatable.auto.index" = FALSE) # fix #3
stopifnot(
    # key2() and set2keyv() were later renamed to indices() and setindexv()
    indices(dfdt) == "l",
    is.null(indices(setindexv(dfdt, NULL)))
)
mb = microbenchmark(times = 10, unit = "s",
                    base = df[df$l == "a",],
                    dt = dfdt[l == "a"],
                    dtl = dfdt[list("a"), on = c("l" = "V1")],
                    dtk = dfdtk[list("a")])
print(mb)
#Unit: seconds
# expr         min          lq         mean       median          uq         max neval
# base 0.015935139 0.017397039 0.0253407267 0.0180737620 0.019560766 0.090317493    10
#   dt 0.014194243 0.014292279 0.0153187365 0.0153102030 0.015997166 0.016689574    10
#  dtl 0.001628532 0.001774283 0.0020169391 0.0018818880 0.001935386 0.003697506    10
#  dtk 0.000556702 0.000653134 0.0006869461 0.0006898765 0.000764199 0.000780357    10

options(op)
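
The difference in the dt row between the two runs comes from auto-indexing. As a rough sketch of what happens behind the scenes, assuming a current data.table where indices() and setindex() are the index helpers (the table x here is hypothetical, not the one benchmarked above):

```r
library(data.table)

old = options(datatable.auto.index = TRUE)  # the default, set explicitly here

x = data.table(l = rep(letters, 1e4), n = rep(1:26, 1e4))

indices(x)         # NULL: nothing has been indexed yet
r1 = x[l == "a"]   # the first == subset builds an index on l as a side effect
indices(x)         # "l"
r2 = x[l == "a"]   # later subsets on l can reuse that index

setindex(x, NULL)  # drop all indices again
indices(x)         # NULL

options(old)       # restore the previous setting
```

With datatable.auto.index set to FALSE, the first subset would leave indices(x) as NULL, which is the situation the second benchmark measures.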

Sorry for the stopifnots, too much unit testing...
