
I'm trying to do some simple filtering on an existing data.table object that I don't really want to modify. I noticed that it seems dramatically slower than the base R equivalent, as in the following example:

library(data.table)
dt <- data.table(
  label = sample(c("l1", "l2", "l3"), 300, replace = TRUE),
  x = rnorm(300)
)
df <- as.data.frame(dt)
all.equal(dt[label == "l3", x], df[df$label == "l3", "x"])

(bench <- microbenchmark::microbenchmark(
  dt_lookup = dt[label == "l3", x],
  df_lookup = df[df$label == "l3", "x"],
  mixed = dt[dt$label == "l3", "x"],
  times = 1000
))

which yields

Unit: microseconds
      expr    min     lq      mean median     uq    max neval
 dt_lookup 1159.2 1393.0 1529.4477 1451.6 1524.2 6487.9  1000
 df_lookup   17.3   25.2   33.8164   32.0   36.4  150.4  1000
     mixed  140.9  175.2  204.8512  193.9  220.7 1533.9  1000

That is, base R is more than 30 times faster.

Am I doing something wrong here? Indices and keys do not seem to have much effect on the performance in this case.
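
For reference, the keyed variant I tried looks roughly like the sketch below (on a copy, since I don't want to modify the original table); it did not change the timings much for me, presumably because every call still goes through the full `[.data.table` argument handling:

dt_keyed <- copy(dt)      # deep copy so the original dt stays untouched
setkey(dt_keyed, label)   # sort by label and mark it as the key

# keyed (binary search) subset -- the lookup itself is fast, but each
# call still pays the fixed per-call overhead of `[.data.table`
dt_keyed[.("l3"), x]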

Atheriel
  • Differences become smaller in much larger tables, so presumably this is caused by overhead from parsing the more feature-rich data.table syntax. But does it really matter when timings are in the millisecond range? – Axeman Nov 28 '19 at 19:08
  • Set key on your data table. But also strongly agree with Axeman - generate data big enough for a time difference to matter to test time differences. When I switch your example to 5M rows, data.table is 3x faster. – Gregor Thomas Nov 28 '19 at 19:18
  • The size of the data in the example is the size of my actual lookup table in this case, so better scaling from data.table is not all that useful. Is there some way to avoid the overhead? And to answer your question -- does 1 millisecond matter -- the answer is emphatically "yes" in my case. – Atheriel Nov 28 '19 at 19:31
  • Anytime you call `data.table[...]`, there's a performance cost because there's a lot of overhead. You can do what Arun suggests for similar speed, or you could refactor to try to minimize the times you need to call `[`. – Cole Nov 28 '19 at 20:28
  • See also https://stackoverflow.com/questions/56399822/why-do-parentheses-slow-down-my-program-in-r and, probably more relevant, https://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r, specifically the performance lost by extracting vectors from data.frames in loops. – Cole Nov 28 '19 at 20:36
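
A small sketch of what Cole's comments point at, assuming the lookup is repeated (the loop over all three labels below is just an illustration): extract the underlying vectors once, after which each lookup is plain vector subsetting and pays no `[.data.table` overhead.

labels <- dt[["label"]]   # pull the columns out once
values <- dt[["x"]]

# repeated lookups now avoid `[.data.table` entirely
result <- lapply(c("l1", "l2", "l3"), function(lbl) values[labels == lbl])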

2 Answers


Instead of subsetting the data.frame, extract the column first with `[[` and then subset the rows:

df[["x"]][df[["label"]] == "l3"]
akrun
  • Note that this is equivalent to bypassing `data.table` entirely by manipulating the underlying vectors, but it does work. – Atheriel Nov 28 '19 at 21:04

You can use a better data structure such as a list. In your example, you don't need to scan the entire vector each time if the lookup table is structured hierarchically:

tb = list(l1 = rnorm(300),
          l2 = rnorm(300),
          l3 = rnorm(300))

tb[["l3"]] # accessed in 1 micro sec
JRR
  • As I said in the question, I don't want to modify the original object, so a different structure is not really the solution I'm looking for. – Atheriel Nov 28 '19 at 20:50
  • As you want, but if speed matters that much, an adequate data structure is the solution. Other solutions are trade-offs. – JRR Nov 28 '19 at 21:07