
I have a data.table that contains daily observations at many points in space:

#    day point_id         x         y      var1      var2       var3
# 1:   1        1 0.8179541 0.0220291 0.0903821 0.7306495 0.52508116
# 2:   1        2 0.1798340 0.8267741 0.5634569 0.6738693 0.88823133
# 3:   1        3 0.4204264 0.7223463 0.4948849 0.6911563 0.27390131
# ...

I wrote a convenience function to group by `point_id` and summarize the values in a column. I am using `get(col_name)` to identify the column I want to summarize:

summarize <- function(dtable, col_name) {
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(get(col_name)),
    mean = mean(get(col_name)),
    max = max(get(col_name))
  ), by = point_id]
}

My function is significantly slower than specifying the column directly:

system.time(summarize(dtable, "var1"))
#  user  system elapsed
# 1.140   0.000   1.139

system.time(
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(var1),
    mean = mean(var1),
    max = max(var1)
  ), by = point_id]
)
#  user  system elapsed 
# 0.344   0.000   0.344 

Why is this, and what is the best way to speed up the function?

I could construct the expression as a string, substitute in the desired column name, then parse and eval it (sketched below), but I imagine there is a better way.
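For concreteness, here is a minimal sketch of that parse-and-eval idea (the helper name `summarize_parse` is purely illustrative):

summarize_parse <- function(dtable, col_name) {
  # Build the whole j expression as text with the column name pasted in,
  # then parse it into a single call that data.table can evaluate
  j_call <- parse(text = sprintf(
    ".(x = mean(x), y = mean(y), min = min(%s), mean = mean(%s), max = max(%s))",
    col_name, col_name, col_name
  ))[[1]]
  dtable[, eval(j_call), by = point_id]
}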

Complete example:

library(data.table)

# Simulate some data
points <- data.table(
  id = 1:50000,
  x = runif(50000),
  y = runif(50000)
)
dtable <- CJ(day = 1:100, point_id = points$id)[points, on = c(point_id = "id")]
dtable[, var1 := runif(nrow(dtable))]
dtable[, var2 := runif(nrow(dtable))]
dtable[, var3 := runif(nrow(dtable))]
setkey(dtable, day, point_id)

# This is fast
system.time(
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(var1),
    mean = mean(var1),
    max = max(var1)
  ), by = point_id]
)
#  user  system elapsed 
# 0.344   0.000   0.344 

# Why is this slower?
summarize <- function(dtable, col_name) {
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(get(col_name)),
    mean = mean(get(col_name)),
    max = max(get(col_name))
  ), by = point_id]
}
system.time(summarize(dtable, "var1"))
#  user  system elapsed
# 1.140   0.000   1.139
ihough

  • Because `get` is a very expensive function. [I've asked a similar question](https://stackoverflow.com/questions/27677283/evaluating-both-column-name-and-the-target-value-within-j-expression-within-d) long ago, and it seems you will be better off with `eval(as.name(col_name))` instead of `get(col_name)`. – David Arenburg Jun 03 '18 at 12:46
  • @DavidArenburg, I think the overhead of `get` itself is only part of the issue. In the first case, the j expression is optimized with GForce, while in the second case it is not (see the info with `verbose = TRUE`). If we turn off optimization with `options(datatable.optimize = 0L)`, the first way takes almost the same time as the second (0.708 vs. 0.744 on my machine). – mt1022 Jun 03 '18 at 12:57
  • @mt1022 hmm, good point, I hadn't thought of that. – David Arenburg Jun 03 '18 at 12:58
  • Thanks @DavidArenburg. FYI, on my machine `eval(as.name(col_name))` is slightly slower than `get(col_name)` (1.41 vs. 1.14). I'll try to avoid either when working with a large dataset. – ihough Jun 03 '18 at 16:16
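
Putting the two comments together, one sketch that avoids both `get` and string parsing is to substitute the column symbol into a prebuilt j call and pass it through `eval`, which should leave the full expression visible to data.table's GForce optimization (the name `summarize_eval` is illustrative, and timings may vary by data.table version):

summarize_eval <- function(dtable, col_name) {
  # Splice the column symbol into the complete j call up front ...
  j_call <- substitute(
    .(x = mean(x), y = mean(y),
      min = min(COL), mean = mean(COL), max = max(COL)),
    list(COL = as.name(col_name))
  )
  # ... so data.table sees the whole expression and can optimize it
  dtable[, eval(j_call), by = point_id]
}

To check whether GForce actually kicks in, run the call with `options(datatable.verbose = TRUE)` set and compare the optimization report against the hand-written version.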
