I have a data.table that contains daily observations at many points in space:
# day point_id x y var1 var2 var3
# 1: 1 1 0.8179541 0.0220291 0.0903821 0.7306495 0.52508116
# 2: 1 2 0.1798340 0.8267741 0.5634569 0.6738693 0.88823133
# 3: 1 3 0.4204264 0.7223463 0.4948849 0.6911563 0.27390131
# ...
I wrote a convenience function to group by point id and summarize the values in a column. I am using get(col_name)
to identify the column I want to summarize:
summarize <- function(dtable, col_name) {
dtable[, .(
x = mean(x),
y = mean(y),
min = min(get(col_name)),
mean = mean(get(col_name)),
max = max(get(col_name))
), by = point_id]
}
My function is significantly slower than specifying the column directly:
system.time(summarize(dtable, "var1"))
# user system elapsed
# 1.140 0.000 1.139
system.time(
dtable[, .(
x = mean(x),
y = mean(y),
min = min(var1),
mean = mean(var1),
max = max(var1)
), by = point_id]
)
# user system elapsed
# 0.344 0.000 0.344
Why is this, and what is the best way to speed up the function?
I could construct the expression as a string, substitute in the desired column name, then parse
and eval
but I imagine there is a better way.
Complete example:
library(data.table)
# Simulate some data
points <- data.table(
id = 1:50000,
x = runif(50000),
y = runif(50000)
)
dtable <- CJ(day = 1:100, point_id = points$id)[points, on = c(point_id = "id")]
dtable[, var1 := runif(1:nrow(dtable))]
dtable[, var2 := runif(1:nrow(dtable))]
dtable[, var3 := runif(1:nrow(dtable))]
setkey(dtable, day, point_id)
# This is fast
system.time(
dtable[, .(
x = mean(x),
y = mean(y),
min = min(var1),
mean = mean(var1),
max = max(var1)
), by = point_id]
)
# user system elapsed
# 0.344 0.000 0.344
# Why is this slower?
summarize <- function(dtable, col_name) {
dtable[, .(
x = mean(x),
y = mean(y),
min = min(get(col_name)),
mean = mean(get(col_name)),
max = max(get(col_name))
), by = point_id]
}
system.time(summarize(dtable, "var1"))
# user system elapsed
# 1.140 0.000 1.139