The problem is well known: unlike data.frames, where one can refer to columns through character variables, the default behaviour of data.table is to expect actual column names (e.g. you cannot do DT[, "X"], but must do DT[, X], if your table has a column named "X").
This is a problem in some cases, because one wants to handle a generic dataset with arbitrary, user-defined column names.
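To make the contrast concrete, here is a minimal sketch on a toy table. The objects df, dt and col are made up for illustration and are not part of my real code.

```r
library(data.table)

# Toy objects, hypothetical names used only for this illustration
df <- data.frame(X = 1:3)
dt <- as.data.table(df)
col <- "X"

# data.frame: a character variable can name the column
from_df <- df[, col]

# data.table: j is evaluated with the columns visible as variables
from_dt <- dt[, X]

# List-style extraction accepts a character name for both classes
from_list <- dt[[col]]
```

All three calls return the same integer vector here; the difficulty only starts when the name is held in a variable and has to be used inside j or by.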
I saw a couple of posts about this:
Pass column name in data.table using variable
Select / assign to data.table when variable names are stored in a character vector
And the official FAQ says I should use with = FALSE.
I do not really understand the quote + eval method, and the one with .. gave an error before it did anything at all.
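For reference, this is a minimal sketch of what these three idioms appear to do on a toy table, assuming a reasonably recent data.table version. The names toy, col and expr are hypothetical. As far as I can tell, with = FALSE and .. only select columns in j; they do not evaluate an expression such as mean().

```r
library(data.table)

# Toy table; all names here are hypothetical
toy <- data.table(X = c(1, 2, 3), Y = c("a", "b", "c"))
col <- "X"

# with = FALSE: j is treated as a vector of column names (selection only)
sel1 <- toy[, col, with = FALSE]

# The .. prefix looks `col` up in the calling scope (selection only)
sel2 <- toy[, ..col]

# quote + eval style: build the call mean(X) from the character name,
# then evaluate that unevaluated expression inside j
expr <- substitute(mean(v), list(v = as.name(col)))
avg <- toy[, eval(expr)]
```

Both selections return a one-column data.table, whereas the eval() call returns the mean of X as a plain number.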
So I only compared the method using the actual column names (which I could not use in real practice), the one using get and the one using with = FALSE.
Interestingly, the latter, i.e. the official, recommended one, is the only one that does not work at all.
And get, while it works, is for some reason far slower than using the actual column names, which I really don't get (no pun intended).
So I guess I am doing something wrong...
Incidentally, but importantly, I turned to data.table because I needed to compute a grouped mean of a fairly large dataset, and my previous attempts using aggregate, by or tapply were either too slow or too memory-hungry, and they crashed R.
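For clarity, this is roughly what the base R attempts looked like, sketched on a toy data frame (toy_df is made up; the column names mirror the simulated data below).

```r
# Toy data frame with the same column names as the simulated dataset
toy_df <- data.frame(R = c(1, 1, 2, 2), C = c(1, 1, 1, 2), V = c(5, 7, 6, 8))

# aggregate(): one output row per observed (R, C) combination
agg <- aggregate(V ~ R + C, data = toy_df, FUN = mean)

# tapply(): a dense R-by-C matrix, with NA where a combination is absent
tab <- tapply(toy_df$V, list(toy_df$R, toy_df$C), mean)
```

Note that tapply builds the full grid of combinations. With 758145 possible row values and 450 possible column values, that is on the order of 340 million cells, which may be part of why it was so memory-hungry.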
I cannot disclose the actual data I am working with, so I made a simulated dataset of the same size here:
library(data.table)
# Column names are held in character variables, as they would be for user-defined data
row.var <- "R"
col.var <- "C"
value.var <- "V"
set.seed(934293)
d <- setNames(data.frame(sample(1:758145, 7582953, replace = TRUE),
                         sample(1:450, 7582953, replace = TRUE),
                         runif(7582953, 5, 9)),
              c(row.var, col.var, value.var))
DT <- as.data.table(d)
# Method 1: actual column names
print(system.time({
m <- DT[, mean(V), by = .(R, C)]
}))
# user system elapsed
# 1.64 0.27 0.51
rm(m)
print(system.time({
m <- DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))]
}))
# user system elapsed
# 16.05 0.02 14.97
rm(m)
print(system.time({
m <- DT[, mean(value.var), by = .(row.var, col.var), with = FALSE]
}))
# Error in h(simpleError(msg, call)) :
#   error in evaluating the argument 'x' in selecting a method for function 'print': missing value where TRUE/FALSE needed
# In addition: Warning message:
# In mean.default(value.var) :
#
# Timing stopped at: 0 0 0
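For comparison, here is a sketch of one more variant on a toy table: the grouping names are passed to by as a character vector, and .SD is restricted to the value column with .SDcols. This is an assumption about what might be the intended idiom; it has not been timed on the simulated dataset above, and toy and m_toy are hypothetical names.

```r
library(data.table)

row.var <- "R"
col.var <- "C"
value.var <- "V"

# Toy table standing in for the large simulated one
toy <- data.table(R = c(1, 1, 2, 2), C = c(1, 1, 1, 2), V = c(5, 7, 6, 8))

# by takes a character vector of column names;
# .SDcols restricts .SD to the column named in value.var
m_toy <- toy[, lapply(.SD, mean), by = c(row.var, col.var), .SDcols = value.var]
```

The result keeps the original column names (R, C, V), with one row per observed group.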
Any ideas?