
I noticed that get(x) does not work inside an R data.table when x is also the name of a column in the same data.table. See the code snippet below. This is hard to avoid completely when writing an R function that takes the data.table as an input. Is this a bug in the R data.table package? Thanks!

library(data.table)

dt = data.table(x=1:3, y=2:4)

var = 'y'
x = 'y'

dt[, 3*get(var)]      # [1] 6 9 12
dt[, 3*get(x)]        # Error in get(x): invalid first argument
user9210742
  • Is anyone else confused that this evaluates at all: `dt[, 3*get(var)]`? `get(var) -> "y"`; `3 * "y"`? For instance, `dt[,3*"y"]` gives an error... – CPak Jan 12 '18 at 22:02
  • Seems to me a bug in the `data.table` implementation. It's very clear that `x` refers to an object, and it should take precedence over the column names. – MKR Jan 12 '18 at 22:52
  • Seems like a dupe of https://stackoverflow.com/questions/21658893/subsetting-data-table-using-variables-with-same-name-as-column – Rich Scriven Feb 02 '18 at 23:09
  • @RichScriven I'm not sure if this is a true duplicate, in that question the reference is evaluated in `i`, whereas this question refers to evaluation in `j`. I'm not sure, but I get the impression from reading deeper into the `data.table` documentation that the behavior could be entirely different. – Matt Summersgill Feb 02 '18 at 23:35

3 Answers


In general, when there is a naming conflict between columns and variables, columns take precedence. Since v1.10.2 (31 Jan 2017) of data.table, the preferred way to indicate that a name is not a column name is to use the .. prefix [1]:

When j is a symbol prefixed with .. it will be looked up in calling scope and its value taken to be column names or numbers. When you see the .. prefix think one-level-up, like the directory .. in all operating systems means the parent directory. In future the .. prefix could be made to work on all symbols appearing anywhere inside DT[...]. ...

Our main focus here which we believe .. achieves is to resolve the more common ambiguity when var is in calling scope and var is a column name too. Further, we have not forgotten that in the past we recommended prefixing the variable in calling scope with .. yourself. If you did that and ..var exists in calling scope, that still works, provided neither var exists in calling scope nor ..var exists as a column name. Please now remove the .. prefix on ..var in calling scope to tidy this up. In future data.table will start to warn/error on such usage.

In your case, you can use get(..x) to force the name x to be resolved in the calling scope rather than within the data.table environment:

library(data.table)

dt = data.table(x=1:3, y=2:4)

var = 'y'
x = 'y'

dt[, 3*get(var)]      # [1] 6 9 12
dt[, 3*get(x)]        # Error in get(x): invalid first argument
dt[, 3*get(..x)]      # [1]  6  9 12
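
For comparison, the column-selection form that the quoted NEWS entry describes (j being a single symbol prefixed with ..) might look like the sketch below. The variable name cols is only an illustration, and the sketch assumes no variable literally named ..cols or ..x exists in the calling scope.

```r
library(data.table)

dt <- data.table(x = 1:3, y = 2:4)

# `cols` lives in the calling scope and is not a column of dt
cols <- "y"
res <- dt[, ..cols]   # selects column y, returned as a one-column data.table

# `x` is both a column of dt and a variable in the calling scope;
# the .. prefix picks the variable (value "y"), not the column
x <- "y"
res_x <- dt[, ..x]
```

Both calls return a data.table rather than a plain vector, which is the main practical difference from the get(..x) form above.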

The .. prefix is still somewhat experimental and thus has limited documentation, but it is mentioned briefly on the help page for data.table:

By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables. In case of overlapping variables names inside dataset and in parent scope you can use double dot prefix ..cols to explicitly refer to 'cols variable parent scope and not from your dataset.

This is less a bug than an unfortunate but natural consequence of with = TRUE, which lets columns be used as variables inside j. Indeed, you could also avoid the issue in a more base R way by using the pos or envir argument of get().
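
A rough sketch of what that base R route could look like, under the assumption that the lookup environment is captured in the calling scope first. The name caller is hypothetical, and it must not itself clash with a column of dt. The first option sidesteps j altogether and is probably the simplest when only one column is needed.

```r
library(data.table)

dt <- data.table(x = 1:3, y = 2:4)
x <- "y"

# Option 1: skip j entirely; dt[[x]] evaluates x in the calling scope
res1 <- 3 * dt[[x]]   # 6 9 12

# Option 2: tell the inner get() which environment to search.
# `caller` is a hypothetical name; it must not itself be a column of dt.
caller <- environment()
res2 <- dt[, 3 * get(get("x", envir = caller))]
```

In option 2 the inner get() resolves the variable x to "y" in the captured environment, and the outer get() then finds the column y inside j as usual.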

Bob

New Answer

Based on advice from @Frank and this section of the vignette I can't believe I hadn't read before, here's a solution to this problem that doesn't allow arbitrary code to be executed.

library(data.table)
dt = data.table(x=1:3, y=2:4)

x = "y"
ExecuteMeLater = substitute(3*x, list(x=as.symbol(x)))
dt[, eval(ExecuteMeLater)]

# [1]  6  9 12

This behavior in particular is why I prefer this solution:

x = "(system(paste0('kill ',Sys.getpid())))"
ExecuteMeLater = substitute(3*x, list(x=as.symbol(x)))
dt[, eval(ExecuteMeLater)]

#Error in eval(jsub, SDenv, parent.frame()) : 
#  object '(system(paste0('kill ',Sys.getpid())))' not found
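
Since the question is about functions that receive the data.table as an argument, the same substitute() pattern could be wrapped up roughly as follows. The helper name triple_column and the argument name col are made up for illustration, and the up-front name check is an extra assumption rather than part of the approach above.

```r
library(data.table)

# Hypothetical helper: multiplies one column of dt by 3, with the column
# chosen by name at run time.
triple_column <- function(dt, col) {
  # Fail early with a clear message if col is not a single existing column name
  stopifnot(is.character(col), length(col) == 1L, col %in% names(dt))
  # Build the call 3 * <col> before data.table ever sees it
  expr <- substitute(3 * v, list(v = as.symbol(col)))
  dt[, eval(expr)]
}

dt <- data.table(x = 1:3, y = 2:4)
x <- "y"
triple_column(dt, x)   # [1]  6  9 12
```

Because the column name is turned into a symbol instead of being parsed as code, a malicious string still ends up as an "object not found" error at worst, and here it is rejected even earlier by the name check.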

Original Answer

Note: came across what looks like a really useful resource for questions of this nature... might be able to update with a less hacky solution at some point.

The get() behavior certainly leaves the door open for unexpected outcomes, and it appears this has been brought up in more than a few GitHub issues in the past. To be honest, I've done a decent amount of investigation but I'm still not sure exactly what the proper usage would be.

One way to work around it is to paste the expression together outside of the data.table environment, using the column names passed to your function, and store it as a character string.

Then, by parsing and evaluating the pre-constructed expression in the data.table environment, we avoid any opportunity for a column named x within the table to take precedence over the contents of the variable x.

library(data.table)

dt = data.table(x=1:3, y=2:4)

x = 'y'
ExecuteMeLater <- paste0("3*",x)  ## "3*y"
dt[, eval(parse(text = ExecuteMeLater))]

Output:

[1]  6  9 12

Not the prettiest solution, but it's worked for me numerous times in the past.

Quick disclaimer on hypothetical doomsday scenarios possible with eval(parse(...))

There are far more in-depth discussions of the dangers of eval(parse(...)), so I'll avoid repeating them in full.

Theoretically you could have issues if one of your columns is named something unfortunate like "(system(paste0('kill ',Sys.getpid())))" (Do not execute that, it will kill your R session on the spot). This is probably enough of an outside chance to not lose sleep over it unless you plan on putting this in a package on CRAN.

Matt Summersgill
  • Following up from the linked question, I'd do `expr = substitute(3*x, list(x=as.symbol(x))); dt[, eval(expr)]`. That's generally how I handle expressions in `j` -- compose them first, then use `eval`. This is also the only reliable way I know of to take advantage of GForce when `j` is composed programmatically https://stackoverflow.com/a/41619112/ – Frank Feb 15 '18 at 19:55
  • Thanks so much! I had a feeling there was a "safe" way to do this but found myself going in circles reading documentation on `eval`, `parse`, `get`, etc. without making any real headway. Will update my answer on the other question with this! – Matt Summersgill Feb 15 '18 at 20:13
  • Upvoted because even though this is no longer the idiomatic solution, you at least made an effort to protect from code injection – Bob Nov 15 '19 at 05:47

This is from the R documentation for the first argument in the get function: "an object name (given as a character string)."

So dt[, 3*get("x")] should work.

LucyMLi
  • I think the point is the OP has defined `x = 'y'` just as he defined `var = 'y'`. He expects the same result as with `dt[, 3*get(var)]` but this suggestion gives a different answer. – MrFlick Jan 12 '18 at 21:52