I have gotten in the habit of accessing data.table columns in j even when I do not need to:

require(data.table)
set.seed(1); n = 10
DT <- data.table(x=rnorm(n),y=rnorm(n))

frm <- formula(x~y)

DT[,lm(x~y)]         # 1 works
DT[,lm(frm)]         # 2 fails
lm(frm,data=DT)      # 3 what I'll do instead

I expected # 2 to work, since lm should search for variables in DT and then in the global environment. Is there an elegant way to get something like # 2 to work?

In this case, I'm using lm, which takes a "data" argument, so # 3 works just fine.

EDIT. Note that this works:

x1 <- DT$x
y1 <- DT$y
frm1 <- formula(x1~y1)
lm(frm1)

and this, too:

rm(x1,y1)
bah <- function(){
    x1 <- DT$x
    y1 <- DT$y
    frm1 <- formula(x1~y1)
    lm(frm1)
}
bah()

EDIT2. However, this fails, illustrating @eddi's answer:

frm1 <- formula(x1~y1)
bah1 <- function(){
    x1 <- DT$x
    y1 <- DT$y
    lm(frm1)
}
bah1()
Frank
  • try `DT[, lm(frm, .SD)]` ([see here](http://stackoverflow.com/questions/16232138/r-creating-models-on-subsets-with-data-table-inside-a-function/16232785#16232785)). – Arun Oct 11 '13 at 06:52
  • Thanks, @Arun. That clued me in to the search I should have done before posting. Looks like this has been answered a couple other times besides: http://stackoverflow.com/questions/14784048/create-a-formula-in-a-data-table-environment-in-r http://stackoverflow.com/questions/19001792/r-using-glm-inside-a-data-table I guess I should mark this as a duplicate or delete it..I'll decide when I wake up. I was hoping for some way to alter the formula so that `lm` looked in the right place (instead of using the data= argument, which seems clumsy *inside the data.table*), but it looks like there is none. – Frank Oct 11 '13 at 07:00
  • Not sure whether this fits the bill in your real use case, but the following does work `m <- quote(lm(x~y)); DT[,eval(m)]`. – Josh O'Brien Oct 11 '13 at 07:02
  • it fails because x and y are not used in your `j`, and because normally `data.table` only constructs those columns that it can detect are being used (and because it can only be that smart before becoming omnipotent;)) - I haven't tried it yet, but I think if you even do `{x; y; lm(frm)}` it'll work. Maybe there is an FR here to use `.SDcols` to indicate columns used even when there is no `.SD` – eddi Oct 11 '13 at 12:50
  • @eddi Well, they're sort of in my `j`, in the sense that my `j` goes looking for them...I tried what you suggested and also `DT[,{x <- x;y <- y;lm(frm)}]`, but no dice. Yeah, it would be nice to have the option of using .SDcols without .SD. – Frank Oct 11 '13 at 14:36
  • `j` going looking for them is a very abstract statement, one that requires an AI-level understanding of what your `j` does. I'm a bit surprised that explicitly using `x` and `y` didn't do it, I must be misunderstanding the underlying issue. Re FR: please add it. – eddi Oct 11 '13 at 14:52
  • ah, I see now, this is an `lm`-specific issue - it's looking for those variables in the environment of the formula, so adding the variables to the `j`-environment (by using them somehow) doesn't help, as the formula environment is `.GlobalEnv` here. – eddi Oct 11 '13 at 15:07
  • @eddi Yeah, I'm inclined to blame `lm` for this somehow, but I'm not sure exactly where it's falling short. I've edited in a couple examples above. – Frank Oct 11 '13 at 15:13

1 Answer

lm looks up the variables used in a formula in the environment of that formula (when they are not supplied through the data argument). Since you create your formula in the global environment, lm is not going to look in the j-expression environment, so the only way to make the exact expression lm(frm) work is to add the appropriate variables to the formula's environment:

DT[, {assign('x', x, environment(frm));
      assign('y', y, environment(frm));
      lm(frm)}]

Obviously this is not a very good solution; both Arun's and Josh's suggestions are much better. I'm putting it here only to help explain the problem at hand.
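To see the lookup rule on its own, here is a minimal base-R sketch without data.table. The names make_formula, fit_in_caller, frm_inside and frm_outside are made up for illustration, and the numbers are arbitrary. It relies only on the documented behaviour of model.frame: variables are taken from data first, and otherwise from environment(formula).

```r
# A formula remembers the environment it was created in.
make_formula <- function() {
  x1 <- c(1, 2, 3, 4, 5)
  y1 <- c(2.1, 3.9, 6.2, 7.8, 10.1)
  x1 ~ y1                      # environment of this formula is this function's frame
}
frm_inside <- make_formula()

# Works: lm() finds x1 and y1 through environment(frm_inside),
# even though neither variable exists where lm() is called.
fit_inside <- lm(frm_inside)

# The reverse case, analogous to bah1() in the question:
# the formula is created out here, the data only exist inside the caller.
frm_outside <- x2 ~ y2
fit_in_caller <- function() {
  x2 <- c(1, 2, 3, 4, 5)
  y2 <- c(2.1, 3.9, 6.2, 7.8, 10.1)
  lm(frm_outside)              # x2, y2 are local here, not in environment(frm_outside)
}
res <- try(fit_in_caller(), silent = TRUE)
failed <- inherits(res, "try-error")

# Passing data= sidesteps the environment entirely.
df <- data.frame(x2 = c(1, 2, 3, 4, 5), y2 = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit_data <- lm(frm_outside, data = df)
```

The j-expression case is the second pattern: frm was created at top level, so lm never consults the environment in which data.table evaluates j.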

Edit: Another (possibly more perverse, and quite fragile) way is to change the environment of the formula itself (I do it permanently here, but you could revert it afterwards, or copy the formula and modify the copy):

DT[, {setattr(frm, '.Environment', get('SDenv', parent.frame(2))); lm(frm)}]

Btw, a funny thing is happening here: whenever you use get in a j-expression, all of the columns get constructed (so don't use it if you can avoid it), which is why I don't need to also use x and y in some way for data.table to know that those variables are needed.
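The "copy it and then modify the copy" variant can be sketched in plain R, outside data.table. This is only an illustration under assumptions, not the code above: fit_in_frame is a hypothetical helper, and it uses base R's `environment<-` instead of setattr, since that replacement function works on a copy and leaves the original formula untouched.

```r
frm <- x ~ y
orig_env <- environment(frm)   # remember where the formula was created

fit_in_frame <- function(f) {
  # Stand-ins for the columns that would be visible inside j.
  x <- c(1, 2, 3, 4, 5)
  y <- c(1.2, 1.9, 3.2, 3.8, 5.1)
  f_local <- f                           # copy of the formula
  environment(f_local) <- environment()  # point the copy at this frame
  lm(f_local)                            # x and y are now found here
}

fit <- fit_in_frame(frm)

# The original formula still points at its original environment.
unchanged <- identical(environment(frm), orig_env)
```

Because only the copy is modified, nothing needs to be reverted afterwards.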

eddi
  • Not a solution anyone would choose to use, but pretty much just what I was looking for. Thanks! – Frank Oct 11 '13 at 15:28
  • One example of a function that works, while the analogous j evaluation does not: `expr <- expression(x1~y1); bah2 <- function(){ x1 <- DT$x; y1 <- DT$y; lm(eval(expr)); }; bah2(); DT[,lm(eval(expr))]` – Frank Oct 11 '13 at 15:33
  • @Frank if you download the latest version, this will work: `expr <- quote(x ~ y); DT[, lm(eval(expr))]` - see this: http://stackoverflow.com/questions/15913832/eval-and-quote-in-data-table – eddi Oct 11 '13 at 15:39