7

I'm wanting to run through a long vector of potential explanatory variables, regressing a response variable on each in turn. Rather than paste together the model formula, I'm thinking of using reformulate(), as demonstrated here.

The function fun() below seems to do the job, fitting the desired model. Notice, though, that it records in its call element the name of the constructed formula object rather than its value.

## (1) Function using programmatically constructed formula
fun <- function(XX) {
    ff <- reformulate(response="mpg", termlabels=XX)
    lm(ff, data=mtcars)
}
fun(XX=c("cyl", "disp"))
# 
# Call:
# lm(formula = ff, data = mtcars)                 <<<--- Note recorded call
# 
# Coefficients:
# (Intercept)          cyl         disp  
#    34.66099     -1.58728     -0.02058  

## (2) Result of directly specified formula (just for purposes of comparison)
lm(mpg ~ cyl + disp, data=mtcars)
# 
# Call:
# lm(formula = mpg ~ cyl + disp, data = mtcars)   <<<--- Note recorded call
# 
# Coefficients:
# (Intercept)          cyl         disp  
#    34.66099     -1.58728     -0.02058  

My question: Is there any danger in this? Can this become a problem if, for instance, I want to later apply update, or predict or some other function to the model fit object, (possibly from some other environment)?

A slightly more awkward alternative that does, nevertheless, get the recorded call right is to use eval(substitute()). Is this in any way a generally safer construct?

fun2 <- function(XX) {
    ff <- reformulate(response="mpg", termlabels=XX)
    eval(substitute(lm(FF, data=mtcars), list(FF=ff)))
}
fun2(XX=c("cyl", "disp"))$call
## lm(formula = mpg ~ cyl + disp, data = mtcars)
Cœur
  • 37,241
  • 25
  • 195
  • 267
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
  • 2
    I think the issue will being careful using `update` see [http://stackoverflow.com/questions/13690184/update-inside-a-function-only-searches-the-global-environment]. You can usually find out these pitfalls by trying to use `data.table` -- see [http://stackoverflow.com/questions/15096811/why-is-using-update-on-a-lm-inside-a-grouped-data-table-losing-its-model-data/15376891#15376891] – mnel Jun 30 '13 at 23:29
  • 3
    This works: `do.call("lm", list(ff, quote(mtcars)))` – G. Grothendieck Jun 30 '13 at 23:49
  • @G.Grothendieck -- Thanks. That looks good, and seems like it shouldn't pose any problems down the road. (Also, thanks mnel for the interesting links.) – Josh O'Brien Jun 30 '13 at 23:53
  • It is also worth looking how `add1.lm` works (using `. ~ .+` and `update.formula` and ti to remember (or notice) that reformulate is just a wrapper for the paste method. – mnel Jul 01 '13 at 00:23

1 Answers1

2

I'm always hesitant to claim there are no situations in which something involving R environments and scoping might bite, but ... after some more exploration, my first usage above does look safe.

It turns out that the printed call is a bit of red herring.

The formula that actually gets used by other functions (and the one extracted by formula() and as.formula()) is the one stored in the terms element of the fit object, and it gets the actual formula right. (The terms element contains an object of class "terms", which is just a "formula" with a bunch of attached attributes.)

To see that all of the proposals in my question and the associated comments store the same "formula" object (up to the associated environment), run the following.

## First the three approaches in my post
formula(fun(XX=c("cyl", "disp")))
# mpg ~ cyl + disp
# <environment: 0x026d2b7c>

formula(lm(mpg ~ cyl + disp, data=mtcars))
# mpg ~ cyl + disp

formula(fun2(XX=c("cyl", "disp"))$call)
# mpg ~ cyl + disp
# <environment: 0x02c4ce2c>

## Then Gabor Grothendieck's idea
XX = c("cyl", "disp")
ff <- reformulate(response="mpg", termlabels=XX)
formula(do.call("lm", list(ff, quote(mtcars))))  
## mpg ~ cyl + disp

To confirm that formula() really is deriving its output from the terms element of the fit object, have a look at stats:::formula.lm and stats:::formula.terms.

Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
  • Nice exploration. I've used the `do.call` method previously because of some of these same questions. It's nice to know these functions don't depend on the stored formula/call. However, I do believe that `update` does, so watch out for that usage. – Aaron left Stack Overflow Jul 04 '13 at 01:21
  • @Aaron -- Thanks. Are you referring to the same issue as mnel did in the first comment following my question? If so, am I right in understanding that that pitfall would apply equally to *any* of the calls above? (i.e. using `do.call()` to launch the call to `lm()` doesn't help you avoid `update()`'s scoping issues, right?) – Josh O'Brien Jul 04 '13 at 13:39
  • 1
    After further exploration, looks like `update` is safe too. I was sure it wasn't though, but maybe I've been wrong all this time. However, there are functions that do depend on the function being correct, for example, `anova.lme`; see http://stackoverflow.com/q/7666807/210673. – Aaron left Stack Overflow Jul 05 '13 at 04:14
  • @Aaron -- Nice catch. By pointing me back to that question, I *think* you helped me locate and fix the bit of **nlme**'s code that caused the OP's call to `anova()` to fail. ([See my comment here](http://stackoverflow.com/questions/7666807/anova-test-fails-on-lme-fits-created-with-pasted-formula/7667742#7667742) for more details.) – Josh O'Brien Jul 05 '13 at 21:50