I have some questions about substituting names in expressions with strings in a consistent way across different functions. Starting from the data frame

sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
  • In lm, I can use different commands to substitute a regressor by a string in the formula

    lm(a~get("b"),sample_df) # substituting a part of a formula
    lm(a~eval(as.name("b")),sample_df) # substituting a part of a formula
    lm(substitute(a~v,list(v=as.name("b"))),sample_df) # substituting the whole formula
    lm(eval(substitute(a~v,list(v=as.name("b"))),sample_df)) # substituting the whole formula
    eval(substitute(lm(a~v,sample_df),list(v=as.name("b")))) # substituting the whole call
    

    What are the differences between all these commands? I can see that the first two commands give a regressor named, respectively, get("b") and eval(as.name("b")), while the others give b. Are there other (maybe more subtle or problematic) differences? Why does eval make no difference between commands 3 and 4?

  • In data.table, everything works as it does in lm

    library(data.table)
    sample_dt=as.data.table(sample_df)
    sample_dt[,mean:=mean(get("b"))]
    sample_dt[,eval(substitute(mean:=mean(v),list(v=as.name("b"))))]
    eval(substitute( sample_dt[,mean:=mean(v)],list(v=as.name("b"))))
    
  • Now, trying to substitute a name by a string in dplyr

    library(dplyr)
    sample_df %>% mutate(mean=mean(get("b")))
    eval(substitute(sample_df %>% mutate(mean=mean(v)),list(v=as.name("b"))))
    

    The first looks for an object in the global environment, while the second works. How could I have predicted that get would not work here when it works in lm and [.data.table?

Matthew
  • Why not `lm(a~b,sample_df)`? That's what's suggested on the help page. – Carl Witthoft Sep 14 '14 at 16:14
  • ahah. It's always the same problem when coming up with the simplest example. I really want to substitute by a string - let's say I want to loop on different regressors using their names. – Matthew Sep 14 '14 at 16:23
  • So you are going to do something like `for (x in c("b","c")) lm(a~as.name(x),data.frame)`? Plus you want a "nice" output name for use in things like `predict.lm`? – Carl Witthoft Sep 14 '14 at 16:32
  • Indeed, I would like exactly the same output whether I use b directly or "b". Let's say I want to rewrite things I often do as functions that take a data frame and a variable name as arguments. – Matthew Sep 14 '14 at 16:39

1 Answer

You are setting up your test cases incorrectly for the purpose that was described. You want to pass in various values with a variable that contains the character value:

sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
x <- "b"
lm(a~get(x),sample_df) # succeeds
lm(a~eval(as.name(x)),sample_df)  # also succeeds

The more typical way of doing this is to use as.formula outside the lm() call:

form <- as.formula(paste("a ~", x))
form
#a ~ b
lm(form,sample_df)
predict(lm(form,sample_df))
# 1 2 3 4 5 
# 1 2 3 4 5 
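Since the comments mention looping over regressors by name, here is a minimal sketch of how the same as.formula(paste(...)) pattern could be applied to several names at once. The names regressors, fits and coef_names are hypothetical and only there for illustration; this is one way to do it, not the only one.

```r
# Same data frame as in the question.
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# Hypothetical vector of regressor names held as character strings.
regressors <- c("b", "c")

# Build each formula from its string before calling lm(),
# so the model's terms refer to the plain variable name.
fits <- lapply(regressors, function(x) {
  form <- as.formula(paste("a ~", x))
  lm(form, data = sample_df)
})
names(fits) <- regressors

# Coefficient names for each fit, e.g. "(Intercept)" and "b".
coef_names <- lapply(fits, function(fit) names(coef(fit)))
coef_names
```

Because each formula is complete before lm() sees it, the coefficient names should come out as the bare variable names rather than as get(...) or eval(...) expressions.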

The advantage of doing this outside the lm() function is that the substitutions are completed before the call is recorded by the lm processing facilities. Compare the output of:

terms(lm(form,sample_df))
terms( lm(a~eval(as.name(x)),sample_df))

It will take a lot of gymnastic "computing on the language" to get back to quote(b) from that second example, whereas it is really easy to get the RHS from the terms() object if a formula object was passed in:

> terms(lm(form,sample_df))[[3]]
b
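To make the comparison concrete, the sketch below pulls the term labels out of both fits. The variable names fit_formula, fit_eval, labels_formula, labels_eval and rhs are made up for this illustration.

```r
# Same setup as above.
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
x <- "b"
form <- as.formula(paste("a ~", x))

# Fit once with a pre-built formula and once with eval() inside the formula.
fit_formula <- lm(form, sample_df)
fit_eval <- lm(a ~ eval(as.name(x)), sample_df)

# The term labels show what each model thinks its regressor is called.
labels_formula <- attr(terms(fit_formula), "term.labels")
labels_eval <- attr(terms(fit_eval), "term.labels")

# The right-hand side of the pre-built formula is just the symbol b.
rhs <- terms(fit_formula)[[3]]

labels_formula
labels_eval
rhs
```

The first label is the plain variable name, while the second is the deparsed eval(...) call, which is why recovering b from the second fit needs extra work.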
IRTFM
  • I see your point. But what if I want to substitute an expression that is *not* a formula, like in data.table or dplyr? Is deparse really the way to go? – Matthew Sep 14 '14 at 18:31
  • First thing to do would be to get your terminology in a form that conforms to R usage so we can have an unambiguous discussion. At the moment I cannot tell what you are proposing. In R an expression is not a formula, and a formula is not an expression. – IRTFM Sep 14 '14 at 18:39
  • Then I'm asking how to use your method in a consistent way across formula and expressions - in my example `lm(v2~v1,DT)` and `DT[,mean:=mean(v1)]` – Matthew Sep 14 '14 at 18:47
  • The question seems impossibly unfocussed. I cannot figure out which sort of evaluation environment you are examining. Does this question/answer address the dplyr portion of your uncertainty: http://stackoverflow.com/questions/22005419/dplyr-without-hard-coding-the-variable-names ? – IRTFM Sep 14 '14 at 19:05
  • This is exactly what I do in `dplyr` (second command). This solution works across all examples in `lm` `data.table` and `dplyr`. – Matthew Sep 14 '14 at 19:07
  • The second answer (not preferred by @hadley) showed the use of setting an environment in `get` which was your first dplyr request. – IRTFM Sep 14 '14 at 19:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/61217/discussion-between-matthew-and-bondeddust). – Matthew Sep 14 '14 at 19:15