
In the minimal example below, I am trying to use the values of a character string vars in a regression formula. However, I am only able to pass the string of variable names ("v2+v3+v4") to the formula, not the real meaning of this string (e.g., "v2" is dat$v2).

I know there are better ways to run the regression (e.g., lm(v1 ~ v2 + v3 + v4, data=dat)). My situation is more complex, and I am trying to figure out how to use a character string in a formula. Any thoughts?

Updated code below:

# minimal example 
# create data frame
v1 <- rnorm(10)
v2 <- sample(c(0,1), 10, replace=TRUE)
v3 <- rnorm(10)
v4 <- rnorm(10)
dat <- cbind(v1, v2, v3, v4)
dat <- as.data.frame(dat)

# create objects of column names
c.2 <- colnames(dat)[2]
c.3 <- colnames(dat)[3]
c.4 <- colnames(dat)[4]

# shortcut to get to the type of object my full code produces
vars <- paste(c.2, c.3, c.4, sep="+")

### TRYING TO SOLVE FROM THIS POINT:
print(vars)
# [1] "v2+v3+v4"

# use vars in regression
regression <- paste0("v1", " ~ ", vars)
m1 <- lm(as.formula(regression), data=dat)

Update: @Arun was correct about the missing "" on v1 in the first example. This fixed my example, but I was still having problems with my real code. In the code chunk below, I adapted my example to better reflect my actual code. I chose to create a simpler example at first thinking that the problem was the string vars.

Here's an example that does not work :) Uses the same data frame dat created above.

dv <- colnames(dat)[1]
r2 <- colnames(dat)[2]
# the following loop creates objects r3, r4, r5, and r6
# r5 and r6 are interaction terms
for (v in 3:4) {
  r <- colnames(dat)[v]
  assign(paste("r",v,sep=""),r)
  r <- paste(colnames(dat)[2], colnames(dat)[v], sep="*")
  assign(paste("r",v+2,sep=""),r)
}

# combine r3, r4, r5, and r6 then collapse and remove trailing +
vars2 <- sapply(3:6, function(i) { 
                paste0("r", i, "+")
                })
vars2 <- paste(vars2, collapse = '')
vars2 <- substr(vars2, 1, nchar(vars2)-1)

# concatenate dv, r2 (as a factor), and vars into `eq`
eq <- paste0(dv, " ~ factor(",r2,") +", vars2)

Here is the issue:

print(eq)
# [1] "v1 ~ factor(v2) +r3+r4+r5+r6"

Unlike regression in the first example, eq does not bring in the column names (e.g., v3). The object names (e.g., r3) are retained. As such, the following lm() command does not work.

m2 <- lm(as.formula(eq), data=dat)
Eric Green
  • I think you mean: `paste0("v1", " ~ ", vars)`. – Arun Jun 10 '13 at 13:12
  • thanks, @Arun. You are right. Now my example runs, but I am still getting an error on my real script. Things look to be the same, but I know I must be making an error. I will keep checking and post again if I can figure out the difference. – Eric Green Jun 10 '13 at 13:32
  • In the example you posted, `regressors` is not found. It'd be nice if you could edit your post to provide an example which gives the error now. – Arun Jun 10 '13 at 13:39
  • `lm(v1 ~ factor(v2) +r3+r4+r5+r6, data=dat)` doesn't work either. I think you meant to paste together the contents of `r3` etc rather than the names. – Aaron left Stack Overflow Jun 10 '13 at 14:17
  • @Aaron: Yeah, I was not able to figure out how to get `eq` to return the contents of `r3` (i.e., `v3`) etc. `eq` just gives the string `r3+r4+r5+r6`. – Eric Green Jun 10 '13 at 15:11
  • To do that, you'd need `get`. See edit. (Though note that other methods are usually preferred to `assign` and `get`.) – Aaron left Stack Overflow Jun 10 '13 at 20:23
  • Fantastic, @Aaron. Thanks for the lapply tip. I am slowly giving up loops in favor of apply. Slowly. – Eric Green Jun 10 '13 at 20:48

2 Answers


I see a couple issues going on here. First, and I don't think this is causing any trouble, but let's make your data frame in one step so you don't have v1 through v4 floating around both in the global environment as well as in the data frame. Second, let's just make v2 a factor here so that we won't have to deal with making it a factor later.

dat <- data.frame(v1 = rnorm(10),
                  v2 = factor(sample(c(0,1), 10, replace=TRUE)),
                  v3 = rnorm(10),
                  v4 = rnorm(10) )

Part One: Now, for your first part, it looks like this is what you want:

lm(v1 ~ v2 + v3 + v4, data=dat)

Here's a simpler way to do that, though you still have to specify the response variable.

lm(v1 ~ ., data=dat)

Alternatively, you certainly can build up the formula as a string with paste and call lm on it.

f <- paste(names(dat)[1], "~", paste(names(dat)[-1], collapse=" + "))
# "v1 ~ v2 + v3 + v4"
lm(f, data=dat)

However, my preference in these situations is to use do.call, which evaluates expressions before passing them to the function; this makes the resulting object more suitable for calling functions like update on. Compare the call part of the output.

do.call("lm", list(as.formula(f), data=as.name("dat")))
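
To make that comparison concrete, here is a minimal sketch that fits the same model both ways and inspects the stored call. The fixed seed, the deterministic v2, and the names m_direct, m_docall, and m_small are my own choices for illustration, not part of the question.

```r
# Illustrative data; v2 is built deterministically so both levels always occur
set.seed(1)
dat <- data.frame(v1 = rnorm(10),
                  v2 = factor(rep(c(0, 1), 5)),
                  v3 = rnorm(10),
                  v4 = rnorm(10))

f <- paste(names(dat)[1], "~", paste(names(dat)[-1], collapse=" + "))

m_direct <- lm(f, data=dat)
m_docall <- do.call("lm", list(as.formula(f), data=as.name("dat")))

# The direct fit stores the symbol f in its call;
# the do.call fit stores the formula itself
m_direct$call
m_docall$call

# Both fits estimate the same coefficients
all.equal(coef(m_direct), coef(m_docall))

# update() rewrites the stored formula and refits; here v4 is dropped
m_small <- update(m_docall, . ~ . - v4)
```

Because the second call carries the formula rather than a reference to f, it stays readable and reproducible even if f is later changed or removed.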

Part Two: About your second part, it looks like this is what you're going for:

lm(v1 ~ factor(v2) + v3 + v4 + v2*v3 + v2*v4, data=dat)

First, because v2 is already a factor in the data frame, we don't need the factor() call. Second, the formula can be simplified further: R's formula operators expand a product into main effects plus interactions, like this.

lm(v1 ~ v2*(v3 + v4), data=dat)
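
As a quick check that the compact form really describes the same model, the sketch below compares the term labels R derives from each formula. The names long and short are hypothetical, chosen only for this illustration; no data is needed because terms() works on the formula alone.

```r
# The spelled-out formula and the compact one
long  <- v1 ~ v2 + v3 + v4 + v2*v3 + v2*v4
short <- v1 ~ v2*(v3 + v4)

# terms() expands the operators; "term.labels" lists the resulting model terms
attr(terms(long), "term.labels")
# [1] "v2"    "v3"    "v4"    "v2:v3" "v2:v4"
attr(terms(short), "term.labels")
# [1] "v2"    "v3"    "v4"    "v2:v3" "v2:v4"
```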

I'd then simply build the formula string using paste; the loop with assign, even in the larger case, is probably not a good idea.

f <- paste(names(dat)[1], "~", names(dat)[2], "* (", 
           paste(names(dat)[-c(1:2)], collapse=" + "), ")")
# "v1 ~ v2 * ( v3 + v4 )"

It can then be called using either lm directly or with do.call.

lm(f, data=dat)
do.call("lm", list(as.formula(f), data=as.name("dat")))

About your code: The problem you had with trying to use r3 etc. was that you wanted the contents of the object r3 (the string "v3"), not its name. To get the contents from the name, you need get, like this, and then you'd collapse the values together with paste.

vars <- sapply(paste0("r", 3:6), get)
paste(vars, collapse=" + ")
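
For completeness, here is a self-contained sketch of that fix applied to the question's eq. It assigns r3 through r6 directly (with the values the question's loop would produce) and uses mget, the vectorised relative of get; the explicit envir argument and the name obj_names are my additions for illustration.

```r
# Illustrative data; v2 is built deterministically so both levels always occur
set.seed(1)
dat <- data.frame(v1 = rnorm(10),
                  v2 = factor(rep(c(0, 1), 5)),
                  v3 = rnorm(10),
                  v4 = rnorm(10))

# The objects the loop in the question creates
r3 <- "v3"
r4 <- "v4"
r5 <- "v2*v3"
r6 <- "v2*v4"

# These are only the object names: "r3" "r4" "r5" "r6"
obj_names <- paste0("r", 3:6)

# mget() looks the names up and returns their contents
vars <- unlist(mget(obj_names, envir = environment()))

eq <- paste("v1 ~", paste(vars, collapse=" + "))
eq
# [1] "v1 ~ v3 + v4 + v2*v3 + v2*v4"

m2 <- lm(as.formula(eq), data=dat)
```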

However, a better way would be to avoid assign and just build a vector of the terms you want, like this.

vars <- NULL
for (v in 3:4) {
  vars <- c(vars, colnames(dat)[v], paste(colnames(dat)[2], 
                                          colnames(dat)[v], sep="*"))
}
paste(vars, collapse=" + ")

A more R-like solution would be to use lapply:

vars <- unlist(lapply(colnames(dat)[3:4], 
                      function(x) c(x, paste(colnames(dat)[2], x, sep="*"))))
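
Putting the lapply version together end to end might look like the sketch below. The fixed seed, the deterministic v2, and the name m are assumptions of mine for the sake of a reproducible illustration.

```r
# Illustrative data; v2 is built deterministically so both levels always occur
set.seed(1)
dat <- data.frame(v1 = rnorm(10),
                  v2 = factor(rep(c(0, 1), 5)),
                  v3 = rnorm(10),
                  v4 = rnorm(10))

# For each of v3 and v4, keep the column name and its interaction with v2
vars <- unlist(lapply(colnames(dat)[3:4],
                      function(x) c(x, paste(colnames(dat)[2], x, sep="*"))))
vars
# [1] "v3"    "v2*v3" "v4"    "v2*v4"

# Collapse the terms into a formula string and fit the model
f <- paste(colnames(dat)[1], "~", paste(vars, collapse=" + "))
m <- lm(as.formula(f), data=dat)
```

No assign or get is involved: the terms live in one character vector from start to finish.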
Aaron left Stack Overflow
  • Very helpful and educational. Thanks! I'm still unsure about how to get my string to work in the formula, but you've given me an approach that will get me around the problem. Thanks for taking the time to reply. You really helped to clean up my code. – Eric Green Jun 10 '13 at 15:22
  • This was very useful - it helped me understand the `lm` function better. Shame that `do.call` has such a verbose syntax. – metakermit Feb 05 '14 at 14:41
  • Meaning that the arguments need to be in a list? Perhaps, but it makes things like `do.call(rbind, lapply(foo, fun))` slick. – Aaron left Stack Overflow Feb 05 '14 at 17:04
  • Stumbled across the Q/A...`do.call` is a function I've always been looking for! Thanks! – theforestecologist Feb 29 '16 at 21:51

TL;DR: use paste.

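# assumes ctree() is loaded (e.g. from the party or partykit package)
# and that a data frame named `data` with a column "class" exists
create_ctree <- function(col, data){
    # build the formula string, e.g. "class ~ .", and convert it to a formula
    myFormula <- as.formula(paste(col, "~ ."))
    ctree(myFormula, data = data)
}
create_ctree("class", data)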
Travis Heeter