
I am trying to create a generic function to handle a data frame with multiple plausible values. What I want is to pass a formula to a function that performs a regression, such as:

f <- MRPCM ~ DSEX + IEP + ELL3 + SDRACEM + PARED

The MRPCM variable does not actually exist in the data frame. Instead, five variables exist: MRPCM1, MRPCM2, MRPCM3, MRPCM4, and MRPCM5. What I want to do is iterate over them and update the formula (f here) to create five formulas. Can this be done? The update.formula function seems to work only on the entire left or right side at a time. I should also note that in this example the variable I wish to change is the dependent variable, so update(f, MRPCM1 ~ .) works. In general, however, I will not know where the variable appears in the formula.

For example:

f <- MRPCM + DSEX ~ IEP + ELL3 + SDRACEM + PARED

update.formula(f, as.formula('MRPCM1 ~ .'))

This results in the following (note that DSEX is now missing):

MRPCM1 ~ IEP + ELL3 + SDRACEM + PARED

jbryer
  • where do the plausible variable names come from? – Chase Jul 16 '12 at 19:47
  • Also, see the examples under `?as.formula` to paste together character strings and turn those into formulas. In theory that should probably do what you want - make your question [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and we'll find out for sure. – Chase Jul 16 '12 at 19:49

1 Answer


Here's a demonstration of one approach. A more sophisticated implementation might instead update the fitted linear model (see ?update), but that goes beyond the immediate scope of your question.

## Make a reproducible example!!
df <- 
setNames(as.data.frame(matrix(rnorm(96), ncol=8)), 
         c("MRPCM1","MRPCM2","MRPCM3","DSEX","IEP", "ELL3","SDRACEM","PARED"))

## Construct a template formula
f <- MRPCM ~ DSEX + IEP + ELL3 + SDRACEM + PARED

## Workhorse function
iterlm <- function(formula, data) {
    ## Find columns in data matching pattern on left hand side of formula
    LHSpat <- deparse(formula[[2]])
    LHSvars <- grep(LHSpat, names(data), value = TRUE)
    ## Run through matched columns, repeatedly updating the formula,
    ## fitting linear model, and extracting whatever results you want.
    ## (Use the function's own arguments, not the globals f and df.)
    sapply(LHSvars, FUN=function(var) {
        uf <- update.formula(formula, as.formula(paste(var, "~ .")))
        coef(lm(uf, data))
    })
}

## Try it
iterlm(f, df)
##                  MRPCM1     MRPCM2      MRPCM3
## (Intercept)  0.71638942 -0.3883355  0.22202700
## DSEX        -0.07048994 -0.7478064  0.62590580
## IEP         -0.22716821 -0.2381982  0.12205780
## ELL3        -0.44492392  0.1720344  0.41251561
## SDRACEM      0.21629235  0.4800773  0.02866802
## PARED        0.07885683 -0.2582598 -0.07996121
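
The alternative mentioned at the top of this answer, updating the fitted model rather than the formula, might look roughly like the sketch below. The seed, the object names `fit1` and `fit2`, and the vector `lhs` are assumptions added for illustration; they are not part of the original example.

```r
## Rebuild the example data (the seed is an assumption, added only so
## that repeated runs give the same numbers)
set.seed(1)
df <- setNames(as.data.frame(matrix(rnorm(96), ncol = 8)),
               c("MRPCM1", "MRPCM2", "MRPCM3", "DSEX", "IEP",
                 "ELL3", "SDRACEM", "PARED"))

## Fit the model once for the first plausible value
fit1 <- lm(MRPCM1 ~ DSEX + IEP + ELL3 + SDRACEM + PARED, data = df)

## update() re-fits the same model with a new response; "." keeps
## the right-hand side unchanged
fit2 <- update(fit1, MRPCM2 ~ .)

## The same idea applied to every plausible value
lhs <- c("MRPCM1", "MRPCM2", "MRPCM3")
coefs <- sapply(lhs, function(v) {
    coef(update(fit1, as.formula(paste(v, "~ ."))))
})
```

Like `update.formula()`, this replaces the whole left-hand side, so it is only suitable when the plausible value is the sole response variable.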
Josh O'Brien
  • Thanks Josh, this is almost correct. It doesn't work if there is more than one variable on the same side of the formula as the variable to be replaced. See the edit to the original question above. – jbryer Jul 16 '12 at 23:41
  • @jbryer -- Well, that's become a different question now, and one I'd approach differently. I won't work on it, but here are a few further hints for you to play around with. (1) Put a `browser()` call in the first line of `iterlm()`. Then try `iterlm()` out with a formula having several variables on the LHS. (2) Try `f[[2]][[2]]`, and `f[[2]][[2]] <- as.symbol("randomString")` to see how you can modify the first variable in the formula. (3) Try an `if(length(f[[2]]) > 1) {}` construct to learn how you might write a function that handles formulae with 1 or more variables on the `LHS`. – Josh O'Brien Jul 17 '12 at 00:02
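
Building on the `f[[2]][[2]]` hint in the last comment, one possible way to replace a variable wherever it appears is to walk the formula recursively and swap the matching symbol. This is only a sketch: `swap_var()` and `expand_formula()` are hypothetical helper names, not existing R functions, and the column-name pattern (stem followed by digits) is an assumption about how the plausible values are named.

```r
## Hypothetical helper: replace every occurrence of the symbol `old`
## with the symbol `new`, on either side of the `~`.
swap_var <- function(expr, old, new) {
    if (is.name(expr)) {
        ## A bare symbol: swap it if it is the one we are looking for
        if (identical(expr, as.name(old))) as.name(new) else expr
    } else if (is.call(expr)) {
        ## A call (a formula is one too): recurse into each element.
        ## Assigning in place keeps the formula class and environment.
        for (i in seq_along(expr)) {
            expr[[i]] <- swap_var(expr[[i]], old, new)
        }
        expr
    } else {
        ## Constants such as numbers are returned unchanged
        expr
    }
}

## Hypothetical helper: one formula per column of `data` whose name is
## `stem` followed by digits (assumed naming scheme)
expand_formula <- function(formula, stem, data) {
    vars <- grep(paste0("^", stem, "[0-9]+$"), names(data), value = TRUE)
    setNames(lapply(vars, function(v) swap_var(formula, stem, v)), vars)
}

f <- MRPCM + DSEX ~ IEP + ELL3 + SDRACEM + PARED
g <- swap_var(f, "MRPCM", "MRPCM1")
```

Here `g` keeps DSEX on the left-hand side, which is the case `update.formula()` could not handle. The sketch is not hardened against unusual expressions, for example calls with empty arguments such as `x[, 1]`.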