Estimate equation with varying numbers of parameters in R (lm) with demeaned variables

Question

I want to estimate an equation such as:

$Y = \alpha + \delta Z + \beta (X - \bar{X}) + \gamma Z (X - \bar{X})$

(where the bar denotes the mean of a variable.... Meaning, I want to automatically have interactions between Z and a demeaned version of X. So far I just demean the variables manually beforehand and estimate:

lm(Y ~ .*Z, data= sdata)

This seems to be working, but I would rather use a solution that does not require manual demeaning beforehand because I would also like to include the means of more complex terms, such as:

$Y = \alpha + \delta Z + \beta (A*X - \bar{A*X}) + \gamma Z (A*X - \bar{A*X})$

Edit: As requested, a working code-sample, note that in the actual thing I have large (and varying) numbers of X- variables, so that I dont want to use a hard-coded variant:

x1 <- runif(100)
x2 <- runif(100)
Z  <- runif(100)
Y  <- exp(x1) + exp(x2) + exp(z)

##current way of estimating the first equation:
sdata <- data.frame(Y=Y,Z=Z,x1=x1-mean(x1),x2=x2-mean(x2))
lm(Y ~ .*Z, data= sdata)

##basically what I want is that the following terms, and their interactions with Z are also used: 
#  X1^2 - mean(X1^2)
#  X2^2 - mean(X2^2)
#  X1*X2 - mean(X1*X2)

Edit 2: Now, what I want to achieve is basically what

lm(Y ~ .^2*Z, data= sdata)

would do. However, given prior demeaing expressions in there, such as: Z:X1:X2 would correspond to: (x1-mean(x1))*(x2-mean(x2)), while what I want to have is x1*x2-mean(x1*x2)

If would be easier to help if you provided a simple [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) which we can use for testing. — MrFlick, May 23 '16 at 15:55
So why don't you just put `mean(x)` into your formula? You may find it easier to use the `I(x*mean(x))` (capital "i" ) approach. See `?formula` — Carl Witthoft, May 23 '16 at 16:05
@Carl, I don't see how the `I` approach is going to help me with dynamically including interactions of Z with all second order polinomials of the `X*` variables — sheß, May 23 '16 at 16:18
Right now I don't see how your example works programmatically without mindreading. Do you wan the code to assume that all predictors with that are not named `Z` should be demeaned? Maybe it would work better to have a character vector of variables to demean or to not demean... or a formula involving just those variables, or something. — Gregor Thomas, May 23 '16 at 16:24
The code offered errors out. `X != x`. And did you really mean to define Y in terms of Y? There is a `scale` function that could be used to "demean" an entire dataframe. — IRTFM, May 23 '16 at 18:20
I made one more edit to clarify the question and to fix the code. @coffeinjunky, as far as I understand `scale` doesn't work inside the formula, hence I will either have to hard-code the solution or work with loops. I was trying to get a solution without this. — sheß, May 25 '16 at 09:31

coffeinjunky · Answer 1 · 2016-05-25T11:19:30.537

To show that scale works inside a formula:

lm(mpg ~ cyl + scale(disp*hp, scale=F), data=mtcars)

Call:
lm(formula = mpg ~ cyl + scale(disp * hp, scale = F), data = mtcars)

Coefficients:
                (Intercept)                          cyl  scale(disp * hp, scale = F)  
                  3.312e+01                   -2.105e+00                   -4.642e-05

Now for comparison let's scale the interaction outside the formula:

mtcars$scaled_interaction <- with(mtcars, scale(disp*hp, scale=F))

lm(mpg ~ cyl + scaled_interaction, data=mtcars)

Call:
lm(formula = mpg ~ cyl + scaled_interaction, data = mtcars)

Coefficients:
       (Intercept)                 cyl  scaled_interaction  
         3.312e+01          -2.105e+00          -4.642e-05

At least in these examples, it seems as if scale inside formulae is working.

To provide a solution to your specific issue:

Alternative 1: Use formulae

# fit without Z
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
vars <- attr(mod$terms, "term.labels")
vars <- gsub(":", "*", vars) # needed so that scale works later
vars <- paste0("scale(", vars, ", scale=F)")
newf <- as.formula(paste0("Y ~ ", paste0(vars, collapse = "+")))
# now interact with Z
f2 <- update.formula(newf, . ~ .*Z)
# This fives the following formula:
f2  
  Y ~ scale(x1, scale = F) + scale(x2, scale = F) + scale(x1*x2, scale = F) + 
  Z + scale(x1, scale = F):Z + scale(x2, scale = F):Z + scale(x1*x2, scale = F):Z

Alternative 2: Use Model Matrices

# again fit without Z and get model matrix
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
modmat <- apply(model.matrix(mod), 2, function(x) scale(x, scale=F))

Here, all x's and the interactions are demeaned:

> head(modmat)
     (Intercept)         x1          x2       x1:x2
[1,]           0  0.1042908 -0.08989091 -0.01095459
[2,]           0  0.1611867 -0.32677059 -0.05425087
[3,]           0  0.2206845  0.29820499  0.06422944
[4,]           0  0.3462069 -0.15636463 -0.05571430
[5,]           0  0.3194451 -0.38668844 -0.12510551
[6,]           0 -0.4708222 -0.32502269  0.15144812

> round(colMeans(modmat), 2)
(Intercept)          x1          x2       x1:x2 
          0           0           0           0

You can use the model matrix as follows:

modmat <- modmat[, -1] # remove intercept
lm(sdata$Y ~ modmat*sdata$Z)

It is not beautiful, but should do the work with any number of explanatory variables. You can also add Y and Z to the matrix so that the output looks prettier if this is a concern. Note that you can also create the model matrix directly without fitting the model. I took it from the fitted model directly since it have already fitted it for the first approach.

As a sidenote, it may be that this is not implemented in a more straight forward fashion because it is difficult to imagine situations in which demeaning the interaction is more desirable compared to the interaction of demeaned variables.

Comparing both approaches:

Here the output of both approaches for comparison. As you can see, apart from the coefficient names everything is identical.

> lm(sdata$Y ~ modmat*sdata$Z)

Call:
lm(formula = sdata$Y ~ modmat * sdata$Z)

Coefficients:
        (Intercept)             modmatx1             modmatx2          modmatx1:x2              sdata$Z  
            4.33105              1.56455              1.43979             -0.09206              1.72901  
   modmatx1:sdata$Z     modmatx2:sdata$Z  modmatx1:x2:sdata$Z  
            0.25332              0.38155             -0.66292  

> lm(f2, data=sdata)

Call:
lm(formula = f2, data = sdata)

Coefficients:
                (Intercept)         scale(x1, scale = F)         scale(x2, scale = F)  
                    4.33105                      1.56455                      1.43979  
  scale(x1 * x2, scale = F)                            Z       scale(x1, scale = F):Z  
                   -0.09206                      1.72901                      0.25332  
     scale(x2, scale = F):Z  scale(x1 * x2, scale = F):Z  
                    0.38155                     -0.66292

Thank you so much. But I still don't see how this get's me around using loops. as I can't seem to do things like: `Y~ Z + scale(.^2)*Z ` but will instead have to type `Y~ Z + scale(x1*x2)*Z + scale(x2*x2)*Z scale(x1*x1)*Z`, which will be probelmatic once I have not only `x1` and `x2`, but a variable number of variables — sheß, May 25 '16 at 09:45
Try looking into `update.formula`. I could imagine a workaround like `f <- Y ~ .` and `update.formula(f, . ~ scale(.)*Z)`. This will have to be tweaked a bit, though, to get the interactions going. — coffeinjunky, May 25 '16 at 09:53

Estimate equation with varying numbers of parameters in R (lm) with demeaned variables

1 Answers1