1

I would like to edit the model matrix used by predict.lm() in R to predict main effects but not interactions (but using the coefficients and variance from the full model containing interactions).

I have tried:

data(npk) #example data
mod <- lm(yield ~ N*P*K, data=npk, x=T) #run model
newmat <- mod$x # acquire model matrix
newmat[, c(5:8)] <- 0 #set interaction terms to 0
#try to predict on the new matrix..
predict(mod, as.data.frame(newmat), type="response", interval="confidence") 

... but this returns the error 'data' must be a data.frame, not a matrix or an array because predict.lm() does not accept a model matrix.

How can I predict using the model matrix given in my example code?

(or is there a better way of predicting main effects but not interactions, using the full model yield ~ N*P*K?)

jay.sf
  • 60,139
  • 8
  • 53
  • 110
corn_bunting
  • 349
  • 1
  • 11

2 Answers2

1

We could calculate the interactions by hand; done easily by first creating the terms trms, then evaluating them in an eval(parse()) approach.

## create interaction terms 
iv <- c('N', 'P', 'K')  ## indp. vars
trms <- unlist(sapply(2:3, function(m) combn(iv, m, FUN=paste, collapse='x')))
## evaluate them to a matrix
Ia <- with(npk1, sapply(trms, function(x) eval(parse(text=gsub('x', '*', x)))))

Then just cbind and use it in lm(), compare:

## cbind
npk2 <- cbind(npk1, Ia)
## following yield the same:
(mod1 <- lm(yield ~ .^3, data=npk1))
(mod2 <- lm(yield ~ ., data=npk2, x=TRUE))

Then you may folow your approach:

newmat <- mod2$x  ## acquire model matrix
newmat[, c(5:8)] <- 0  ## set interaction terms to 0
predict(mod2, newdata=as.data.frame(newmat))  ## newdata w/ Ia to zero
# 1        2        3        4        5        6        7        8        9       10 
# 54.90000 66.66667 51.43333 64.33333 63.76667 67.23333 52.00000 54.33333 54.33333 67.23333 
# 11       12       13       14       15       16       17       18       19       20 
# 63.76667 52.00000 63.76667 67.23333 52.00000 54.33333 66.66667 51.43333 64.33333 54.90000 
# 21       22       23       24 
# 64.33333 66.66667 54.90000 51.43333 

Whereas:

predict(mod1)  ## old model
# 1        2        3        4        5        6        7        8        9       10 
# 50.50000 57.93333 51.43333 54.66667 63.76667 54.36667 52.00000 54.33333 54.33333 54.36667 
# 11       12       13       14       15       16       17       18       19       20 
# 63.76667 52.00000 63.76667 54.36667 52.00000 54.33333 57.93333 51.43333 54.66667 50.50000 
# 21       22       23       24 
# 54.66667 57.93333 50.50000 51.43333 
    

Data:

npk1 <- structure(list(N = c(0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 
0, 0, 1, 0, 1, 0, 1, 1, 0, 0), P = c(1, 1, 0, 0, 0, 1, 0, 1, 
1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0), K = c(1, 0, 
0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 
0), yield = c(49.5, 62.8, 46.8, 57, 59.8, 58.5, 55.5, 56, 62.8, 
55.8, 69.5, 55, 62, 48.8, 45.5, 44.2, 52, 51.5, 49.8, 48.8, 57.2, 
59, 53.2, 56)), row.names = c(NA, 24L), class = "data.frame")
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • @corn_bunting You are absolutely right, we need more heavier artillery here, see update. Note that I created new `npk1` data below, as I learned they are just for example and factors not wanted. – jay.sf Jun 14 '21 at 20:08
  • Brilliant, this works! So it looks like the trick is to explicitly specify every interaction as a term in the model, so when `predict.lm()` gets to `newdata` and looks for column names that exist in the model terms, it finds all of them. Thanks! – corn_bunting Jun 15 '21 at 08:44
  • Just noticed this doesn't translate so easily if there are factors with multiple levels in the model (didn't think about that when I picked the example data, oops!). If it's easy to adjust the code so it's generalisable for factors, that would be handy... if not at least I think I can make it work now I understand the relation between the model formula and `newdata` :-) – corn_bunting Jun 15 '21 at 09:14
  • 1
    @corn_bunting Have you tried to convert your factors to numeric: https://stackoverflow.com/a/3418192/6574038 – jay.sf Jun 15 '21 at 09:58
  • 1
    Ah that could also be an option. I found the `model.matrix()` function instead though - see my answer added below. Feel free to edit my code if it's clunky. – corn_bunting Jun 15 '21 at 10:02
1

Using @jay.sf's answer, I managed to also create a version that works if there are factors with multiple levels in the model:

##full model (using block as a multi-level factor):
data(npk) 
mod1 <- lm(yield ~ N*block, data=npk, x=T)

## get model formula and use it to generate the model matrix:
predgrid <- data.frame(model.matrix(mod1, data=npk)) 

## make a new dataframe using the model matrix and the response,
## and run the model using all columns in the new dataframe as terms:
npk2 <- as.data.frame(cbind(npk$yield, predgrid[, -1])) 
colnames(npk2)[1] <- "yield" 
mod2 <- lm(yield~., data=npk2)

## extract the model matrix dataframe again, to modify for predictions:
newmat <- predgrid[, -1]
colnames(newmat)
newmat[, 7:11] <- 0

## predict on modified matrix dataframe:
pred <- predict(mod2, newdata=newmat, type="response", interval="confidence")
head(pred) ##
#    fit      lwr      upr
#1 48.15 41.18475 55.11525
#2 59.90 52.93475 66.86525
#3 48.15 41.18475 55.11525
#4 59.90 52.93475 66.86525
#5 67.50 55.43584 79.56416
#6 67.50 55.43584 79.56416

corn_bunting
  • 349
  • 1
  • 11
  • 1
    Nice solution, I just borrowed you my space bar ;) – jay.sf Jun 15 '21 at 10:07
  • 2
    In case this is useful to anyone: the `<- 0` can be replaced with `lapply(newmat[, 7:11],FUN=mean)` to average across interactions instead of ignore them. – corn_bunting Jun 15 '21 at 15:10