Why are my linear regression models giving me different prediction results in R?

Question

I was playing around with linear regression models in R, specifically taking the log of my variables and then making predictions from the model. I ran into a somewhat minor issue, but I'm curious about what is happening. For simplicity, say I have one predictor variable and the response. I take the log of both, but I set up the models in the following two ways:

m1<-lm(log(response)~log(variable))

log_response<- log(response)
log_variable<- log(variable)
m2<- lm(log_response~log_variable)

Both model summaries produce the same output, so I would assume the two models are equivalent. However, when I try to make a prediction, I get an error with m2.

newdata<-data.frame(variable=2)
predict(m1, newdata, interval="predict")
predict(m2, newdata, interval="predict")

With that, the prediction for m1 produces sensible output, but m2 returns the following error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  variable lengths differ (found for 'log_variable')
In addition: Warning message:
'newdata' had 1 row but variables found have 805 rows

Am I making some mistake in creating the log variables?

[See here](https://stackoverflow.com/q/5963269/5325862) on making a reproducible example that is easier for folks to help with. It's really hard to know what's going on without any data that would allow us to build the same models, but my guess is that it's because you're trying to predict values for a variable named `variable`, which isn't in your `m2` (there it's called `log_variable`). If that's the case, this is just a typo. - camille, Jan 28 '22 at 21:52

score 3 · Accepted Answer · answered Jan 28 '22 at 22:00

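The error message points to the cause. The columns in newdata must have the same names as the terms of the model you are predicting from. m2 was fitted on log_variable, so newdata needs a column called log_variable (already on the log scale), not variable. Because no such column exists, predict() picks up the 805-element log_variable vector from your workspace instead, which is what the warning reports. I presume the following should work.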

m1 <- lm(log(response) ~ log(variable))

log_response <- log(response)
log_variable <- log(variable)
m2 <- lm(log_response ~ log_variable)

newdata <- data.frame(variable = 2)
newdata2 <- data.frame(log_variable = log(2))
predict(m1, newdata, interval = "predict")
predict(m2, newdata2, interval = "predict")
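To see which column names each model expects in newdata, you can inspect the predictor side of its formula. The following is a minimal sketch on the built-in mtcars data (not the asker's data); the helper names vars_m1 and vars_m2 are made up for illustration.

```r
# Toy data: mtcars ships with R
m1 <- lm(log(mpg) ~ log(wt), data = mtcars)

log_response <- log(mtcars$mpg)
log_variable <- log(mtcars$wt)
m2 <- lm(log_response ~ log_variable)

# Drop the response from each formula and list the variable names
# that predict() will look up in newdata
vars_m1 <- all.vars(delete.response(terms(m1)))
vars_m2 <- all.vars(delete.response(terms(m2)))

vars_m1  # "wt": log() is part of the formula, so newdata holds raw values
vars_m2  # "log_variable": newdata must supply the already-logged value
```

For m1 the log() call is stored in the formula, so predict() applies it to the raw column in newdata. For m2 the transformation happened before fitting, so the model only knows the name log_variable.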

Proof of concept (using toy data)

m1 <- lm(log(mpg) ~ log(wt), data = mtcars)

log_response <- log(mtcars$mpg)
log_variable <- log(mtcars$wt)
m2 <- lm(log_response ~ log_variable)

newdata <- data.frame(wt = 2)
newdata2 <- data.frame(log_variable = log(2))
pred1 <- predict(m1, newdata, interval = "predict")
pred2 <- predict(m2, newdata2, interval = "predict")

identical(pred1, pred2)
#> [1] TRUE
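If you prefer to precompute the logged variables, one option is to keep them as columns of a data frame and pass it through the data argument, so the model never depends on loose vectors in the workspace. This is only a sketch on mtcars; the names dat, log_mpg, log_wt and m3 are illustrative and not part of the question.

```r
# Reference model with the transformation inside the formula
m1 <- lm(log(mpg) ~ log(wt), data = mtcars)
pred_m1 <- predict(m1, data.frame(wt = 2), interval = "prediction")

# Precomputed log columns stored in a data frame
dat <- transform(mtcars, log_mpg = log(mpg), log_wt = log(wt))
m3 <- lm(log_mpg ~ log_wt, data = dat)

# newdata uses the same column name the model was fitted with
pred_m3 <- predict(m3, data.frame(log_wt = log(2)), interval = "prediction")

# Both approaches should agree up to numerical tolerance
all.equal(pred_m1, pred_m3)
```

With this layout, a misnamed column in newdata fails with an "object not found" error instead of silently falling back to a vector of the wrong length.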
