Predict.lm in R fails to recognize newdata

Question

I'm running a linear regression in which the predictor's coefficient depends on a categorical variable, and I'm having trouble generating predictions for newdata.

First, I generate some random values for the predictor and the error terms. I then construct the response. Note that the predictor's coefficient depends on the value of a categorical variable. I compose a design matrix based on the predictor and its category.

set.seed(1)

category = c(rep("red", 5), rep("blue",5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)

y = ifelse(category == "red", x1 * 2, x1 * 3)
y = y + err

df = data.frame(x1 = x1, category = category)

dm = as.data.frame(model.matrix(~ category + 0, data = df))
dm = dm * df$x1

fit = lm(y ~ as.matrix(dm) + 0, data = df)

# This line will not produce a warning
predictOne = predict.lm(fit, newdata = dm)

# This line WILL produce a warning
predictTwo = predict.lm(fit, newdata = dm[1:5,])

The warning is:

'newdata' had 5 rows but variable(s) found have 10 rows

Unless I'm very much mistaken, I shouldn't have any issues with the variable names. (There are one or two discussions on this board which suggest that issue.) Note that the first prediction runs fine, but the second does not. The only change is that the second prediction uses only the first five rows of the design matrix.

Thoughts?

The real problem here is your, shall we say, "creative" attempt at specifying a model via `lm`'s formula interface. — joran, Jan 22 '13 at 02:28
The `predict.lm` help page says the 'newdata' argument needs to be a dataframe. The warning does appear a bit off target, but is arguably better than the default behavior which is to silently report the predictions from the original data when you might have thought that you were getting new predictions. — IRTFM, Jan 22 '13 at 02:32

score 4 · Answer 1 · answered Jan 22 '13 at 02:42

I'm not 100% sure what you're trying to do, but I think a short walk-through of how formulas work will clear things up for you.

The basic idea is very simple: you pass two things, a formula and a data frame. The terms in the formula should all be names of variables in your data frame.

Now, you can get lm to work without following that guideline exactly, but you're just asking for things to go wrong. So stop and look at your model specifications and think about where R is looking for things.

When you call lm, basically none of the names in your formula are actually found in the data frame df. So I suspect that df isn't being used at all.
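One way to check that suspicion is to list the variables the formula refers to and compare them with the columns of df. The sketch below rebuilds the data from the question; the names fit_with_data, fit_without_data, formula_vars, found_in_df and same_coef are made up for this illustration.

```r
set.seed(1)

# Rebuild the data from the question
category <- c(rep("red", 5), rep("blue", 5))
x1  <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y   <- ifelse(category == "red", x1 * 2, x1 * 3) + err

df <- data.frame(x1 = x1, category = category)
dm <- as.data.frame(model.matrix(~ category + 0, data = df)) * df$x1

# Same formula, fitted with and without the data argument
fit_with_data    <- lm(y ~ as.matrix(dm) + 0, data = df)
fit_without_data <- lm(y ~ as.matrix(dm) + 0)

# Variables the formula refers to, and whether df supplies any of them
formula_vars <- all.vars(formula(fit_with_data))
found_in_df  <- formula_vars %in% names(df)

# If df were contributing anything, dropping it should change the fit
same_coef <- isTRUE(all.equal(coef(fit_with_data), coef(fit_without_data)))

print(formula_vars)
print(found_in_df)
print(same_coef)
```

If neither name is found in df and the coefficients match, then y and dm are being picked up from the workspace, and that is where predict will look for dm later as well.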

Then if you call model.frame(fit) you'll see what R thinks your variables should be called. Notice anything strange?

model.frame(fit)
            y as.matrix(dm).categoryblue as.matrix(dm).categoryred
1   2.2588735                  0.0000000                 0.3735462
2   2.7571299                  0.0000000                 1.1836433
3  -0.2924978                  0.0000000                 0.1643714
4   2.9758617                  0.0000000                 2.5952808
5   3.7839465                  0.0000000                 1.3295078
6   0.4936612                  0.1795316                 0.0000000
7   4.4460969                  1.4874291                 0.0000000
8   6.1588103                  1.7383247                 0.0000000
9   5.5485653                  1.5757814                 0.0000000
10  2.6777362                  0.6946116                 0.0000000

Is there anything called as.matrix(dm).categoryblue in dm? Yeah, I didn't think so.

I suspect (but am not sure) that you meant to do something more like this:

df$y <- y
fit <- lm(y~category - 1,data = df)
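For what it's worth, here is a sketch of what that specification estimates, using the data from the question. The names fit_means, group_means and pred_first5 are just for illustration. Keep in mind that this model has no x1 term, so it fits one mean per category rather than one slope per category, which may or may not be what you were after.

```r
set.seed(1)

# Rebuild the data from the question, with y stored in the data frame
category <- c(rep("red", 5), rep("blue", 5))
x1  <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y   <- ifelse(category == "red", x1 * 2, x1 * 3) + err
df  <- data.frame(x1 = x1, category = category, y = y)

# Every name in the formula is now a column of df
fit_means <- lm(y ~ category - 1, data = df)

# The coefficients are the per-category means of y
group_means <- tapply(df$y, df$category, mean)

# newdata is honoured: five rows in, five predictions out
pred_first5 <- predict(fit_means, newdata = df[1:5, ])

print(coef(fit_means))
print(group_means)
print(length(pred_first5))
```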

score 2 · Answer 2 · answered Jan 22 '13 at 15:07

Joran is on the right track. The issue relates to column names. What I had wanted to do was create my own design matrix, something which, as it happens, I didn't need to do. If I run the model with the following line of code, it's smooth sailing:

fit = lm(y ~ x1:category + 0, data = df)

That formula designation will replace the manual construction of the design matrix.
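Here is a minimal sketch of the whole workflow with that formula, assuming y is stored in df alongside x1 and category. The names pred_first5, new_obs and pred_new are made up for the example, and the values in new_obs are arbitrary.

```r
set.seed(1)

# Rebuild the data from the question, with y stored in the data frame
category <- c(rep("red", 5), rep("blue", 5))
x1  <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y   <- ifelse(category == "red", x1 * 2, x1 * 3) + err
df  <- data.frame(x1 = x1, category = category, y = y)

# One slope for x1 per category, no intercept
fit <- lm(y ~ x1:category + 0, data = df)

# Predicting on a subset of the original rows returns one value per row
pred_first5 <- predict(fit, newdata = df[1:5, ])

# Predicting on new observations (arbitrary illustrative values)
new_obs  <- data.frame(x1 = c(0.5, 2), category = c("red", "blue"))
pred_new <- predict(fit, newdata = new_obs)

print(coef(fit))
print(pred_first5)
print(pred_new)
```

Because x1 and category are real columns, predict can rebuild the design matrix from whatever data frame it is handed, so the number of predictions follows the number of rows in newdata.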

Using my own design matrix is something I had done in the past and the fit parameters and diagnostics were just as they ought to have been. I'd not used the predict function, so had never known that R was discarding the "data = " parameter. A warning would have been cool. R is a harsh mistress.

score 1 · Answer 3 · edited Jul 01 '19 at 11:44


This may help. Pass the new data as a data.frame whose column names match the predictors in the formula, for example:

x = 1:5
y = c(2,4,6,8,10)

fit = lm(y ~ x)

# PREDICTION
newx = c(3,5,7)

predict(fit, data.frame(x=newx))
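The column name matters as much as the class. The sketch below reuses the same toy data and compares a correctly named column with a deliberately wrong one; the column name z and the variables pred_right, pred_wrong and caught are made up for the illustration.

```r
x <- 1:5
y <- c(2, 4, 6, 8, 10)
fit <- lm(y ~ x)

newx <- c(3, 5, 7)

# Column name matches the predictor in the formula
pred_right <- predict(fit, newdata = data.frame(x = newx))

# Column name does not match (z is a deliberately wrong, hypothetical name).
# Capture the warning message instead of letting it print.
caught <- NULL
pred_wrong <- withCallingHandlers(
  predict(fit, newdata = data.frame(z = newx)),
  warning = function(w) {
    caught <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

print(length(pred_right))
print(length(pred_wrong))
print(caught)
```

With the wrong name, predict cannot find x in newdata and falls back to the x in the workspace, so it returns predictions for the original five observations. That is the same mechanism behind the warning in the question.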

answered Jul 01 '19 at 10:54 by Hadi Pourbagher; edited Jul 01 '19 at 11:44 by Aaron_ab
