I'm running a linear regression where the predictor is categorized by another value and am having trouble generating modeled responses for newdata.
First, I generate some random values for the predictor and the error terms. I then construct the response. Note that the predictor's coefficient depends on the value of a categorical variable. I compose a design matrix based on the predictor and its category.
set.seed(1)
category = c(rep("red", 5), rep("blue",5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)
y = ifelse(category == "red", x1 * 2, x1 * 3)
y = y + err
df = data.frame(x1 = x1, category = category)
dm = as.data.frame(model.matrix(~ category + 0, data = df))
dm = dm * df$x1
fit = lm(y ~ as.matrix(dm) + 0, data = df)
# This line will not produce a warning
predictOne = predict.lm(fit, newdata = dm)
# This line WILL produce a warning
predictTwo = predict.lm(fit, newdata = dm[1:5,])
The warning is:
'newdata' had 5 rows but variable(s) found have 10 rows
Unless I'm very much mistaken, I shouldn't have any issues with the variable names. (There are one or two discussions on this board which suggest that issue.) Note that the first prediction runs fine, but the second does not. The only change is that the second prediction uses only the first five rows of the design matrix.
Thoughts?