It is impossible to get estimation of new factor levels, in fixed effect modelling, including linear models and generalized linear models. glm
(as well as lm
) keeps records of what factor levels are presented and used during model fitting, and can be found in testreg$xlevels
.
Your model formula for model estimation is:
returnShipment ~ size + color + price + manufacturerID + salutation +
state + age + deliverytime
then predict
complains new factor levels 125, 136, 137 for manufactureID
. This means, these levels are not inside testreg$xlevels$manufactureID
, therefore has no associated coefficient for prediction. In this case, we have to drop this factor variable and use a prediction formula:
returnShipment ~ size + color + price + salutation +
state + age + deliverytime
However, the standard predict
routine can not take your customized prediction formula. There are commonly two solutions:
- extract model matrix and model coefficients from
testreg
, and manually predict model terms we want by matrix-vector multiplication. This is what the link given in your post suggests to do;
- reset the factor levels in
test
into any one level appeared in testreg$xlevels$manufactureID
, for example, testreg$xlevels$manufactureID[1]
. As such, we can still use the standard predict
for prediction.
Now, let's first pick up a factor level used for model fitting
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
Then we assign this level to your prediction data:
replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement
And we are ready to predict:
pred <- predict(testreg, test, type = "link") ## don't use type = "response" here!!
In the end, we adjust this linear predictor, by subtracting factor estimate:
est <- coef(testreg)[paste0(manufacturerID, mID125)]
pred <- pred - est
Finally, if you want prediction on the original scale, you apply the inverse of link function:
testreg$family$linkinv(pred)
update:
You complained that you met various troubles in trying the above solutions. Here is why.
Your code:
testreg <- glm(train$returnShipment~ train$size + train$color +
train$price + train$manufacturerID + train$salutation +
train$state + train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
is a very bad way to specify your model formula. train$returnShipment
, etc, will restrict the environment of getting variables strictly to data frame train
, and you will have trouble in later prediction with other data sets, like test
.
As a simple example for such drawback, we simulate some toy data and fit a GLM:
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo) ## bad style
> toy$formula
foo$y ~ foo$a
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"
Now, we see everything comes with a prefix foo$
. During prediction:
newdata <- foo[1:2, ] ## take first 2 rows of "foo" as "newdata"
rm(foo) ## remove "foo" from R session
predict(toy, newdata)
we get an error:
Error in eval(expr, envir, enclos) : object 'foo' not found
The good style is to specify environment of getting data from data
argument of the function:
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)
then foo$
goes away.
> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
This would explain two things:
- You complained to me in the comment that when you do
testreg$xlevels$manufactureID
, you get NULL
;
The prediction error you posted
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
complains train$manufacturerID
instead of test$manufacturerID
.