predict.glm() with three new categories in the test data (r)(error)

Question

I have a data set called data which has 481 092 rows.

I split data into two equal halves:

The first halve (row 1: 240 546) is called train and was used for the glm();
the second halve (row 240 547 : 481 092) is called test and should be used to validate the model;

Then I started the regression:

testreg <- glm(train$returnShipment ~ train$size + train$color + train$price + 
               train$manufacturerID + train$salutation + train$state +
               train$age + train$deliverytime, 
               family=binomial(link="logit"), data=train)

Now the prediction:

prediction <- predict.glm(testreg, newdata=test, type="response")

gives me an Error:

Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137

Now I know that these levels were omitted in the regression because it doesn't show any coefficients for these levels.

I have tried this: predict.lm() with an unknown factor level in test data . But it somehow doesn't work for me or I maybe just don't get how to implement it. I want to predict the dependent binary variable but of course only with the existing coefficients. The link above suggests to tell R that rows with new levels should just be called /or treated as NA.

How can I proceed?

Edit-Suggested approach by Z. Li

I got problem in the first step:

xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]

but mID125 is NULL! What have I done wrong?

score 4 · Answer 1 · edited May 23 '17 at 12:19

It is impossible to get estimation of new factor levels, in fixed effect modelling, including linear models and generalized linear models. glm (as well as lm) keeps records of what factor levels are presented and used during model fitting, and can be found in testreg$xlevels.

Your model formula for model estimation is:

returnShipment ~ size + color + price + manufacturerID + salutation + 
                 state + age + deliverytime

then predict complains new factor levels 125, 136, 137 for manufactureID. This means, these levels are not inside testreg$xlevels$manufactureID, therefore has no associated coefficient for prediction. In this case, we have to drop this factor variable and use a prediction formula:

returnShipment ~ size + color + price + salutation + 
                 state + age + deliverytime

However, the standard predict routine can not take your customized prediction formula. There are commonly two solutions:

extract model matrix and model coefficients from testreg, and manually predict model terms we want by matrix-vector multiplication. This is what the link given in your post suggests to do;
reset the factor levels in test into any one level appeared in testreg$xlevels$manufactureID, for example, testreg$xlevels$manufactureID[1]. As such, we can still use the standard predict for prediction.

Now, let's first pick up a factor level used for model fitting

xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]

Then we assign this level to your prediction data:

replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement

And we are ready to predict:

pred <- predict(testreg, test, type = "link")  ## don't use type = "response" here!!

In the end, we adjust this linear predictor, by subtracting factor estimate:

est <- coef(testreg)[paste0(manufacturerID, mID125)]
pred <- pred - est

Finally, if you want prediction on the original scale, you apply the inverse of link function:

testreg$family$linkinv(pred)

update:

You complained that you met various troubles in trying the above solutions. Here is why.

Your code:

testreg <- glm(train$returnShipment~ train$size + train$color + 
               train$price + train$manufacturerID + train$salutation + 
               train$state + train$age + train$deliverytime,
               family=binomial(link="logit"), data=train)

is a very bad way to specify your model formula. train$returnShipment, etc, will restrict the environment of getting variables strictly to data frame train, and you will have trouble in later prediction with other data sets, like test.

As a simple example for such drawback, we simulate some toy data and fit a GLM:

set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo)  ## bad style

> toy$formula
foo$y ~ foo$a  
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"

Now, we see everything comes with a prefix foo$. During prediction:

newdata <- foo[1:2, ]  ## take first 2 rows of "foo" as "newdata"
rm(foo)  ## remove "foo" from R session
predict(toy, newdata)

we get an error:

Error in eval(expr, envir, enclos) : object 'foo' not found

The good style is to specify environment of getting data from data argument of the function:

foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)

then foo$ goes away.

> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"

This would explain two things:

You complained to me in the comment that when you do testreg$xlevels$manufactureID, you get NULL;

The prediction error you posted

Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137

complains train$manufacturerID instead of test$manufacturerID.

Thank you. But I don't want to drop a whole factor variable but just the levels 125, 136 and 137 (of manufacturerID). Can I use your first code line like this: mID125 <- manufacturerID[125] ? — Vinc, May 18 '16 at 20:24
Ok maybe I misunderstand you or the word factor variable. So the levels 125, 136 and 137 of the variable manufacturerID was not used in the glm() but exist in the test data part (which you called newdata). — Vinc, May 18 '16 at 20:27
Sorry but I don't understand it completly. That's why it doesn't work. I will edit my question to let you know what I have done ok? — Vinc, May 18 '16 at 21:32
Ok I wrote what I have done so far. Maybe you can edit the "EDIT" part in the question with the right code and maybe some more explanation. And please be welcomed to keep it as simple as possible :D R is new to me and english is not my first language :D — Vinc, May 18 '16 at 21:48
@Zheyuan_Li Yes of course. I will post it in the answer below the EDIT — Vinc, May 18 '16 at 22:03
@ Zheyuan_Li I added everything you asked for in the answer :) — Vinc, May 18 '16 at 22:27
when I type: test$manufacturerID <- factor(mID125, levels xlevels) it gives me: Error in `$<-.data.frame`(`*temp*`,"manufacturerID", value =integer(0)) : replacement has 0 rows, data has 240546 — Vinc, May 18 '16 at 22:51
When I type: replacement <- factor... It tells me: Warning Message: In rep(mID125, length=nrow(test)): 'x' is NULL so the result will be NULL. And then when I type test$manufacturerID y- replacment it says: Error in `$<-.data.frame`(`*tmp*`,"manufacturerID", value =integer(0)): replacment has 0 rows, data has 240546 — Vinc, May 18 '16 at 23:06
so far xlevels is NULL and mID125 is also NULL. What am I missing here? — Vinc, May 18 '16 at 23:38
Ok thank you very much. What does soon mean? ;) The reason manufacturerID 125 is not in the train data subset is that the color is = '?'. That is why R is omitting the rows right? In "test" where manufacturerID =125 I got color='?' but also color = green. — Vinc, May 18 '16 at 23:53
Ok thank you for your advice. What do you think about when I do what you wrote + delete the missing values of the variable colour? These missing are just 143 rows so it won't effect the model that much. Then there should't be the problem with the manufacturerID 125, 136, 137 anymore. — Vinc, May 19 '16 at 00:13

score 3 · Answer 2 · answered May 18 '16 at 18:10

As you have divided your train and test sample based on rownumbers, some factor levels of your variables are not equally represented in both the train and test samples.

You need to do stratified sampling to ensure that both train and test samples have all factor level representations. Use stratified from the splitstackshape package.

predict.glm() with three new categories in the test data (r)(error)

2 Answers2

Linked