Error in model.frame.default for Predict() - "Factor has new levels" - For a Char Variable

Question

I have a dataset I split into test/train datasets. Immediately following that split I produced a logistic model with:

logModel1 <- glm(Y ~ . - var1 - var2 - var3, data = train, family = binomial)

If I use that model to make predictions on the same train set, I get no error (though of course that is not a very useful test of my model). So I used the code below to predict on my test set:

predictLog1 <- predict(logModel1, type="response", newdata=test)

But I get the following error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor myCharVar has new levels This is an observation of myCharVar, This is another...

Here's what's got me particularly confused:

1. myCharVar is a character variable in both my train and test sets. I've confirmed this with str(test$myCharVar) and str(train$myCharVar).
2. My model does not even use myCharVar as part of the prediction.

I found an explanation for bullet 2 at this SO link: "Factor has new levels" error for variable I'm not using

And the suggestion there to remove the character variables altogether from my train and test sets has provided me a workaround so at least I'm not held up. But that seems pretty inelegant, as opposed to just removing them from the model with "-myCharVar". If anyone understands why a character variable in my test set would throw a "factor has new levels" error I'd certainly be interested.
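For reference, the workaround described above (dropping the character column from both data sets before fitting) can be sketched as follows. The toy data frames and the column name x1 are assumptions made up for illustration; only Y and myCharVar come from the question.

```r
# Toy data standing in for the real train/test split (hypothetical values)
train <- data.frame(
  Y = rep(c(0, 1), 10),
  x1 = sin(1:20),
  myCharVar = paste("observation", 1:20),
  stringsAsFactors = FALSE
)
test <- data.frame(
  Y = c(0, 1, 0),
  x1 = c(0.1, -0.4, 0.7),
  myCharVar = paste("observation", 21:23),
  stringsAsFactors = FALSE
)

# Drop the character column from both sets instead of using "- myCharVar"
drop_cols <- c("myCharVar")
train_small <- train[, setdiff(names(train), drop_cols), drop = FALSE]
test_small <- test[, setdiff(names(test), drop_cols), drop = FALSE]

# The formula no longer mentions myCharVar at all, so no levels are recorded
fit <- glm(Y ~ ., data = train_small, family = binomial)
pred <- predict(fit, newdata = test_small, type = "response")
```

Because the column never enters the formula, predict() has nothing to check it against in the test set.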

score 6 · Accepted Answer · edited May 23 '17 at 11:45

The answer in the post you linked already indicates why myCharVar is still considered by the model. When you use z ~ . - y, the formula basically expands to z ~ (x + y) - y, so y is still one of the formula's variables even though it contributes no term to the fit.
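The expansion can be inspected on a fitted model. This is a minimal sketch on toy data; the columns z, x and y are hypothetical stand-ins for the variables in the question.

```r
# Toy data: z is the response, x a numeric predictor, y a character column
set.seed(1)
train <- data.frame(
  z = rep(c(0, 1), 10),
  x = rnorm(20),
  y = rep(c("a", "b", "c"), length.out = 20),
  stringsAsFactors = FALSE
)

fit <- glm(z ~ . - y, data = train, family = binomial)

# The stored formula still mentions y
print(formula(fit))

# y contributes no term to the fit ...
term_labels <- attr(terms(fit), "term.labels")
print(term_labels)

# ... but it is still a column of the model frame ...
frame_vars <- names(model.frame(fit))
print(frame_vars)

# ... and its training values are recorded alongside the model
recorded_levels <- fit$xlevels
print(recorded_levels)
```

The recorded levels in fit$xlevels are what predict() later compares the new data against.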

Now, to answer your other question: Consider the following quote from the predict() documentation: "For factor variables having numeric levels, you can specify the numeric values in newdata without first converting the variables to factors. These numeric values are checked to make sure they match a level, then the variable is converted internally to a factor".

I think we can assume that the same kind of behaviour occurs for myCharVar: its values are first checked against the levels recorded in the model, and this is where it goes wrong. The test set contains values of myCharVar that were never encountered while training the model. (Note that the glm function itself also performs factor conversion; it throws a warning when conversion needs to take place.) In summary, the error means that the model cannot make predictions for levels in the test data that it never encountered during training.
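A small reproduction may make this concrete. The sketch below uses made-up data (columns z, x, y are hypothetical), triggers the error on purpose, and then lists the test values that were not recorded during fitting.

```r
# Toy training data: the character column y only ever takes "a" or "b"
train <- data.frame(
  z = rep(c(0, 1), 10),
  x = sin(1:20),
  y = rep(c("a", "b"), each = 10),
  stringsAsFactors = FALSE
)
# Toy test data: y contains a value that never occurs in train
test <- data.frame(
  z = c(0, 1, 0),
  x = c(0.1, -0.4, 0.7),
  y = c("a", "b", "new value"),
  stringsAsFactors = FALSE
)

# y is subtracted from the formula, yet its levels are still recorded
fit <- glm(z ~ . - y, data = train, family = binomial)

# Capture the error message instead of stopping the script
msg <- tryCatch(
  {
    predict(fit, newdata = test, type = "response")
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Values of y in the test set that are missing from the recorded levels
unseen <- setdiff(unique(test$y), fit$xlevels$y)
print(unseen)
```

Running the same setdiff() check on the real data shows which observations of myCharVar trigger the error.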

In this post there is another clarification given on the issue.

Hi Jellen, I tried to convey I had found the answer on "why myCharVar is still considered" with "I found an explanation for bullet 2 at this SO link." Sorry if that wasn't clear. Thanks a lot for the explanation on variables being converted internally to factors, that's very helpful to know, and completely answers my question. — Max Power, Apr 26 '15 at 04:03
