
Consider a simple dataset, split into a training and testing set:

dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1))
train <- dat[1:4,]
train
#   x y z
# 1 1 a 0
# 2 2 b 0
# 3 3 c 1
# 4 4 d 0
test <- dat[5,]
test
#   x y z
# 5 5 e 1

When I train a logistic regression model to predict z using x and obtain test-set predictions, all is well:

mod <- glm(z~x, data=train, family="binomial")
predict(mod, newdata=test, type="response")
#         5 
# 0.5546394 

However, prediction fails on an equivalent-looking logistic regression model, with a "factor y has new level e" error:

mod2 <- glm(z~.-y, data=train, family="binomial")
predict(mod2, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
#   factor y has new level e

Since I removed y from my model equation, I'm surprised to see this error message. In my application, dat is very wide, so z~.-y is the most convenient model specification. The simplest workaround I can think of is removing the y variable from my data frame and then training the model with the z~. syntax, but I was hoping for a way to use the original dataset without the need to remove columns.

josliber
  • In my case I had a bug in my code that made the model unstable. I was increasing weights of correctly classified instances and decreasing incorrectly classified instances. It should be the other way around... – felixmp Jul 01 '22 at 09:16

3 Answers


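You could try updating mod2$xlevels[["y"]] in the model object so that it also contains the levels found in the test set:

mod2 <- glm(z~.-y, data=train, family="binomial")
# as.character() covers both factor and character columns
# (y is a character column by default in R >= 4.0, where levels(test$y) is NULL)
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], unique(as.character(test$y)))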

predict(mod2, newdata=test, type="response")
#        5 
#0.5546394 

Another option is to leave "y" out of the data passed to glm, without removing it from train itself:

mod2 <- glm(z~., data=train[, !colnames(train) %in% "y", drop=FALSE], family="binomial")
predict(mod2, newdata=test, type="response")
#        5 
#0.5546394 
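
Since the question asks for a way to keep the original data frame intact, a further possibility is to build the formula from the column names instead of using `.`. This is only a sketch and not part of the answer above: `drop_cols` and `mod3` are names made up for illustration, and it assumes the response column is called z. Because y never enters the formula, it never enters the model frame either, so its levels are not checked at prediction time.

```r
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1))
train <- dat[1:4, ]
test <- dat[5, ]

# Hypothetical: columns to keep out of the model
drop_cols <- c("y")

# Every column except the response and the dropped ones becomes a predictor
predictors <- setdiff(names(train), c("z", drop_cols))
f <- reformulate(predictors, response = "z")   # z ~ x

mod3 <- glm(f, data = train, family = "binomial")

# Same fit as glm(z ~ x, ...), so the prediction matches that model
p <- predict(mod3, newdata = test, type = "response")
p
```

For a very wide data frame only `drop_cols` has to be maintained, which keeps the convenience of `z~.-y` while train and test stay untouched.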
matt_k
  • These are both good options -- thanks! The behavior described in the post almost seems like a bug (I don't see why I should have to remove `y` from my dataframe with the second model specification), but these are sensible workarounds. – josliber Mar 11 '14 at 14:26
  • If you run `debug` on `glm` you can see where it's creating the model terms `mt <- attr(mf, "terms")`. I think `y` is being treated as if it's in the model because when you use `z~.-y` the formula expands to `z ~ (x + y) - y`, so `y` is technically in the model, but I don't have any other insight (just a workaround :)) – matt_k Mar 12 '14 at 02:53
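
The explanation in the last comment can be checked without `debug` by inspecting the fitted object. The snippet below is a small sketch that rebuilds the question's data; it shows that y contributes no term to the model, yet is still stored in the model frame and in `xlevels`, which is what `predict` compares the new data against.

```r
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1))
train <- dat[1:4, ]

mod2 <- glm(z ~ . - y, data = train, family = "binomial")

# Terms that actually receive a coefficient: only x
attr(terms(mod2), "term.labels")
# [1] "x"

# Variables kept in the model frame: y is still there
names(model.frame(mod2))
# [1] "z" "x" "y"

# Levels recorded for later checks against newdata: y again
names(mod2$xlevels)
# [1] "y"
```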

We may generalize @matt_k's great solution to apply it to high-dimensional data where there are multiple factors with different levels in the training and test sets, like these:

dat2
#   x y1 y2 z
# 1 1  a  A 0
# 2 2  b  B 0
# 3 3  c  C 1
# 4 4  d  D 0
# 5 5  e  E 1

When we divide into test and training as before,

train <- dat2[1:4, ]
test <- dat2[5, ]

both y1 and y2 test levels will differ from those of train and we get the error.

mod <- glm(z ~ ., data=train, family="binomial")
predict(mod, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
#   factor y1 has new level e

With high-dimensional data, correcting every failing factor by hand is tedious, so we might want to loop over them.

The offending columns are either of class "factor" or of class "character" (as in our case). Since these are the ones recorded in 'xlevels', we use a small helper that identifies them,

is.prone <- function(x) is.factor(x) || is.character(x)

and use it to select the columns whose values we merge into 'xlevels' with Map.

id <- sapply(dat2, is.prone)
mod$xlevels <- Map(union, mod$xlevels, lapply(dat2[id], unique))

Then it should work.

predict(mod, newdata=test, type="response")
#            5 
# 5.826215e-11 
# Warning message:
# In predict.lm(object, newdata, se.fit, scale = 1, type = if (type ==  :
#   prediction from a rank-deficient fit may be misleading
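
A more cautious variant of the loop above, offered only as a sketch: the helper name `relax_unused_levels` is made up here and is not part of any package. It widens `xlevels` only for variables that sit in the model frame without contributing a term, which is the situation in the original question. For a factor that is part of the model, a new level has no fitted coefficient, so widening its levels cannot give a meaningful prediction for that row; the rank-deficiency warning above hints at this.

```r
dat2 <- data.frame(x  = 1:5,
                   y1 = c("a", "b", "c", "d", "e"),
                   y2 = c("A", "B", "C", "D", "E"),
                   z  = c(0, 0, 1, 0, 1))
train <- dat2[1:4, ]
test  <- dat2[5, ]

# Hypothetical helper: extend xlevels only for variables that are stored in
# the model frame but do not appear in any model term.
relax_unused_levels <- function(model, newdata) {
  # Variables that appear in at least one term of the fitted model
  used <- all.vars(parse(text = attr(terms(model), "term.labels")))
  # Variables with recorded levels that no term uses
  unused <- setdiff(names(model$xlevels), used)
  for (v in intersect(unused, names(newdata))) {
    model$xlevels[[v]] <- union(model$xlevels[[v]],
                                unique(as.character(newdata[[v]])))
  }
  model
}

# y1 and y2 are subtracted from the formula, as in the original question
mod <- glm(z ~ . - y1 - y2, data = train, family = "binomial")
mod <- relax_unused_levels(mod, test)

# Works now, and equals the prediction from glm(z ~ x, ...)
p <- predict(mod, newdata = test, type = "response")
p
```

Factors that do enter the model are left alone, so a genuinely unseen level still raises the error instead of producing a silent and possibly misleading number.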

Data:

dat2 <- structure(list(x = 1:5, y1 = c("a", "b", "c", "d", "e"), y2 = c("A", 
"B", "C", "D", "E"), z = c(0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA, 
-5L))
jay.sf

I was confused by this issue for a long time, but the cause turned out to be simple. One of my variables, "traffic type", was a factor with 20 levels, and level 17 occurred in only one row. That row could therefore end up in either the training data or the test data. In my case it ended up in the test data, hence the error that factor "traffic type" has new level 17: the training data contained no row with level 17. After I deleted this row from the data set, the model ran fine.

  • Hi Bhavna -- yes, for sure you can get this error if the test set has a new level for a factor you are using in the model, and removing that observation is a reasonable way to proceed. In this question I was specifically asking about a factor I was *not* using in the model but that just happened to be present in my data frame. In this setting, we shouldn't have to remove the observation from the test set, and matt_k gives some nice approaches. – josliber Apr 17 '19 at 15:55
  • Not really a solution... You can't just delete all unknown factors from a set. – felixmp Jul 01 '22 at 08:05
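
If deleting rows is not acceptable, one alternative is to keep every test row and return NA for the rows whose factor levels the model has never seen. The following is a minimal sketch under assumptions: `predict_known_levels` is a hypothetical helper name, and the toy data is invented for illustration (it is not the "traffic type" data from the answer).

```r
# Invented toy data: y is used in the model, and the test set contains
# a level ("c") that never occurs in the training set.
train <- data.frame(y = c("a", "a", "a", "b", "b", "b"),
                    z = c(0, 1, 1, 0, 0, 1))
test  <- data.frame(y = c("a", "c", "b"))

mod <- glm(z ~ y, data = train, family = "binomial")

# Hypothetical helper: predict only for rows whose levels are known to the
# model and leave NA for the rest.
predict_known_levels <- function(model, newdata, ...) {
  ok <- rep(TRUE, nrow(newdata))
  for (v in intersect(names(model$xlevels), names(newdata))) {
    ok <- ok & as.character(newdata[[v]]) %in% model$xlevels[[v]]
  }
  out <- rep(NA_real_, nrow(newdata))
  if (any(ok)) {
    out[ok] <- predict(model, newdata = newdata[ok, , drop = FALSE], ...)
  }
  out
}

p <- predict_known_levels(mod, test, type = "response")
p  # second element is NA because level "c" was not in the training data
```

The result has one entry per test row, so it can be bound back onto the test set, and the NA rows can then be handled separately (for example by collapsing rare levels before training).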