Caret vs lm with new or unused factor levels

Question

How to resolve the "factor has new levels" error when using the result of lm to predict on new data has already been asked several times (here, here, here, among others). The answers mostly blame the data itself, and the suggestions revolve around fixing the NAs in the data or dropping the unused levels.

However, the error arises because lm forces drop.unused.levels = TRUE when it builds its model frame, and this behavior is not present when using the caret::train function instead.

Why does lm need to drop the unused levels while caret::train does not?

Here is a MWE:

data <- data.frame("y" = c(1:10),
                   "x" = rep(c("a", "b"), 5))
data$x <- factor(data$x, levels = c("a", "b", "c")) # Explicitly specifies the unused level "c"

# Model using lm
model_lm <- lm(y~x, data)

# Same model, using caret
nocv <- caret::trainControl(method = "none")
model_caret <- caret::train(y~x, data = data, method = "lm", trControl = nocv)

# Test data contains the new level
test_data <- data.frame("x" = c("a", "b", "c"))

# This works. The intercept value is used for the "c" prediction.
predict(model_caret, test_data)

# This returns Error: factor x has new levels c
predict(model_lm, test_data)
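
The difference can be inspected in base R without caret. The following is a minimal sketch, not caret's actual implementation: it assumes that caret::train ends up fitting on a design matrix that keeps every declared level, which is what the working prediction above suggests. The names mf_keep, X and fit_keep are only illustrative.

```r
# Same toy data as in the MWE: level "c" is declared but never observed.
data <- data.frame(y = 1:10, x = rep(c("a", "b"), 5))
data$x <- factor(data$x, levels = c("a", "b", "c"))

# lm() builds its model frame with drop.unused.levels = TRUE,
# so the fitted object only remembers the observed levels.
model_lm <- lm(y ~ x, data)
lm_levels <- model_lm$xlevels$x

# model.frame() called directly uses drop.unused.levels = FALSE by default,
# so the unused level "c" is kept.
mf_keep <- model.frame(y ~ x, data)
kept_levels <- levels(mf_keep$x)

# With all levels kept, the design matrix has an all-zero column for "c",
# and the least-squares fit cannot estimate a coefficient for it.
X <- model.matrix(y ~ x, mf_keep)
fit_keep <- lm.fit(X, data$y)

print(lm_levels)
print(kept_levels)
print(fit_keep$coefficients)
```

If the assumption holds, the "c" coefficient is not estimated in the caret fit, which would explain why the prediction for "c" falls back to the intercept instead of raising an error.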
