I'm fitting a linear regression on a dataset with many categorical variables, each with several categories (up to 45 in one of them).
I split the data into training and test sets this way:
## 70% of the sample size
smp_size <- floor(0.7 * nrow(plot_data))
## set the seed to make the partition reproducible
set.seed(888)
train_ind <- sample(seq_len(nrow(plot_data)), size = smp_size)
train <- plot_data[train_ind, ]
test <- plot_data[-train_ind, ]
Then I make the model like this:
## refer to the response by column name so that "." excludes it from the predictors
linear_model <- lm(dependent_variable ~ ., data = train)
The problem is that whenever I try to predict on the test set, the test set contains some categories that the training set does not.
pred_data = predict(linear_model, newdata = test)
This gives me the following error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor origin has new levels someCategory1, SomeCategory2
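To pin down which levels trigger the error, a check along these lines can help. This is only a minimal sketch on made-up data: the data frame `toy`, its values, and the names `toy_train`, `toy_test`, and `unseen` are hypothetical stand-ins for `plot_data` and the real split. It lists, for each factor column, the levels that occur in the test rows but never in the training rows.

```r
## Toy data standing in for plot_data (hypothetical values)
set.seed(1)
toy <- data.frame(
  y = rnorm(8),
  origin = factor(c("a", "a", "b", "b", "c", "c", "d", "e"))
)

## Fixed split so the mismatch is easy to see
toy_train <- toy[1:6, ]
toy_test <- toy[7:8, ]

## Names of all factor columns in the data
factor_cols <- names(toy)[sapply(toy, is.factor)]

## For each factor column, levels that occur in the test rows
## but never in the training rows
unseen <- lapply(
  setNames(factor_cols, factor_cols),
  function(col) {
    setdiff(
      as.character(toy_test[[col]]),
      as.character(toy_train[[col]])
    )
  }
)

print(unseen)
```

Any column with a non-empty entry in `unseen` would presumably make `predict()` fail in the same way as above.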
Is there a way to ensure that every category appears in both the training and test sets, or is there a workaround for this?