
I'm using linear regression on a dataset with many categorical variables, each containing several categories (up to 45 in one of them).

I'm sampling the data this way:

## 70% of the sample size
smp_size <- floor(0.7 * nrow(plot_data))
## set the seed to make your partition reproducible
set.seed(888)
train_ind <- sample(seq_len(nrow(plot_data)), size = smp_size)

train <- plot_data[train_ind, ]
test <- plot_data[-train_ind, ]

Then I make the model like this:

linear_model <- lm(dependent_variable ~ ., data = train)

(Note: `dependent_variable ~ .` rather than `train$dependent_variable ~ .` — combining the `$` form with `data = train` would also include the dependent variable among the predictors.)

The problem is that whenever I try to predict on the testing set, the testing set contains some categories that the training set does not.

pred_data = predict(linear_model, newdata = test)

This gives me the following error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor origin has new levels someCategory1, SomeCategory2

Is there a way to ensure that all the categories are in both the train and testing sets or is there a workaround on this?

buzoherbert
    You can do a stratified sample, but if your training data set is 70% of your full data and you are missing categories, you either got really unlucky or you should think about collapsing the uncommon categories into an "other" bucket. – Gregor Thomas Nov 20 '17 at 20:37
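
Following the suggestion in that comment, collapsing uncommon categories into an "other" bucket before splitting can be sketched roughly like this in base R. The column name `origin` (taken from the error message above) and the count threshold are assumptions; adjust them to your data:

```r
# Hypothetical sketch: merge rare factor levels into "other" before
# the train/test split, so no level is seen only in the test set.
collapse_rare <- function(x, min_count = 30) {
  x <- as.factor(x)
  rare <- names(which(table(x) < min_count))   # levels below the threshold
  levels(x)[levels(x) %in% rare] <- "other"    # remap them to "other"
  x
}

plot_data$origin <- collapse_rare(plot_data$origin, min_count = 30)
```

Doing this on the full data before sampling guarantees the training model has a coefficient for every level `predict()` will encounter.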

1 Answer


I ended up removing the observations with new levels from the test set. I know it has its limitations and that the OSR2 loses reliability, but it got the job done:

test <- na.omit(remove_missing_levels(fit = linear_model, test_data = test))

I found the remove_missing_levels function here.
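
For context, the core idea of that approach (this is a minimal sketch, not the linked function itself) is to blank out any test-set factor value whose level is absent from the fitted model's `xlevels`, then drop those rows with `na.omit()`:

```r
# Hypothetical sketch of the "drop unseen levels" idea: for every factor
# used in the fit, set values with levels the model never saw to NA.
drop_new_levels <- function(fit, test_data) {
  for (col in names(fit$xlevels)) {
    unseen <- !(test_data[[col]] %in% fit$xlevels[[col]])
    test_data[[col]][unseen] <- NA
  }
  test_data
}

test <- na.omit(drop_new_levels(linear_model, test))
```

After this, `predict(linear_model, newdata = test)` no longer hits the "factor has new levels" error, at the cost of evaluating on a smaller, non-random subset of the test data.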

It requires this library:

install.packages("magrittr")
library(magrittr)
buzoherbert