I've built a multiple linear regression (MLR) model in R that includes a categorical variable with about 93 levels. I tried grouping some levels or removing the predictor altogether, but this had a negative impact on the model, so I've had to leave it in. The model seems to be working fine, so I want to create a predicted vs. observed plot. However, when I run the predict function on my model, I get this error:

"Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor C has new levels xxxx, yyyy"

Has anyone had this error before? I'm not sure how to fix it, and it only comes up when I try to predict.

Here's the code I used:

lm12 <- lm(log(A) ~ B + C + log(D) + E + F + log(G) + log(H), data = mydata)
pred <- predict(lm12, mydata)

(B and C are categorical, the rest are continuous.)

Thank you!

  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. The error message indicates that one of your variables (`C`) has a value in your prediction set that did not appear in your training set. You can't predict on categorical values you've never seen before with simple linear models. – MrFlick Jan 06 '20 at 16:23
  • Try setting the levels on the test data to be the same as on the training data: `levels(testdata$column) <- levels(traindata$column)` – bgaerber Jan 06 '20 at 16:28
  • This is a recurring issue, and I am not sure there is a single solution. You could, for example, remove the variable(s) concerned (your model doesn't know what to do with new levels) or find a similar profile (based on row comparison) and assign the new levels to a pre-existing class. – cbo Jan 06 '20 at 16:32
  • @bgaerber That's a very dangerous code recommendation: you will completely change the meaning of the levels if they are not already the same and in exactly the same order. For example, if you had `testdata <- data.frame(column = c("Male", "Female")); traindata <- data.frame(column = c("Man", "Woman"))` you would be swapping the meaning of the values. – MrFlick Jan 06 '20 at 16:33
  • Thank you everyone for your advice. I wasn't sure how to create a reproducible example since the data is so complex. I realised that some of the levels in the training set may not appear in the test set and vice versa, so I have used all the data for training, but the error still appears? – Emily Drew Jan 06 '20 at 16:48
  • A cool trick is to use the `recipes` package, mainly `step_other`. – Bruno Jan 06 '20 at 17:38
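
The comments above can be made concrete with a small sketch. It is not based on the asker's data, which was never posted: the data frame `toy` and its values are made up, and the cause it shows (a level of `C` that occurs only in rows `lm()` drops because another predictor is missing) is only one possible explanation for seeing the error when predicting on the training data itself. The sketch shows how to list the levels the model does not know and how to get `NA` predictions for those rows instead of an error, without relabelling any levels.

```r
# Hypothetical toy data: level "c" of C occurs only in a row where D is missing.
set.seed(1)
toy <- data.frame(
  A = exp(rnorm(8)),
  C = factor(c("a", "a", "a", "b", "b", "b", "a", "c")),
  D = c(1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, NA)
)

# With the default na.action (na.omit), lm() drops the incomplete row,
# so level "c" is never seen while fitting.
fit <- lm(log(A) ~ C + log(D), data = toy)

# Levels the fitted model actually knows about.
known <- fit$xlevels$C

# predict() keeps incomplete rows by default, so the same data frame
# now raises the "new levels" error.
res <- try(predict(fit, toy), silent = TRUE)
failed <- inherits(res, "try-error")

# Diagnose: which values of C in the prediction data are unknown to the model?
unseen <- setdiff(unique(as.character(toy$C)), known)

# One option: re-code C against the known levels by name, so that unseen
# values become NA and predict() returns NA for those rows.
newdata <- toy
newdata$C <- factor(as.character(newdata$C), levels = known)
pred <- predict(fit, newdata)
```

In the question's terms, the same check would be `setdiff(unique(as.character(mydata$C)), lm12$xlevels$C)`. If the goal is only a predicted vs. observed plot on the data used for fitting, `fitted(lm12)` plotted against `model.response(model.frame(lm12))` avoids `predict()` altogether, since both cover exactly the rows the model was fitted on.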
