
I have split my data set into testing and training data sets. I've tried to fit a regression on the training set, and then use predict on the testing set. When I do this I get an error message that says: "Error in model.frame factor x has New Levels". I know this is because there are levels in my testing data not seen in my training data.

What I want to do is just eliminate or ignore the levels that aren't in both data sets. I've tried to do this, but it isn't setting any levels to NA, and the id object says "integer (empty)":

id <- which(!(test$x %in% levels(train$x)))
train$x[id] <- NA

fit <- lm(y ~ x, data=train)
P <- predict(fit,test)
Zheyuan Li
grig109
  • But even before needing to add the droplevels command, the first part isn't working properly. It seems that I either get an empty integer, or an error saying that the replacement has 190708 rows, data has 189590. – grig109 Jan 07 '17 at 16:28

1 Answer


You will get a "replacement length differs" error with your code.

id <- which(!(test$x %in% levels(train$x)))

tells you which elements of test$x are not in levels(train$x), so you should use id to index test$x, not train$x, when doing the replacement.

test$x[id] <- NA
test$x <- droplevels(test$x)  ## also don't forget to remove unused factor levels

fit <- lm(y ~ x, data = train)
P <- predict(fit, test)

All data in train will be used to build your linear regression model. Some predictions in P will be NA.
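
To see the steps above run end to end, here is a minimal sketch on made-up data. The toy train and test data frames are assumptions for illustration only; level "c" is placed in the test set alone so that it plays the role of the new level.

```r
# Hypothetical toy data: level "c" occurs only in the test set.
train <- data.frame(y = c(1, 2, 3, 4),
                    x = factor(c("a", "a", "b", "b")))
test  <- data.frame(x = factor(c("a", "b", "c")))

# Positions in test$x whose value is not a level of train$x.
id <- which(!(test$x %in% levels(train$x)))

# Blank out those test rows and drop the now-unused level.
test$x[id] <- NA
test$x <- droplevels(test$x)

# Fit on all training rows; rows with NA in test$x get an NA prediction.
fit <- lm(y ~ x, data = train)
P <- predict(fit, test)
```

Here id points at the third test row, and the third element of P is NA while the first two are ordinary predictions.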


I'm still unable to get the id object to correctly identify which levels are not in both data sets. In the work-space it just shows integer(0).

Then what is the point of your question? If id is integer(0), every value in test$x is inside levels(train$x), so this check finds no new level.
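
One situation that could produce both symptoms at once (an empty id and the "new levels" error) is a training factor that carries unused levels. This is only a guess about the asker's data: it assumes train and test were made by subsetting one larger data frame, which keeps every level in levels(train$x) even when no training row has it. lm() drops unused levels when it builds the model frame, so predict() still sees those values as new. The sketch below uses a hypothetical data frame named full and compares against the values actually present in train$x instead of its levels.

```r
# Hypothetical data: both subsets inherit the levels "a", "b", "c".
full <- data.frame(y = c(1, 2, 3, 4, 5, 6),
                   x = factor(c("a", "a", "b", "b", "c", "c")))
train <- full[1:4, ]          # no row has "c", but "c" is still a level
test  <- full[c(2, 5, 6), ]

# Checking against levels() finds nothing, because "c" is a declared level.
id_levels <- which(!(test$x %in% levels(train$x)))

# Checking against the values that actually occur in train finds the "c" rows.
id_values <- which(!(test$x %in% unique(train$x)))

test$x[id_values] <- NA
test$x <- droplevels(test$x)

fit <- lm(y ~ x, data = train)
P <- predict(fit, test)       # NA for the rows that held "c"
```

Calling droplevels(train$x) before the original levels() check should have the same effect; which form to use is a matter of taste.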

Zheyuan Li
  • I'm still unable to get the Id object to correctly identify which levels are not in both data sets. In the work-space it just shows integer (empty). – grig109 Jan 07 '17 at 22:09
  • Because I get an error message that says "Error in model.frame factor x has new levels." This seems to suggest that all the levels in test$x are not in train$x. – grig109 Jan 08 '17 at 01:24