I'm building a random forest model in R and excluding the 'office' column in the formula (it sometimes contains hints about the class I'm predicting). I train on 'train' and predict on 'test':
> model <- randomForest::randomForest(tc ~ . - office, data = train, importance = TRUE, proximity = TRUE)
> prediction <- predict(model, test, type = "class")
The prediction comes back as all NAs:
> head(prediction)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005
The reason seems to be that test$office contains NAs:
> head(test$office)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005
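To check which columns of a data frame hold NAs (and which are NA in every row), something like the following base R sketch can be used. The data frame toy_test is made up for illustration; only the column names 'tc' and 'office' mirror my real data.

```r
# Made-up stand-in for 'test'; the values are hypothetical.
toy_test <- data.frame(
  x = c(1, 2, NA, 4),
  tc = factor(c(NA, NA, NA, NA), levels = c("2668", "2752")),
  office = factor(c(NA, NA, NA, NA), levels = c("2668", "2752"))
)

# Number of NA values per column.
na_counts <- colSums(is.na(toy_test))
print(na_counts)

# Names of the columns that are NA in every row.
all_na <- names(na_counts)[na_counts == nrow(toy_test)]
print(all_na)
```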
I can work around the problem by overwriting the NAs in 'office' with a dummy value:
> test2 <- test
> test2$office <- 1
> prediction <- predict(model, test2, type = "class")
> head(prediction)
3 5 10 12 14 18
2921 2752 2921 2752 2921 2752
Levels: 2668 2752 2921 3005
I can also avoid the problem by explicitly removing the 'office' column from the training data, rather than from the formula:
> model <- randomForest::randomForest(tc ~ ., data = train[, !(names(train) %in% c('office'))], importance = TRUE, proximity = TRUE)
> prediction <- predict(model, test, type = "class")
> head(prediction)
3 5 10 12 14 18
3005 2752 3005 2752 2921 2752
Levels: 2668 2752 2921 3005
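The column drop itself can be written a bit more compactly in base R. This is only a sketch on a made-up data frame (toy_train and its values are hypothetical); either result could be passed as the data argument in place of the %in% expression.

```r
# Made-up stand-in for 'train'; the values are hypothetical.
toy_train <- data.frame(
  tc = factor(c("a", "b", "a")),
  x = c(1, 2, 3),
  office = factor(c("a", "b", "a"))
)

# Option 1: keep every column whose name is not "office".
by_setdiff <- toy_train[, setdiff(names(toy_train), "office"), drop = FALSE]

# Option 2: subset() with a negative column selection.
by_subset <- subset(toy_train, select = -office)

print(names(by_setdiff))
print(names(by_subset))
```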
My questions:
What is the reason for this behavior?
Isn't the formula tc ~ . - office meant to exclude 'office' from the model?
Is there an elegant solution here?
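One thing that may be related: a base R check of what the formula does to the model frame. This is a sketch on a made-up data frame (toy and its values are hypothetical), and it uses stats::model.frame directly, on the assumption that randomForest's formula interface goes through it with na.omit; I have not confirmed that from the package source. As far as I can tell, 'office' is dropped from the model terms but still appears as a column of the model frame, so rows with NA in 'office' are removed by na.omit.

```r
# Made-up data; 'office' is NA in two of the four rows.
toy <- data.frame(
  tc = factor(c("a", "b", "a", "b")),
  x = c(1, 2, 3, 4),
  office = factor(c(NA, "a", NA, "b"))
)

# Terms actually used as predictors after "- office".
tt <- terms(tc ~ . - office, data = toy)
print(attr(tt, "term.labels"))

# Model frame built from the same formula, dropping rows with any NA.
mf <- model.frame(tc ~ . - office, data = toy, na.action = na.omit)
print(names(mf))
print(nrow(mf))
```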
EDIT:
User agenis asked for the output of str(test); I masked some of the field names:
> str(test)
'data.frame': 792 obs. of 15 variables:
$ XXX : Factor w/ 2 levels "Force","Retry": 1 2 2 1 2 2 1 1 1 1 ...
$ XXX : Factor w/ 15 levels "25 Westend, Birmingham",..: 6 13 6 15 13 15 10 3 5 12 ...
$ XXX : Factor w/ 3 levels "Instructions Info 1",..: 2 2 3 2 2 2 2 3 3 3 ...
$ XXX : Factor w/ 3 levels "Remittance Info 1",..: 3 1 3 1 2 2 1 1 1 1 ...
$ XXX : Factor w/ 3 levels "CRED","DEBT",..: 3 2 1 2 1 2 1 2 2 3 ...
$ XXX : Factor w/ 3 levels "INTC","LOAN",..: 2 2 2 3 1 3 1 1 3 3 ...
$ XXX : Factor w/ 15 levels "25 Westend, Birmingham",..: 3 9 15 14 5 15 10 11 2 7 ...
$ XXX : Factor w/ 2 levels "SDVA","URGP": 1 2 1 1 1 2 2 2 2 1 ...
$ XXX : Factor w/ 3 levels "CNY","EUR","GBP": 1 2 1 1 2 1 2 1 2 3 ...
$ XXX : Factor w/ 19 levels "BNKADE22XXX",..: 3 19 11 11 4 8 8 8 19 3 ...
$ XXX : Factor w/ 4 levels "_NV_E_","CNY",..: 1 3 2 2 3 2 3 2 3 1 ...
$ XXX : Factor w/ 9 levels "BNKADE22XXX",..: 3 9 1 1 4 8 8 8 9 3 ...
$ tc : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...
$ office : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...
Shay