R: variable exclusion from formula not working in presence of missing data

Question

I'm building a model in R and excluding the 'office' column in the formula (it sometimes contains hints of the class I predict). I train on 'train' and predict on 'test':

> model <- randomForest::randomForest(tc ~ . - office, data = train, importance = TRUE, proximity = TRUE)
> prediction <- predict(model, test, type = "class")

The prediction consists entirely of NAs:

> head(prediction)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005

The reason is that test$office contains NAs:

> head(test$office)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005

I can work around the problem by overwriting the NAs with a dummy value:

> test2 <- test
> test2$office <- 1
> prediction <- predict(model, test2, type = "class")
> head(prediction)
   3    5   10   12   14   18 
 2921 2752 2921 2752 2921 2752 
Levels: 2668 2752 2921 3005

I can avoid the problem by explicitly removing the 'office' column from the train data, rather than from the formula:

> model <- randomForest::randomForest(tc ~ ., data = train[, !(names(train) %in% c('office'))], importance = TRUE, proximity = TRUE)
> prediction <- predict(model, test, type = "class")
> head(prediction)
   3    5   10   12   14   18 
3005 2752 3005 2752 2921 2752 
Levels: 2668 2752 2921 3005

My questions: what is the reason for this behavior?

Was the formula tc ~ . - office meant to exclude 'office' from the model?

Is there an elegant solution here?

EDIT:

user agenis asked for the result of str(test); I masked some of the field names:

str(test)
'data.frame':   792 obs. of  15 variables:
 $ XXX              : Factor w/ 2 levels "Force","Retry": 1 2 2 1 2 2 1 1 1 1 ...
 $ XXX                  : Factor w/ 15 levels "25 Westend, Birmingham",..: 6 13 6 15 13 15 10 3 5 12 ...
 $ XXX                  : Factor w/ 3 levels "Instructions Info 1",..: 2 2 3 2 2 2 2 3 3 3 ...
 $ XXX                  : Factor w/ 3 levels "Remittance Info 1",..: 3 1 3 1 2 2 1 1 1 1 ...
 $ XXX                  : Factor w/ 3 levels "CRED","DEBT",..: 3 2 1 2 1 2 1 2 2 3 ...
 $ XXX                  : Factor w/ 3 levels "INTC","LOAN",..: 2 2 2 3 1 3 1 1 3 3 ...
 $ XXX                  : Factor w/ 15 levels "25 Westend, Birmingham",..: 3 9 15 14 5 15 10 11 2 7 ...
 $ XXX                  : Factor w/ 2 levels "SDVA","URGP": 1 2 1 1 1 2 2 2 2 1 ...
 $ XXX                  : Factor w/ 3 levels "CNY","EUR","GBP": 1 2 1 1 2 1 2 1 2 3 ...
 $ XXX                  : Factor w/ 19 levels "BNKADE22XXX",..: 3 19 11 11 4 8 8 8 19 3 ...
 $ XXX                  : Factor w/ 4 levels "_NV_E_","CNY",..: 1 3 2 2 3 2 3 2 3 1 ...
 $ XXX                  : Factor w/ 9 levels "BNKADE22XXX",..: 3 9 1 1 4 8 8 8 9 3 ...
 $ tc                   : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...
 $ office               : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...

Shay

score 0 · Answer 1 · answered Sep 28 '17 at 08:53

When you use:

FIT <- glm(tc~., data = train)

you use every variable except tc (the response variable) as an explanatory variable.

Furthermore, when you run

FIT <- glm(tc~. - office, data = train)

you use every variable except tc (the response variable) and office as an explanatory variable.
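To illustrate on a built-in dataset, here is a minimal sketch using mtcars, with am standing in for office (an assumption for illustration only, since the asker's data is not available). It compares the model terms with and without the subtraction, using only base R's terms():

```r
# Illustrative stand-in: mtcars replaces `train`, `am` replaces `office`.
full    <- terms(mpg ~ .,      data = mtcars)
reduced <- terms(mpg ~ . - am, data = mtcars)

attr(full, "term.labels")     # every column except the response
attr(reduced, "term.labels")  # the same, minus the subtracted column

# The only term removed by the subtraction:
dropped <- setdiff(attr(full, "term.labels"), attr(reduced, "term.labels"))
dropped
```

This only shows which terms enter the model; it says nothing about how missing values in the subtracted column are handled.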

R18

I'm not clear - this is exactly what I was doing in the first place- excluding 'office' from the formula, see my first line of code. the problem is that apparently it did not work. – kamashay Sep 28 '17 at 09:09

agenis · Answer 2 · 2017-09-28T09:47:08.473

For some reason, the randomForest function first checks for missing values in the whole data before looking at what's inside your formula. It returns an error if there is an NA in any column, whichever one it is:

Error in na.fail.default(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, : missing values in object

If there are no missing observations, the formula you specified is correct and will not use the column specified with the minus sign.
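One plausible mechanism, sketched here with base R only: the formula interface builds a model frame first, and model.frame() keeps every variable that `.` expands to, including one that is subtracted afterwards, so the NA handling sees it as well. The example uses mtcars with am as the excluded column (an illustrative stand-in) and na.omit rather than the na.fail default, so that it runs to completion. Whether predict() for a randomForest object goes through the same step is an assumption here, though it would be consistent with the all-NA predictions described in the question.

```r
df <- mtcars
df[2:10, "am"] <- NA   # missing values only in the column we exclude

# Build the model frame the way a formula interface typically does.
mf <- model.frame(mpg ~ . - am, data = df, na.action = na.omit)

has_am       <- "am" %in% names(mf)   # is the excluded column still in the frame?
rows_dropped <- nrow(df) - nrow(mf)   # rows lost to NA handling

has_am
rows_dropped
```

If the excluded column is present in the frame, rows with NA in that column are dropped even though the column is not a model term.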

Two possibilities then:

1. Specify the argument na.action=na.pass to bypass the first NA check; the algorithm will then run without error. This argument literally means "take no action", i.e. keep the NAs and see what happens. It differs from na.exclude, which removes entire rows (which you don't want, because the other variables in those rows are non-missing).
2. Pre-process the data manually to remove either the rows with missing values or the entire column.

Code example:

df <- mtcars
df[2:10, 'am'] <- NA
fit <- randomForest::randomForest(mpg ~ . - am, df, na.action = na.pass)
fit$importance  # check that the am variable is absent:
####      IncNodePurity
#### cyl      169.05853
#### disp     267.94975
#### hp       167.03634
#### drat      66.45550
#### wt       276.21383
#### qsec      25.33688
#### vs        30.48513
#### gear      15.39151
#### carb      24.60022
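For the second possibility, a minimal base-R sketch, assuming a hypothetical train/test split of mtcars with am playing the role of office: once the column is dropped from the training data, the terms never mention it, so missing values in that column of the test set should no longer affect which rows are kept. Only terms() and model.frame() are used here; train_clean is an illustrative name, and the same data frame could then be passed to randomForest.

```r
# Hypothetical split; `am` stands in for the excluded `office` column.
train <- mtcars[1:20, ]
test  <- mtcars[21:32, ]
test$am <- NA                      # excluded column is entirely missing in test

# Drop the column from the training data instead of from the formula.
train_clean <- train[, setdiff(names(train), "am")]

# Terms built from `mpg ~ .` on the cleaned data do not reference `am` ...
tt <- terms(mpg ~ ., data = train_clean)
mentions_am <- "am" %in% all.vars(attr(tt, "variables"))

# ... so NA handling on the test set ignores it and keeps every row.
mf <- model.frame(delete.response(tt), test, na.action = na.omit)
rows_kept <- nrow(mf)

mentions_am
rows_kept
```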

edited Sep 28 '17 at 09:47

answered Sep 28 '17 at 09:40


please clarify - is your answer referring to the model building or prediction phase (from your example it seems model building). my code returned no error in any stage, it just returned NAs during prediction. I made the check and my model indeed did not use the excluded variable in the formula. however, specifying the argument 'na.action=na.pass' did not help, and I still experience NAs during prediction in my scenario described. – kamashay Sep 28 '17 at 11:05
added to the original question, please see above. not sure why that's the print method, but these are NAs; I checked via the is.na function. indeed, there are no error messages. – kamashay Sep 28 '17 at 14:07
@kamashay Hi there. I did some digging but couldn't find an explanation. There are plenty of SO questions mentioning problems with NA in the randomForest algorithm, saying that it is not good at handling NA and that the behaviour changed across versions. Maybe that's something to check (what is your version?). Anyway, sorry I couldn't help more; I'll leave my answer for now, until someone has a better guess – agenis Sep 28 '17 at 14:49
package randomForest 4.6-12. thanks for your help. I'll keep the safe track and go with possibility #2 - eliminate the variables manually. – kamashay Oct 01 '17 at 09:16
yes, you can do that. If it's critical for your work, I can start a bounty on your question so that you'll be pretty sure to get more answers. I can use +25 of my reputation. – agenis Oct 01 '17 at 09:56
