
I have a cross-sectional data set repeated over two years, 2009 and 2010. I am using the first year (2009) as the training set for a Random Forest regression and the second year (2010) as the test set.

Load the data

df <- read.csv("https://www.dropbox.com/s/t4iirnel5kqgv34/df.cv?dl=1")

After training the Random Forest for 2009 the variable importance indicates the variable x1 is the most important one.

Random Forest using all variables

library(randomForest)

set.seed(89)
rf2009 <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6,
                       data = df[df$year == 2009, ],
                       ntree = 500,
                       mtry = 6,
                       importance = TRUE)
print(rf2009)
print(rf2009)

Call:
 randomForest(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = df[df$year ==      2009, ], ntree = 500, mtry = 6, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 6

          Mean of squared residuals: 5208746
                    % Var explained: 75.59
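As a side note, mtry = 6 means all six predictors are candidates at every split, which makes this bagging rather than a typical random forest; randomForest's default for regression would be floor(6/3) = 2. A minimal sketch with the default left in place (rf2009.default is just an illustrative name):

set.seed(89)
rf2009.default <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6,
                               data = df[df$year == 2009, ],
                               ntree = 500,  # mtry left at its default, floor(6/3) = 2
                               importance = TRUE)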

Variable importance

# Column 1 of importance() is %IncMSE when importance = TRUE
imp.all <- as.data.frame(sort(importance(rf2009)[, 1], decreasing = TRUE), optional = TRUE)
names(imp.all) <- "% Inc MSE"
imp.all

   % Inc MSE
x1 35.857840
x2 16.693059
x3 15.745721
x4 15.105710
x5  9.002924
x6  6.160413
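For a quick visual check of both importance measures, randomForest also provides varImpPlot:

varImpPlot(rf2009)  # plots %IncMSE and IncNodePurity side by side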

I then move on to the test set and obtain the following accuracy metrics.

Prediction and evaluation on the test set

test.pred.all <- predict(rf2009,df[df$year==2010,])
RMSE.forest.all <- sqrt(mean((test.pred.all-df[df$year==2010,]$y)^2))
RMSE.forest.all
[1] 2258.041

MAE.forest.all <- mean(abs(test.pred.all-df[df$year==2010,]$y))
MAE.forest.all
[1] 299.0751
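As a side note, the same two metrics are computed again for the model without x1 below; a small helper would keep the calculations consistent (rmse_mae is an illustrative name, not from the original code):

# Hypothetical helper: RMSE and MAE of a fitted model on held-out data
rmse_mae <- function(model, newdata) {
  pred <- predict(model, newdata)
  c(RMSE = sqrt(mean((pred - newdata$y)^2)),
    MAE  = mean(abs(pred - newdata$y)))
}

rmse_mae(rf2009, df[df$year == 2010, ])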

When I then train the model without x1, the most important variable according to the above, and apply it to the test set, I observe the following:

  • the training-set variance explained is higher with x1 than without it, as expected

  • but the RMSE on the test data is better (lower) without x1: 2258.041 with x1 vs. 1885.462 without it

  • nevertheless, the MAE is slightly better with x1 (299.0751) than without it (302.3382).

Random Forest excluding x1

rf2009nox1 <- randomForest(y ~ x2 + x3 + x4 + x5 + x6,
                           data = df[df$year == 2009, ],
                           ntree = 500,
                           mtry = 5,
                           importance = TRUE)
print(rf2009nox1)

Call:
 randomForest(formula = y ~ x2 + x3 + x4 + x5 + x6, data = df[df$year ==      2009, ], ntree = 500, mtry = 5, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 5

          Mean of squared residuals: 6158161
                    % Var explained: 71.14

Variable importance

imp.nox1 <- as.data.frame(sort(importance(rf2009nox1)[, 1], decreasing = TRUE), optional = TRUE)
names(imp.nox1) <- "% Inc MSE"
imp.nox1

   % Inc MSE
x2 37.369704
x4 11.817910
x3 11.559375
x5  5.878555
x6  5.533794

Prediction and evaluation on the test set

test.pred.nox1 <- predict(rf2009nox1,df[df$year==2010,])
RMSE.forest.nox1 <- sqrt(mean((test.pred.nox1-df[df$year==2010,]$y)^2))
RMSE.forest.nox1
[1] 1885.462

MAE.forest.nox1 <- mean(abs(test.pred.nox1-df[df$year==2010,]$y))
MAE.forest.nox1
[1] 302.3382

I am aware that variable importance is computed on the trained model, not on the test set, but does this mean that x1 should not be included in the model?

So, should I include x1 in the model?


1 Answer


I think you need more information about the performance of the model. With only one test set, you can only speculate about why the RMSE is better without x1 even though x1 has the highest importance. It could be correlation between the variables, or the model could be fitting noise in the training set.

To get more information, I would recommend looking at the out-of-bag error and doing hyperparameter optimization with cross-validation. If you see the same behavior across different test sets, you could run cross-validation with and without x1, as sketched below.
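A minimal sketch of that comparison with caret (the formulas and column names come from the question; method = "rf" and the repeated-CV settings are assumptions):

library(caret)

train09 <- df[df$year == 2009, ]
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

set.seed(89)
cv.all  <- train(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = train09,
                 method = "rf", trControl = ctrl)
set.seed(89)
cv.nox1 <- train(y ~ x2 + x3 + x4 + x5 + x6, data = train09,
                 method = "rf", trControl = ctrl)

# Compare resampled RMSE/MAE distributions rather than a single test split
summary(resamples(list(all = cv.all, nox1 = cv.nox1)))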

Hope this helps.

padul
  • Thank you! The same happens when I use CV with `caret`. **With `x1`:** out-of-bag R^2 for the training set is 0.94, `varImp()` shows that x1 is the most important variable and x4 the fourth most important, and R^2 for the test set is 0.80. **Without `x1`:** out-of-bag R^2 for the training set is 0.87 (i.e. lower than before) and x2 is the most important predictor. Strangely, R^2 for the test set is higher than before: 0.86. RMSE and MAE are smaller without `x1`! In terms of descriptive statistics, the correlation between y and x1 is 0.95, while between y and x2 it is only -0.2. – et_ May 03 '20 at 10:00
  • I think for the comparison with and without x1 it makes more sense to look at the RMSE. R^2 increases when you add another independent variable, so the drop in out-of-bag R^2 from the model with x1 to the one without it could simply be due to that. – padul May 07 '20 at 16:42
  • To understand the variable importance, look at the correlations between the independent variables. Variable importance only says something about how important a variable is to the model's predictions, not about the strength of the real relationship. A variable with a small direct effect can become the most important variable through strong indirect effects (see the sketch after these comments). – padul May 07 '20 at 16:43
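For example, the correlations among the predictors (and with y) on the training year can be inspected directly; a minimal sketch, with the column names taken from the question's formulas:

# Correlations among y and the predictors in the 2009 training data
round(cor(df[df$year == 2009, c("y", "x1", "x2", "x3", "x4", "x5", "x6")]), 2)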