3

I am running a random forest in R with the package randomForest.

I have two questions:

  1. Is it correct that when using this package the default criterion is Mean Decrease in Gini?

  2. I plot the variable importance with varImpPlot and obtain two measures of importance: Mean Decrease Accuracy and Mean Decrease Gini. How can I use the former for actually splitting the nodes?

rlandster
Matilde

2 Answers

1

Yes, the standard criterion for choosing a split in classification trees is the decrease in the Gini index. An alternative is an entropy-based criterion, but the results are similar and its formula contains logarithms, so it is usually slower to compute.
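To make the two criteria concrete, here is a minimal base R sketch of how the decrease in Gini and in entropy could be computed for one candidate split. The function names (`gini`, `entropy`, `impurity_decrease`) and the class counts are made up for illustration; this is not the code randomForest runs internally.

```r
# Impurity of a node, given its vector of class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)  # drop empty classes: 0 * log(0) is taken as 0
  -sum(p * log2(p))
}

# Decrease in impurity when a parent node is split into two children
impurity_decrease <- function(impurity, left, right) {
  parent <- left + right
  n <- sum(parent)
  impurity(parent) -
    (sum(left) / n) * impurity(left) -
    (sum(right) / n) * impurity(right)
}

# Hypothetical node: 10 observations of class A and 10 of class B
left  <- c(A = 8, B = 2)
right <- c(A = 2, B = 8)

gini_gain    <- impurity_decrease(gini, left, right)
entropy_gain <- impurity_decrease(entropy, left, right)
```

A tree would evaluate a quantity like this for every candidate split and keep the one with the largest decrease.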

Splitting on the decrease in accuracy is usually not implemented in packages (it is not available in R's randomForest or ranger, nor in scikit-learn in Python), as it does not respect some basic properties of a loss function and gives straight-up bad results.
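One commonly cited illustration of the problem, sketched in base R: misclassification error can rate two splits as equally good even though one of them produces a pure node, while Gini prefers the split with the pure node. The helper names and the counts below are hypothetical, chosen only to show the effect.

```r
# Node impurities from a vector of class counts
misclass <- function(counts) 1 - max(counts) / sum(counts)
gini     <- function(counts) 1 - sum((counts / sum(counts))^2)

# Weighted impurity of the two children produced by a split
split_impurity <- function(impurity, left, right) {
  n <- sum(left) + sum(right)
  (sum(left) / n) * impurity(left) + (sum(right) / n) * impurity(right)
}

# Hypothetical parent node with 400 observations in each of two classes
split_a <- list(left = c(300, 100), right = c(100, 300))
split_b <- list(left = c(200, 400), right = c(200, 0))  # right child is pure

err_a  <- split_impurity(misclass, split_a$left, split_a$right)
err_b  <- split_impurity(misclass, split_b$left, split_b$right)
gini_a <- split_impurity(gini, split_a$left, split_a$right)
gini_b <- split_impurity(gini, split_b$left, split_b$right)
```

Here `err_a` and `err_b` are both 0.25, so an accuracy-based criterion cannot choose between the splits, whereas `gini_b` is lower than `gini_a`.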

You can find some details in https://arxiv.org/pdf/1407.7502.pdf if you want, around pages 42-45.

Davide ND
  • Thank you Davide! Maybe it's a trivial question, but why is the decrease in accuracy computed and not implemented? – Matilde Dec 06 '19 at 09:06
  • @Matilde usually the decrease in accuracy is computed by looking at the change in total accuracy after predicting the data with a given column permuted, and it gives a more understandable "score" of the importance. Mean decrease in Gini is instead computed by summing all the Gini decreases obtained when splitting on a given variable, and it is a less reliable score for importance. Gini HAS to be used for splitting, while accuracy (or any score you get via permutation importance) is best for importances. On this topic I really suggest this link: https://explained.ai/rf-importance/#intro – Davide ND Dec 06 '19 at 12:24
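As a rough sketch of how the two importance measures discussed above can be pulled out of a randomForest fit, the example below uses the built-in `iris` data as a stand-in (an assumption; the asker's data is not shown). Either way the trees are grown with Gini splits, and only the importance scores differ.

```r
library(randomForest)

set.seed(42)  # the forest and its out-of-bag permutations are random

# importance = TRUE is needed for the permutation-based measure to be computed
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 1: mean decrease in accuracy (permutation importance)
perm_imp <- importance(fit, type = 1)
# type = 2: mean decrease in node impurity (Gini for classification)
gini_imp <- importance(fit, type = 2)
```

Calling `varImpPlot(fit)` on the same object plots both measures side by side.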
0

The following code (from a Titanic dataset example) shows how to switch between Gini and entropy:

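library(rpart)
# The split criterion goes in parms: "gini" (default) or "information" (entropy)
fit <- rpart(Survived ~ Class + Age + Gender, data = TitanicTrain, method = "class",
             parms = list(split = "information"), control = rpart.control(cp = 0.05))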
mckennae
  • This is for a single tree from package `rpart`, not for a random forest. It has nothing to do with answering this question. – FXQuantTrader Feb 28 '21 at 10:39