2

I am using R package randomForest and to understand variable importance we can investigate varImpPlot which shows Mean decrease Gini. I have studied Random Forest in detail and am well aware of how this model works in detail, there is something that I am unable to completely understand regarding how Mean decrease Gini is calculated, or rather why it is dependent on the size of the population.

enter image description here

When we have calculated Gini index we are able to aggregate the mean decrease Gini by the following formula (divided by number of trees):

enter image description here

I understand that there will be more number of splits in each tree when having a larger population, but shouldn't these splits on average have very small decreases in Gini index?

Here is example code showing what I mean (as expected, number of trees does not affect mean decrese Gini but population has a huge effect and seems to be more or less linear with population size):

install.packages("randomForest")
library(randomForest)

set.seed(1)
a <- as.factor(c(rep(1, 20), rep(0, 30)))
b <- c(rnorm(20, 5, 2), rnorm(30, 4, 1))
c <- c(rnorm(25, 0, 1), rnorm(25, 1, 2))
data <- data.frame(a = a, b = b, c = c)

rf <- randomForest(data = data, a ~ b + c, importance = T, ntree = 300)
varImpPlot(rf)


a2 <- as.factor(c(rep(1, 200), rep(0, 300)))
b2 <- c(rnorm(200, 5, 2), rnorm(300, 4, 1))
c2 <- c(rnorm(250, 0, 1), rnorm(250, 1, 2))
data2 <- data.frame(a2 = a2, b2 = b2, c2 = c2)

rf2 <- randomForest(data = data2, a2 ~ b2 + c2, importance = T, ntree = 
300)
varImpPlot(rf2)


a3 <- as.factor(c(rep(1, 2000), rep(0, 3000)))
b3 <- c(rnorm(2000, 5, 2), rnorm(3000, 4, 1))
c3 <- c(rnorm(2500, 0, 1), rnorm(2500, 1, 2))
data3 <- data.frame(a3 = a3, b3 = b3, c3 = c3)

rf3 <- randomForest(data = data3, a3 ~ b3 + c3, importance = T, ntree = 
300)
varImpPlot(rf3)

Resulting in these following plots, where we see that the x-axis increases approximately 10x for each increase in population:

enter image description here

enter image description here

enter image description here

My guess is that there is a weight based on number of people in each split conducted That is, a split that is made in first nodes that splits 1000 people weights heavier than a split that is conducted further down the tree with say 10 people, I can't find this in any literature though since it seems that all calculations are made by taking fractions of population into regard rather than absolute numbers.

What is it that I am missing?

Raz89
  • 45
  • 1
  • 6

1 Answers1

1

Your guess is correct.

You have written down the definition of Gini impurity for a single split. Trees in a random forest are usually split multiple times. The higher nodes have more samples, and intuitively, are more "impure". So the formula for mean decrease in Gini takes the node sizes into account.

So instead of

Delta i(tau) = i(tau) - (n_l/n) i(tau_l) - (n_r/n) i(tau_r)

the decrease in impurity is calculated as

Delta i(tau) = n i(tau) - n_l i(tau_l) - n_r i(tau_r)

That is, weight the impurities by the raw counts, not the proportions.

The algorithm keeps splitting the tree to the maximum possible size (unless you specify the nodesize or maxnodes arguments). So a feature can be chosen for the splitting criterion several times. Its overall importance is the sum of the Deltas at those splits. This is the importance computation for one tree. Finally the importances are averaged across all the trees in the forest.

Let's show this with a very contrived example.

library("randomForest")
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
set.seed(1)

n <- 1000
# There are three classes in equal proportions
a <- rep(c(-10,0,10), each = n)
# One feature is useless
b <- rnorm(3*n)
# The other feature is highly predictive but we need at least two splits
c <- rnorm(3*n, a)
data <- data.frame(a = as.factor(a), b = b, c = c)

# First let's do just one split, i.e., ask for just two terminal nodes

# Expected MeanDecreaseGini:
# With one split the best we can do is separate one class from the other two
3000*(2/3) - 1000*0 - 2000*(1/2)
#> [1] 1000

# Actual MeanDecreaseGini
rf3 <- randomForest(data = data, a ~ b + c, importance = TRUE,
                    ntree = 1000, mtry = 2, maxnodes = 2)
rf3$importance[, "MeanDecreaseGini"]
#>        b        c 
#>    0.000 1008.754


# Next let's do two splits; this is enough to separate classes perfectly

# Expected MeanDecreaseGini:
3000*(2/3) - 1000*0 - 2000*(1/2)  +   2000*(1/2) - 1000*0 - 1000*0
#> [1] 2000

# Actual MeanDecreaseGini
rf3 <- randomForest(data = data, a ~ b + c, importance = TRUE,
                    ntree = 1000, mtry = 2, maxnodes = 3)
rf3$importance[, "MeanDecreaseGini"]
#>        b        c 
#>    0.000 1999.333

Created on 2019-03-08 by the reprex package (v0.2.1)

PS: It is all well and good to know how importances are computed with the Gini criterion. But read this article for an explanation why you should use permutation importances instead: https://explained.ai/rf-importance/index.html

dipetkov
  • 3,380
  • 1
  • 11
  • 19
  • Thank you so much for confirming my theory! I will absolutely have a look at permutation importance. For now, I am not using variable importance to remove any variables (since I have quite few variables available and rather large number of data points), I am more trying to get a general overview of approximate importance. Do you happen to have an article/book confirming the calculation that you provided? I have been struggling to find a source backing my theory (since I will be needing reliable sources backing my claims further down the road). Thanks again, appreciated! – Raz89 Mar 08 '19 at 01:56
  • I tried to google for some documentation which gives the formula explicitly but didn't come across anything specific. So instead I did a little simulation, where it's easy to compute the decrease in Gini impurity by hand. It works as expected, at least on this example. – dipetkov Mar 09 '19 at 18:56
  • Alright, thank you once again. Since I'm writing a report I have to find something that support the equations that I provide so I will have to continue my search for it. Seems so strange that I can't find any documentation regarding this, relatively simple, area. – Raz89 Mar 11 '19 at 01:57