Here is my code:

set.seed(1)

# Boruta-style variable importance on the HouseVotes84 data from mlbench
library(mlbench) # has the HouseVotes84 data
library(h2o)     # has a random forest implementation

# spin up h2o
myh2o <- h2o.init(nthreads = -1)

# read in the data and drop rows with missing values
data(HouseVotes84)
hvo <- na.omit(HouseVotes84)

# move the data from R to h2o
mydata <- as.h2o(x = hvo,
                 destination_frame = "mydata")

# RF columns (output vs. input)
idxy <- 1            # column 1 is the class label
idxx <- 2:ncol(hvo)  # the remaining columns are predictors

# split the data 80/10/10 (the last 10% frame is created but unused here)
splits <- h2o.splitFrame(mydata,
                         c(0.8, 0.1))

train <- h2o.assign(splits[[1]], key = "train")
valid <- h2o.assign(splits[[2]], key = "valid")

# fit the random forest
my_imp.rf <- h2o.randomForest(y = idxy, x = idxx,
                              training_frame = train,
                              validation_frame = valid,
                              model_id = "my_imp.rf",
                              ntrees = 200)

# extract variable importance
my_varimp <- h2o.varimp(my_imp.rf)
my_varimp
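
(For reference, the same table can also be drawn as a bar chart, assuming your h2o build ships h2o.varimp_plot:)

# optional: bar chart of the same importances
h2o.varimp_plot(my_imp.rf)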

The output that I am getting is labeled "variable importance".

The classic measures are "mean decrease in accuracy" and "mean decrease in Gini coefficient".
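
For comparison, here is how those classic measures come out of the standard randomForest package (a sketch for contrast only; this is not part of my h2o workflow):

library(randomForest) # the classic RF implementation

# importance = TRUE is needed to get the permutation-based
# "mean decrease in accuracy" measure alongside the Gini measure
rf <- randomForest(Class ~ ., data = hvo, ntree = 200, importance = TRUE)

importance(rf, type = 1) # mean decrease in accuracy
importance(rf, type = 2) # mean decrease in Gini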

My results are:

> my_varimp
Variable Importances: 
   variable relative_importance scaled_importance percentage
1        V4         3255.193604          1.000000   0.410574
2        V5         1131.646484          0.347643   0.142733
3        V3          921.106567          0.282965   0.116178
4       V12          759.443176          0.233302   0.095788
5       V14          492.264954          0.151224   0.062089
6        V8          342.811554          0.105312   0.043238
7       V11          205.392654          0.063097   0.025906
8        V9          191.110046          0.058709   0.024105
9        V7          169.117676          0.051953   0.021331
10      V15          135.097076          0.041502   0.017040
11      V13          114.906586          0.035299   0.014493
12       V2           51.939777          0.015956   0.006551
13      V10           46.716656          0.014351   0.005892
14       V6           44.336708          0.013620   0.005592
15      V16           34.779987          0.010684   0.004387
16       V1           32.528778          0.009993   0.004103
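
The scaled_importance and percentage columns appear to be simple transforms of relative_importance, which can be sanity-checked in plain R (assuming the importance table coerces cleanly to a data.frame):

vi <- as.data.frame(my_varimp)

# scaled_importance looks like relative_importance / max(...)
all.equal(vi$scaled_importance,
          vi$relative_importance / max(vi$relative_importance))

# percentage looks like relative_importance / sum(...)
all.equal(vi$percentage,
          vi$relative_importance / sum(vi$relative_importance))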

From this, the relative importance of "Vote #4", a.k.a. V4, is ~3255.2.

Questions: What units is that in? How is it derived?

I tried looking in the documentation but am not finding the answer. I tried the help documentation, and I used Flow to look at the model parameters to see if anything there indicated it. In none of them do I find "Gini" or "decrease accuracy". Where should I look?

– EngrStudent

1 Answer


The answer is in the docs.

[ In the left pane, click on "Algorithms", then "Supervised", then "DRF". The FAQ section answers this question. ]

For convenience, the relevant FAQ entry is copied here:

"How is variable importance calculated for DRF? Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected during splitting in the tree building process and how much the squared error (over all trees) improved as a result."

– TomKraljevic, Erin LeDell

  • I think you are pointing to where it says "_Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected during splitting in the tree building process and how much the squared error (over all trees) improved as a result._" This isn't "Gini" or "decrease in accuracy". Is there an equation, a paper reference, or pseudocode? I'm finding very different behavior from the output of the R 'Boruta' library for RF. – EngrStudent Mar 16 '16 at 02:21
  • We use the same tree code in our GBM and RF, so the underlying equation is the same in both (although the algorithms work differently, so the final GBM and RF importance values will differ). The reference is equation 45 in this paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf [the equation is reproduced after these comments] – Erin LeDell Mar 17 '16 at 22:30
  • I love greedy approximations. The importance is going to be as fundamentally different as the GBM is from the RF. Thank you. – EngrStudent Mar 18 '16 at 01:06
  • @EngrStudent Am I missing something here? I don't see your question being answered: the answer refers to "squared error", and so does the relevant section of the linked paper, but your question was about a classification problem. – cryo111 Dec 11 '17 at 13:45
  • Erin pointed to it in the comment. I was able to get where I needed from there, I think. It was almost 2 years ago, so I don't have it off the top of my head. – EngrStudent Dec 11 '17 at 17:42
  • I have tried to reproduce the numbers but failed; see my post on stats.stackexchange.com: https://stats.stackexchange.com/questions/318227/random-forest-variable-importance-in-h2o-classification-problem – cryo111 Dec 11 '17 at 18:39
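
For reference, the equations Erin points to (Friedman, "Greedy Function Approximation: A Gradient Boosting Machine", eqs. 44-45, as best I can match them to the linked PDF) define the squared relative influence of variable j in a single tree T with J-1 internal splits, then average it over the M trees of the ensemble:

% squared relative influence of variable j in one tree T (eq. 44)
\hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^{\,2} \, \mathbb{1}(v_t = j)

% averaged over the M trees in the ensemble (eq. 45)
\hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(T_m)

Here \hat{i}_t^2 is the empirical squared-error improvement achieved by split t and v_t is the variable used at that split. That squared-error "gain" is presumably the quantity behind H2O's relative_importance column, which would explain why it carries no conventional units; scaled_importance and percentage then normalize it by the maximum and the sum, respectively.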