
I have a hard time interpreting the data.table returned by the xgb.importance() function of xgboost, and I would appreciate any assistance in understanding the meaning and the intuition behind the columns of this table.

To make things reproducible and concrete I provide the following code:

library(data.table)
library(dplyr)
library(xgboost)

library(ISLR)

data(Auto)

Auto = Auto %>% mutate(
    origin = ifelse(origin == 2, 1, 0)  # 1 if the car is European (origin == 2), else 0
)

Auto = Auto %>% select(-name)

library(caTools)

set.seed(1)  # fix the RNG so the 80/20 split is actually reproducible
split = sample.split(Auto$origin, SplitRatio = 0.80)

train = subset(Auto, split == TRUE)

test = subset(Auto, split == FALSE)

X_train = as.matrix(train %>% select(-origin))
X_test = as.matrix(test %>% select(-origin))
Y_train = train$origin
Y_test = test$origin

# inverse-frequency weights to balance the two classes
positive = sum(Y_train == 1)
negative = sum(Y_train == 0)
Total = length(Y_train)
weight = ifelse(Y_train == 1, Total/positive, Total/negative)


# per-observation weights are attached to the DMatrix (xgb.train has no weight argument)
dtrain = xgb.DMatrix(data = X_train, label = Y_train, weight = weight)

dtest = xgb.DMatrix(data = X_test, label = Y_test)

model = xgb.train(data = dtrain,
                  verbose = 2,
                  params = list(objective = "binary:logistic"),
                  nrounds = 20)

y_pred = predict(model, X_test)

table(y_pred > 0.5, Y_test)

important_variables = xgb.importance(model = model, feature_names = colnames(X_train), data = X_train, label = Y_train)

important_variables

dim(important_variables)

The first rows of the important_variables data.table are the following:

Feature         Split    Gain         Cover        Frequency    RealCover   RealCover %
displacement    121.5    0.132621660  0.057075548  0.015075377  17          0.31481481
displacement    190.5    0.096984485  0.106824987  0.050251256  17          0.31481481
displacement    128      0.069083692  0.093517155  0.045226131  28          0.51851852
weight         2931.5    0.054731622  0.034017383  0.015075377   9          0.16666667
mpg              30.75   0.036373687  0.015353348  0.010050251  44          0.81481481
acceleration     19.8    0.030658707  0.043746304  0.015075377  50          0.92592593
displacement    169.5    0.028471073  0.035860862  0.020100503  20          0.37037037
displacement    113.5    0.028467685  0.017729564  0.020100503  27          0.50000000
horsepower       59      0.028450597  0.022879182  0.025125628  22          0.40740741
weight         2670.5    0.028335853  0.020309028  0.010050251   6          0.11111111
acceleration     15.6    0.022315984  0.026517622  0.015075377  51          0.94444444
weight         1947.5    0.020687204  0.003763738  0.005025126   7          0.12962963
acceleration     14.75   0.018458042  0.013565059  0.010050251  53          0.98148148
acceleration     19.65   0.018395565  0.006194124  0.010050251  53          0.98148148

According to the documentation, the columns are:

  • Feature: name of the features, as provided in feature_names or already present in the model dump;
  • Gain: contribution of each feature to the model. For a boosted tree model, the gain of each feature in each tree is taken into account and then averaged per feature, to give a view of the entire model. The highest percentage means the most important feature for predicting the label used in training (only available for tree models);
  • Cover: metric of the number of observations related to this feature (only available for tree models);
  • Frequency: percentage representing the relative number of times a feature has been used in the trees.
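As an aside, calling xgb.importance() with just the model, omitting data and label (if I read the docs correctly, the exact signature varies across xgboost releases), returns the more common per-feature view: one row per feature, with Gain, Cover and Frequency aggregated over all splits:

# per-feature view: no Split, RealCover or RealCover % columns
xgb.importance(feature_names = colnames(X_train), model = model)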

While Feature and Gain have obvious meanings, the columns Cover, Frequency, RealCover and RealCover% are difficult for me to interpret.

In the first row of the table important_variables we are informed that displacement has:

  • Split = 121.5
  • Gain = 0.1326
  • Cover = 0.0571
  • Frequency = 0.0151
  • RealCover = 17
  • RealCover % = 0.3148

Trying to decipher the meaning of these numbers, I ran the following code:

train %>% filter(displacement > 121.5) %>%
    summarize(Count = n(), Frequency = Count / nrow(train))

  Count Frequency
    190 0.6070288

train %>% filter(displacement > 121.5) %>% group_by(origin) %>%
    summarize(Count = n(), Frequency = Count / nrow(train))

  origin Count  Frequency
       0   183 0.58466454
       1     7 0.02236422

train %>% filter(displacement < 121.5) %>%
    summarize(Count = n(), Frequency = Count / nrow(train))

  Count Frequency
    123 0.3929712

train %>% filter(displacement < 121.5) %>% group_by(origin) %>%
    summarize(Count = n(), Frequency = Count / nrow(train))

  origin Count Frequency
       0    76 0.2428115
       1    47 0.1501597

Nevertheless, I am still in the dark.

Your advice will be appreciated.


1 Answer


The Frequency of a row is the fraction of all splits made by the model that involve that particular feature (in your table, that feature at that particular split value). You can do a sanity check by observing that the frequencies of all rows sum to 1.

sum(important_variables$Frequency)  
[1] 1  

It shows how many times a feature has been selected to split on. While not as sophisticated as Gain, it can also be used as a variable-importance metric. You can recompute it directly from the fitted trees, as sketched below.
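A minimal sketch (assuming a per-split importance table like yours; xgb.model.dt.tree marks leaf nodes with Feature == "Leaf"):

# dump the fitted trees into a data.table: one row per node, across all 20 trees
tree_dt = xgb.model.dt.tree(feature_names = colnames(X_train), model = model)

# keep only interior nodes, i.e. the actual splits
splits = tree_dt[Feature != "Leaf"]
n_splits = nrow(splits)

# share of all splits taken by each (Feature, Split) pair
splits[, .(Frequency = .N / n_splits), by = .(Feature, Split)][order(-Frequency)]

These ratios should match the Frequency column of important_variables; the numbers in your table are all consistent with a model that made 199 splits in total (e.g. 3/199 ≈ 0.01508 for displacement at 121.5).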

This also explains why you cannot reproduce the Frequency numbers with summarize operations on the training data: Frequency is computed from the trained xgboost model, not from the data.

Cover and its derivatives (RealCover and RealCover %) are not as straightforward. See the answer to this question for a detailed explanation, but the sketch below shows where the raw Cover numbers come from.
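As I understand it (worth verifying against your xgboost version), a node's Cover in the tree dump is the sum of the second-order gradients of the training observations routed through that node (p * (1 - p) per observation for binary:logistic), and the importance table reports it normalized over all splits. A sketch reusing splits from above:

# Cover share of each (Feature, Split) pair, normalized over all split nodes
total_cover = splits[, sum(Cover)]
splits[, .(Cover = sum(Cover) / total_cover), by = .(Feature, Split)][order(-Cover)]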
