I have a hard time interpreting the data.table returned by xgboost's xgb.importance() function, and I would appreciate help understanding the meaning and the intuition behind the columns of this table.
To make things reproducible and concrete I provide the following code:
library(data.table)
library(dplyr)
library(xgboost)
library(ISLR)

data(Auto)

# Recode origin as a binary label: 1 = European (origin == 2), 0 = otherwise
Auto = Auto %>% mutate(
  origin = ifelse(origin == 2, 1, 0)
)
Auto = Auto %>% select(-name)

library(caTools)
split = sample.split(Auto$origin, SplitRatio = 0.80)
train = subset(Auto, split == TRUE)
test  = subset(Auto, split == FALSE)

X_train = as.matrix(train %>% select(-origin))
X_test  = as.matrix(test %>% select(-origin))
Y_train = train$origin
Y_test  = test$origin

# Class weights to compensate for the imbalance between the two classes
positive = sum(Y_train == 1)
negative = sum(Y_train == 0)
Total    = length(Y_train)
weight   = ifelse(Y_train == 1, Total / positive, Total / negative)

# Observation weights belong in the DMatrix; passing weight = ... to
# xgb.train() directly has no effect
dtrain = xgb.DMatrix(data = X_train, label = Y_train, weight = weight)
dtest  = xgb.DMatrix(data = X_test, label = Y_test)

model = xgb.train(data = dtrain,
                  verbose = 2,
                  params = list(objective = "binary:logistic"),
                  nrounds = 20)
y_pred = predict(model, X_test)
table(y_pred > 0.5, Y_test)
important_variables = xgb.importance(model = model,
                                     feature_names = colnames(X_train),
                                     data = X_train,
                                     label = Y_train)
important_variables
dim(important_variables)
The first rows of the important_variables data.table are the following:
Feature Split Gain Cover Frequency RealCover RealCover %
displacement 121.5 0.132621660 0.057075548 0.015075377 17 0.31481481
displacement 190.5 0.096984485 0.106824987 0.050251256 17 0.31481481
displacement 128 0.069083692 0.093517155 0.045226131 28 0.51851852
weight 2931.5 0.054731622 0.034017383 0.015075377 9 0.16666667
mpg 30.75 0.036373687 0.015353348 0.010050251 44 0.81481481
acceleration 19.8 0.030658707 0.043746304 0.015075377 50 0.92592593
displacement 169.5 0.028471073 0.035860862 0.020100503 20 0.37037037
displacement 113.5 0.028467685 0.017729564 0.020100503 27 0.50000000
horsepower 59 0.028450597 0.022879182 0.025125628 22 0.40740741
weight 2670.5 0.028335853 0.020309028 0.010050251 6 0.11111111
acceleration 15.6 0.022315984 0.026517622 0.015075377 51 0.94444444
weight 1947.5 0.020687204 0.003763738 0.005025126 7 0.12962963
acceleration 14.75 0.018458042 0.013565059 0.010050251 53 0.98148148
acceleration 19.65 0.018395565 0.006194124 0.010050251 53 0.98148148
According to the documentation, the columns are:
- Feature: name of the features as provided in feature_names or already present in the model dump;
- Gain: contribution of each feature to the model. For boosted tree models, the gain of each feature in each tree is taken into account, then averaged per feature to give a view of the entire model. A higher percentage means a more important feature for predicting the training label (only available for tree models);
- Cover: a metric of the number of observations related to this feature (only available for tree models);
- Frequency: percentage representing the relative number of times a feature has been used in trees.
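My own guess (an assumption on my part, not something the documentation states) is that these per-split rows are derived from the tree dump that xgb.model.dt.tree() returns, so that, for example, the per-feature Frequency can be reproduced by counting split nodes; continuing the session above:

```r
# Dump all trees of the fitted model into one data.table
# (columns include Tree, Node, Feature, Split, Quality and Cover)
tree_dt = xgb.model.dt.tree(feature_names = colnames(X_train), model = model)

# Frequency per feature: the share of all split nodes that use that feature
tree_dt[Feature != "Leaf", .N, by = Feature][, .(Feature, Frequency = N / sum(N))]
```

Is this what the Frequency column is aggregating over?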
While Feature and Gain have obvious meanings, the columns Cover, Frequency, RealCover and RealCover% are difficult for me to interpret.
In the first row of the important_variables table we are informed that the split of displacement at 121.5 has:
- Split = 121.5
- Gain = 0.133
- Cover = 0.057
- Frequency = 0.015
- RealCover = 17
- RealCover % = 0.315
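One arithmetic pattern I did notice (my own observation, not from the documentation): on every row of the table, RealCover divided by RealCover % gives the same constant, 54, which is exactly the number of positive labels in my training set; continuing the session:

```r
# Guess: RealCover counts the positive-class (origin == 1) observations
# covered by a split, and RealCover % rescales it by the total number of
# positives in the training data
important_variables[, RealCover / `RealCover %`]  # constant 54 in my run
sum(Y_train == 1)                                 # also 54 here
```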
Trying to decipher the meaning of these numbers, I ran the following code:
train %>% filter(displacement > 121.5) %>% summarize(Count = n(), Frequency = Count/nrow(train))
Count Frequency
190 0.6070288
#
train %>% filter(displacement > 121.5) %>% group_by(origin) %>% summarize(Count = n(), Frequency = Count/nrow(train))
origin Count Frequency
0 183 0.58466454
1 7 0.02236422
#
train %>% filter(displacement < 121.5) %>% summarize(Count = n(), Frequency = Count/nrow(train))
Count Frequency
123 0.3929712
#
train %>% filter(displacement < 121.5) %>% group_by(origin) %>% summarize(Count = n(), Frequency = Count/nrow(train))
origin Count Frequency
0 76 0.2428115
1 47 0.1501597
#
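I also tried lining up the raw tree dump against the first row of the importance table (again continuing the session; I am assuming the per-node Cover reported by xgb.model.dt.tree() is the quantity that the importance table's Cover column is normalised from):

```r
# All split nodes that use displacement at the threshold 121.5
tree_dt = xgb.model.dt.tree(feature_names = colnames(X_train), model = model)
tree_dt[Feature == "displacement" & Split == 121.5,
        .(Tree, Node, Quality, Cover)]
```

but I could not see how these numbers map onto the Cover and Frequency columns.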
Nevertheless, I am still in the dark.
Your advice will be appreciated.