
Could someone explain how the Quality column in the xgboost R package is calculated in the xgb.model.dt.tree function?

In the documentation it says that Quality "is the gain related to the split in this specific node".

When you run the following code, given in the xgboost documentation for this function, the Quality for node 0 of tree 0 is 4000.53, yet I calculate the gain as 2002.848:

data(agaricus.train, package='xgboost')

train <- agaricus.train

X = train$data
y = train$label

bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)

p = rep(0.5,nrow(X))

L = which(X[,'odor=none']==0)
R = which(X[,'odor=none']==1)

pL = p[L]
pR = p[R]

yL = y[L]
yR = y[R]

GL = sum(pL-yL)
GR = sum(pR-yR)
G = sum(p-y)

HL = sum(pL*(1-pL))
HR = sum(pR*(1-pR))
H = sum(p*(1-p))

gain = 0.5 * (GL^2/HL+GR^2/HR-G^2/H)

gain

I understand that Gain is given by the following formula:

Gain = 1/2 * [ GL^2/(HL + lambda) + GR^2/(HR + lambda) - (GL + GR)^2/(HL + HR + lambda) ] - gamma

Since we are using log loss, G is the sum of p - y and H is the sum of p(1 - p); gamma and lambda in this instance are both zero.

Can anyone identify where I am going wrong?

dataShrimp
1 Answer


OK, I think I've worked it out. The default value of reg_lambda is not 0 as stated in the documentation; it is actually 1 (from param.h).


Also, it appears that the factor of one half is not applied when calculating the gain, so the Quality column is double what you would expect. Lastly, gamma (also called min_split_loss) does not appear in this calculation either (from updater_histmaker-inl.hpp).


Instead, gamma is used to determine whether to invoke pruning; it is not subtracted from the gain itself, contrary to what the documentation suggests.


If you apply these changes, you do indeed get 4000.53 as the Quality for node 0 of tree 0, as in the original question. I'll raise this as an issue to the xgboost guys, so the documentation can be changed accordingly.
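For concreteness, here is a sketch of the corrected calculation, reusing the G/H sums from the question but with lambda = 1, no factor of one half, and no gamma term:

    # Recompute the sums from the question (initial prediction p = 0.5)
    data(agaricus.train, package = "xgboost")
    X <- agaricus.train$data
    y <- agaricus.train$label
    p <- rep(0.5, nrow(X))

    L <- which(X[, "odor=none"] == 0)
    R <- which(X[, "odor=none"] == 1)

    GL <- sum(p[L] - y[L]); GR <- sum(p[R] - y[R]); G <- sum(p - y)
    HL <- sum(p[L] * (1 - p[L])); HR <- sum(p[R] * (1 - p[R])); H <- sum(p * (1 - p))

    # Corrected formula: lambda = 1 (the actual default), no 1/2, no gamma
    lambda <- 1
    quality <- GL^2 / (HL + lambda) + GR^2 / (HR + lambda) - G^2 / (H + lambda)
    quality  # should match the reported Quality of ~4000.53 for node 0 of tree 0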

dataShrimp
  • man this has been bugging me a bit.. I'm going to work through it but I'm impressed.. You should take a look at this question since it seems like you are learning xgboost inside out.. It's been vexing me for a while.. http://stackoverflow.com/questions/32950607/how-to-access-weighting-of-indiviual-decision-trees-in-xgboost – T. Scharf Dec 15 '15 at 19:12
  • I could see that the 1/2 factor wasn't applied but should've looked at the defaults in the source code. Nice work! – T. Scharf Dec 15 '15 at 19:17
  • I know I'm late to the party, but could you explain why `p` is a vector of 0.5, and why that is compared to Y? Is it an initial, uninformed guess for Y? – Lil' Pete Dec 16 '22 at 01:19