
We can get the xgboost tree structure with trees_to_dataframe():

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_boston

data = load_boston()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

model = xgb.XGBRegressor(random_state=1,
                         n_estimators=1,  # only one tree
                         max_depth=2,
                         learning_rate=0.1
                         )
model.fit(X, y)

tree_frame = model.get_booster().trees_to_dataframe()
tree_frame

[screenshot of tree_frame: the trees_to_dataframe() output, one row per node with Feature, Split, Gain, Cover, etc.]
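
To look at the reported split gains directly (a minimal sketch; 'Feature', 'Split' and 'Gain' are columns of the trees_to_dataframe() output, and leaf rows have Feature == 'Leaf'):

# keep only the internal (split) nodes and show the gain xgboost reports
split_rows = tree_frame[tree_frame['Feature'] != 'Leaf']
print(split_rows[['Feature', 'Split', 'Gain']])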

According to the SO thread How is xgboost quality calculated?, the gain should be calculated as:

Gain = G_L² / (H_L + λ) + G_R² / (H_R + λ) − (G_L + G_R)² / (H_L + H_R + λ)

(G_L, G_R are the sums of gradients and H_L, H_R the sums of hessians over the left/right child, and λ is reg_lambda; the 1/2 factor and the γ term from the paper are left out here.)
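
Written as a small helper for clarity (the function name and argument names are mine; this is just what the blocks below compute by hand):

def split_gain(GL, HL, GR, HR, reg_lambda=1.0):
    # score of the left child plus score of the right child,
    # minus the score of the unsplit parent
    return (GL**2 / (HL + reg_lambda)
            + GR**2 / (HR + reg_lambda)
            - (GL + GR)**2 / (HL + HR + reg_lambda))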

However, the values I compute with the code below do not match:

def mse_obj(preds, labels):
    # (negative) gradient and hessian of the squared-error objective;
    # the sign of grad does not matter for the gain, since G only appears squared
    grad = labels - preds
    hess = np.ones_like(labels)
    return grad, hess

# per-sample gradient/hessian w.r.t. an initial prediction of y.mean()
Gain, Hessian = mse_obj(y.mean(), y)

# samples going to the left / right child of the root split
L = X[tree_frame['Feature'][0]] < tree_frame['Split'][0]
R = X[tree_frame['Feature'][0]] >= tree_frame['Split'][0]

GL = Gain[L].sum()
GR = Gain[R].sum()
HL = Hessian[L].sum()
HR = Hessian[R].sum()

reg_lambda = 1.0
gain = (GL**2/(HL+reg_lambda)+GR**2/(HR+reg_lambda)-(GL+GR)**2/(HL+HR+reg_lambda))
gain # 18817.811191871013
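
The corresponding value reported by xgboost can be read from the first row of tree_frame (node 0 of the first tree is the root), assuming the usual trees_to_dataframe() layout:

print(tree_frame['Gain'][0])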


# second split: samples that went left at the root, split again at node 1
L = (X[tree_frame['Feature'][0]] < tree_frame['Split'][0]) & (X[tree_frame['Feature'][1]] < tree_frame['Split'][1])
R = (X[tree_frame['Feature'][0]] < tree_frame['Split'][0]) & (X[tree_frame['Feature'][1]] >= tree_frame['Split'][1])

GL = Gain[L].sum()
GR = Gain[R].sum()
HL = Hessian[L].sum()
HR = Hessian[R].sum()

reg_lambda = 1.0
gain = (GL**2/(HL+reg_lambda)+GR**2/(HR+reg_lambda)-(GL+GR)**2/(HL+HR+reg_lambda))
gain # 7841.627971119211


# third split: samples that went right at the root, split again at node 2
L = (X[tree_frame['Feature'][0]] >= tree_frame['Split'][0]) & (X[tree_frame['Feature'][2]] < tree_frame['Split'][2])
R = (X[tree_frame['Feature'][0]] >= tree_frame['Split'][0]) & (X[tree_frame['Feature'][2]] >= tree_frame['Split'][2])

GL = Gain[L].sum()
GR = Gain[R].sum()
HL = Hessian[L].sum()
HR = Hessian[R].sum()

reg_lambda = 1.0
gain = (GL**2/(HL+reg_lambda)+GR**2/(HR+reg_lambda)-(GL+GR)**2/(HL+HR+reg_lambda))
gain # 2634.409414953051

Did I miss something?

  • Please include the source of the `Gain` equation you are posting here, as well as the justification for leaving out the `1/2` coefficient and the subtraction of `γ`. – desertnaut Mar 08 '21 at 09:17
  • I've updated the source in the question – Joey Gao Mar 08 '21 at 10:21
  • Setting `reg_lambda=0` gets much closer results. I haven't been able to find the source of the discrepancy though. There's [this](https://github.com/dmlc/xgboost/blob/750bd0ae9a1b633b4f25d6b1928d44eb08c03c25/src/tree/param.h#L274) which looks wrong, but it's not clear if/when that gets called, since there are many other instances of `CalcGain`, `CalcGainGivenWeight`, etc., and I haven't found where the final split gains get stored. – Ben Reiniger Mar 08 '21 at 16:29

1 Answer


Eventually I found out where I was wrong. The default prediction, defined by base_score, is 0.5, and we should use base_score as the model's prediction before any tree is built when calculating the gradient for each sample.

Gain,Hessian = mse_obj(model.get_params()['base_score'], y)

After this change, the computed gains match the values reported in tree_frame.
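
A quick check, reusing the names from the question (base_score defaults to 0.5; this is only a sketch of the verification):

# recompute the root-split gain using base_score as the initial prediction
base_score = model.get_params()['base_score']   # 0.5 by default
Gain, Hessian = mse_obj(base_score, y)

L = X[tree_frame['Feature'][0]] < tree_frame['Split'][0]
R = ~L

GL, GR = Gain[L].sum(), Gain[R].sum()
HL, HR = Hessian[L].sum(), Hessian[R].sum()

reg_lambda = 1.0
gain = GL**2/(HL+reg_lambda) + GR**2/(HR+reg_lambda) - (GL+GR)**2/(HL+HR+reg_lambda)
print(gain, tree_frame['Gain'][0])   # the two values should now agree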
