
When I use XGBoost to fit a model, it usually prints messages like "updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=5". I wonder how XGBoost performs this tree pruning; I cannot find a description of the pruning process in their paper.

Note: I do understand the general decision-tree pruning process, e.g. pre-pruning and post-pruning. Here I am curious about the actual pruning process of XGBoost. Pruning usually requires validation data, but XGBoost performs pruning even when I do not give it any validation data.
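For reference, a minimal sketch that reproduces those messages (the exact verbosity level that surfaces the updater_prune.cc output varies across XGBoost versions, so "verbosity": 3 is an assumption):

```python
import xgboost as xgb
from sklearn.datasets import make_regression

# Toy data; note that no validation set is passed anywhere.
X, y = make_regression(n_samples=500, n_features=10, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

# Assumption: debug-level verbosity surfaces the pruning messages;
# the required level differs across XGBoost versions.
params = {"max_depth": 5, "verbosity": 3}
bst = xgb.train(params, dtrain, num_boost_round=10)
```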

DiveIntoML

1 Answer


XGBoost grows all trees to max_depth first.

This makes training fast, as you don't have to evaluate all the regularization criteria at each node while the tree is being grown.

After each tree is grown to max_depth, you walk the tree from the bottom up (recursively, all the way to the root) and check whether each split and its children are valid given the hyperparameters you selected, for example whether the split's loss reduction exceeds the gamma (min_split_loss) penalty. Splits or nodes that fail the check are removed from the tree.
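A minimal sketch of that bottom-up pass, assuming the gamma (min_split_loss) criterion; the Node class and prune function are illustrative stand-ins, not XGBoost internals:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    gain: float = 0.0              # loss reduction achieved by this node's split
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def prune(node: Node, gamma: float) -> None:
    """Walk from the leaves up and collapse splits that fail the check."""
    if node.is_leaf:
        return
    # Prune the subtrees first, so that a chain of weak splits can
    # collapse all the way back toward the root.
    prune(node.left, gamma)
    prune(node.right, gamma)
    # If both children are now leaves and the split's loss reduction does
    # not beat the gamma penalty, turn this node back into a leaf.
    if node.left.is_leaf and node.right.is_leaf and node.gain < gamma:
        node.left = node.right = None
```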

In the model dump of an XGBoost model, you can observe that a tree's realized depth is less than max_depth when pruning occurred during training.
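For example, with the native API you can read the realized depth out of the text dump (get_dump() is a Booster method; counting leading tabs as depth assumes the default text dump format):

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
bst = xgb.train({"max_depth": 6}, xgb.DMatrix(X, label=y), num_boost_round=5)

# In the default text dump, a node's depth equals its leading-tab count,
# so the deepest line gives the realized depth of each tree.
for i, tree in enumerate(bst.get_dump()):
    depth = max(line.count("\t") for line in tree.splitlines())
    print(f"tree {i}: realized depth = {depth}")
```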

Pruning requires no validation data. It only asks whether a split, or the resulting child nodes, is valid according to the hyperparameters you set for training.
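One way to see this is to train twice on the same data with no evaluation set and vary only gamma; the value 1000 below is a guess, since the gamma needed to trigger pruning depends on the scale of your targets:

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
dtrain = xgb.DMatrix(X, label=y)  # training data only, no validation set

def n_leaves(booster: xgb.Booster) -> int:
    # Count leaf nodes across all trees in the text dump.
    return sum(tree.count("leaf=") for tree in booster.get_dump())

base = xgb.train({"max_depth": 6, "gamma": 0}, dtrain, num_boost_round=5)
pruned = xgb.train({"max_depth": 6, "gamma": 1000}, dtrain, num_boost_round=5)

# With a sufficiently large gamma, more splits fail the minimum
# loss-reduction check, so the trees end up with fewer leaves.
print(n_leaves(base), n_leaves(pruned))
```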

T. Scharf
  • From the slides in your answer, I think you are mostly right. However, I feel the correct answer (as in the slides) is that it prunes away the nodes whose splits lead to negative gain after regularization, rather than "invalid nodes". Do you think a node that already has fewer samples than the corresponding hyperparameter value will still get split until reaching max_depth? – DiveIntoML Oct 08 '18 at 01:28
  • Yes, the trees are always grown to max depth first, then pruned afterwards. As you noted, you can observe this from the log messages with the verbose flag on during training. – T. Scharf Oct 08 '18 at 15:44
  • Can the user set/unset/tune the pruning? – Helen Jun 21 '22 at 03:41
  • Er, to answer my own mini-question, it seems that pruning is controlled by the gamma hyperparameter. – Helen Jun 21 '22 at 04:04
  • Could you please cite a reference for "XGBoost grows all trees to max_depth first"? I find that this (https://qr.ae/pv4jef) claims the opposite of what you state. – Hamzah Jul 13 '22 at 13:19
  • @Phoenix "Thus, yes, the split is done in a greedy manner too. As you correctly note, XGBoost does expand the tree up to max_depth and start prunes exactly because another negative split might benefit future splits. This is imporant as the minimum loss reduction required parameter is" https://stats.stackexchange.com/a/402742/30432. They have since removed the documentation (slides), but this can be validated experimentally; I will follow up if I can find a better source. – T. Scharf Jul 13 '22 at 19:09