xgboost: handling of missing values for split candidate search

Question

In Section 3.4 of their article, the authors explain how they handle missing values when searching for the best candidate split during tree growing. Specifically, they learn a default direction for each node whose splitting feature has missing values in the current instance set. At prediction time, if the prediction path goes through such a node and the feature value is missing, the default direction is followed.

However, the prediction phase would break down when the feature value is missing and the node does not have a default direction (and this can occur in many scenarios). In other words, how do they associate a default direction with every node, even those whose splitting feature had no missing values in the active instance set at training time?

score 15 · Answer 1 · answered Jun 03 '16 at 20:40

xgboost always assigns a split direction for missing values, even if none are present in training. The default is the "yes" direction of the split criterion. If missing values are present in training, the direction is learned instead.

From the author link

This can be observed by the following code

    require(xgboost)

    data(agaricus.train, package = 'xgboost')

    ## confirm that the training data contains no missing values
    sum(is.na(agaricus.train$data))
    ## [1] 0

    bst <- xgboost(data = agaricus.train$data,
                   label = agaricus.train$label,
                   max.depth = 4,
                   eta = .01,
                   nround = 100,
                   nthread = 2,
                   objective = "binary:logistic")

    dt <- xgb.model.dt.tree(model = bst)  ## records all the splits

    head(dt)
    ##      ID Feature        Split  Yes   No Missing      Quality   Cover Tree Yes.Feature Yes.Cover  Yes.Quality
    ## 1:  0-0      28 -1.00136e-05  0-1  0-2     0-1 4000.5300000 1628.25    0          55    924.50 1158.2100000
    ## 2:  0-1      55 -1.00136e-05  0-3  0-4     0-3 1158.2100000  924.50    0           7    679.75   13.9060000
    ## 3: 0-10    Leaf           NA   NA   NA      NA   -0.0198104  104.50    0          NA        NA           NA
    ## 4: 0-11       7 -1.00136e-05 0-15 0-16    0-15   13.9060000  679.75    0        Leaf    763.00    0.0195026
    ## 5: 0-12      38 -1.00136e-05 0-17 0-18    0-17   28.7763000   10.75    0        Leaf    678.75   -0.0199117
    ## 6: 0-13    Leaf           NA   NA   NA      NA    0.0195026  763.00    0          NA        NA           NA
    ##    No.Feature No.Cover No.Quality
    ## 1:       Leaf   104.50 -0.0198104
    ## 2:         38    10.75 28.7763000
    ## 3:         NA       NA         NA
    ## 4:       Leaf     9.50 -0.0180952
    ## 5:       Leaf     1.00  0.0100000
    ## 6:         NA       NA         NA

    ## every split node sends missing values to its "Yes" child
    all(dt$Missing == dt$Yes, na.rm = TRUE)
    ## [1] TRUE

Source code: https://github.com/tqchen/xgboost/blob/8130778742cbdfa406b62de85b0c4e80b9788821/src/tree/model.h#L542
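
To make the prediction-time behaviour concrete, here is a minimal sketch in base R. It is not xgboost's own code: `toy_tree`, `predict_leaf` and the feature name `f28` are made up for illustration, and the table only mimics the ID / Feature / Split / Yes / No / Missing / Quality columns of the `xgb.model.dt.tree` output above. The point is that, because every split node has a `Missing` child, an observation with a missing value never gets stuck.

```r
# Hypothetical toy tree laid out like the xgb.model.dt.tree() output.
# The single split node carries a Missing child, here equal to its Yes child.
toy_tree <- data.frame(
  ID      = c("0-0", "0-1", "0-2"),
  Feature = c("f28", "Leaf", "Leaf"),
  Split   = c(0.5, NA, NA),
  Yes     = c("0-1", NA, NA),
  No      = c("0-2", NA, NA),
  Missing = c("0-1", NA, NA),
  Quality = c(10, -0.02, 0.03),  # for leaf rows, Quality holds the leaf value
  stringsAsFactors = FALSE
)

# Walk one observation (a named numeric vector) from the root to a leaf.
predict_leaf <- function(tree, x, root = "0-0") {
  node <- tree[tree$ID == root, ]
  while (node$Feature != "Leaf") {
    value <- x[[node$Feature]]
    next_id <- if (is.na(value)) {
      node$Missing            # missing value: follow the default direction
    } else if (value < node$Split) {
      node$Yes                # split condition "feature < Split" holds
    } else {
      node$No
    }
    node <- tree[tree$ID == next_id, ]
  }
  node$Quality
}

leaf_yes     <- predict_leaf(toy_tree, c(f28 = 0.2))
leaf_no      <- predict_leaf(toy_tree, c(f28 = 0.9))
leaf_missing <- predict_leaf(toy_tree, c(f28 = NA_real_))
```

In this toy tree the observation with `NA` ends up in the same leaf as the one that satisfies the split condition, which is what `all(dt$Missing == dt$Yes, na.rm = TRUE)` suggests for a model trained without missing values.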

Please note that the same author has also made a contradictory comment here: https://github.com/dmlc/xgboost/issues/21#issuecomment-51982962. So essentially the direction with the maximum gain is chosen. Curiously, both comments were made around the same time. — abhiieor, Mar 02 '18 at 10:22
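
As a rough illustration of "choose the direction where the gain is maximum", the sketch below evaluates one fixed threshold twice, once with the missing rows sent left and once with them sent right, using the structure score G^2 / (H + lambda) from the article. This is only a sketch of the idea, not xgboost's implementation: `score`, `default_direction`, the toy gradients and the tie-breaking rule (fall back to "left") are all assumptions made for the example.

```r
# Structure score of a node from summed gradients G and hessians H.
# lambda is the L2 regularisation term; its value here is arbitrary.
score <- function(G, H, lambda = 1) G^2 / (H + lambda)

# For one feature and one threshold, compare both placements of the missing rows.
default_direction <- function(x, g, h, threshold, lambda = 1) {
  miss   <- is.na(x)
  left   <- !miss & x < threshold
  right  <- !miss & x >= threshold
  parent <- score(sum(g), sum(h), lambda)

  # Missing rows join the left child.
  gain_left <- score(sum(g[left | miss]), sum(h[left | miss]), lambda) +
    score(sum(g[right]), sum(h[right]), lambda) - parent
  # Missing rows join the right child.
  gain_right <- score(sum(g[left]), sum(h[left]), lambda) +
    score(sum(g[right | miss]), sum(h[right | miss]), lambda) - parent

  # Assumption of this sketch: ties (including "no missing rows") go left.
  list(direction  = if (gain_right > gain_left) "right" else "left",
       gain_left  = gain_left,
       gain_right = gain_right)
}

# Toy gradients: the two missing rows look like the rows on the right side.
x <- c(0.1, 0.2, 0.8, 0.9, NA, NA)
g <- c(-1, -1, 1, 1, 1, 1)
h <- rep(1, 6)
with_missing <- default_direction(x, g, h, threshold = 0.5)

# Same data without the missing rows: both gains coincide.
without_missing <- default_direction(x[1:4], g[1:4], h[1:4], threshold = 0.5)
```

When no rows are missing, the two gains are identical, so the gain criterion alone cannot pick a side and some fixed fallback is needed. That may be how the two comments fit together, but this is a guess and not something the source confirms.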

score 2 · Answer 2 · answered Jun 03 '16 at 15:49

My understanding of the algorithm is that, if no missing data is available at training time, a default direction is assigned probabilistically based on the distribution of the training data, i.e., it just goes in the direction with the majority of samples in the training set. In practice, I'd say it's a bad idea to have missing data in your data set. Generally, the model will perform better if the data scientist cleans up the data set in a smart way before training the GBM algorithm. For example, replace all NA values with the mean or median, or impute each value by finding the K nearest neighbors and averaging their values for that feature.

I'm also wondering why data would be missing at test time and not at training time. That seems to imply that the distribution of your data is evolving over time. An algorithm that can be trained as new data becomes available, like a neural net, may do better in your use case. Or you could always build a specialist model. For example, let's say the missing feature in your model is credit score, because some people may not give you permission to access their credit. Why not train one model using credit and one not using credit? The model trained without credit may be able to recover much of the lift that credit was providing by using other correlated features.
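
A minimal sketch of the median replacement mentioned above, in base R. The data frame and the `credit_score` column are made up for illustration, and the extra indicator column is an optional addition, not something xgboost requires.

```r
# Hypothetical toy data: one numeric feature with missing entries.
df <- data.frame(credit_score = c(700, NA, 650, 720, NA, 680))

# Median computed from the observed values only.
med <- median(df$credit_score, na.rm = TRUE)

# Keep an indicator so the model can still see which rows were imputed.
df$credit_score_missing <- is.na(df$credit_score)

# Replace the missing entries with the median.
df$credit_score[df$credit_score_missing] <- med
```

The median should be computed on the training set and reused unchanged at test time, so that both sets are imputed with the same value.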

pmarini · Answer 3 · 2016-06-04T17:56:21.480

Thank you for sharing your thoughts @Josiah. Yes, I totally agree with you that it is better to avoid missing data in the dataset, but sometimes replacing it is not the optimal solution. In addition, if we have a learning algorithm such as GBM that can cope with missing values, why not give it a try? The scenario I'm thinking about is when you have some features with few missing values (<10%, or even fewer).

Regarding the second point, the scenario I have in mind is the following: the tree has already been grown to some depth, so the instance set is no longer the full one. For a new node, the best candidate split is found on a feature f that originally contains some missing values, but none in the current instance set, so no default branch is defined. So even though f contains some missing values in the training dataset, this node doesn't have a default branch. A test instance with a missing value for f that reaches this node would be stuck.

Maybe you are right and, if no missing values are present, the default branch is the one with more examples. I'll try to reach out to the authors and post the reply here, if any.

I know it's been a while since you wrote this answer, but I was wondering if you had any luck establishing what is it that xgboost does during prediction, when there were no missing values in the training dataset? — ponadto, Feb 08 '17 at 07:43
Hello, please have a look at the answer below from T. Scharf — pmarini, Jun 28 '17 at 19:40
