
The command `xgb.importance` returns a graph of feature importance measured by an F score.

What does this F score represent, and how is it calculated?

Output: [graph of feature importance]

  • The question is language-neutral, so I'm tagging it [tag:r] and [tag:python] since those are the top two languages among xgboost users. – smci Apr 26 '17 at 00:06

2 Answers


This is a metric that simply counts how many times each feature is split on. It is analogous to the Frequency metric in the R version (https://cran.r-project.org/web/packages/xgboost/xgboost.pdf).

It is about as basic a feature-importance metric as you can get: how many times was this variable split on?
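For intuition, here is a minimal, self-contained sketch of that counting idea, run against a hand-written tree dump in xgboost's text format (the dump strings and feature names below are illustrative, not real model output):

# Two tiny hand-written trees in xgboost's text-dump format.
trees = [
    "0:[f0<2.45] yes=1,no=2\n\t1:leaf=0.4\n\t2:[f1<1.75] yes=3,no=4\n"
    "\t\t3:leaf=-0.1\n\t\t4:leaf=0.3\n",
    "0:[f1<1.65] yes=1,no=2\n\t1:leaf=0.2\n\t2:leaf=-0.2\n",
]

counts = {}
for tree in trees:
    for line in tree.split('\n'):
        if '[' not in line:                   # leaf lines carry no split condition
            continue
        fid = line.split('[')[1].split(']')[0].split('<')[0]
        counts[fid] = counts.get(fid, 0) + 1  # one more split on this feature

print(counts)  # {'f0': 1, 'f1': 2} -- f1 was split on twice, f0 once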

The code for this method shows that it simply counts the occurrences of each feature across all the trees.

See [here](https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L953):

def get_fscore(self, fmap=''):
    """Get feature importance of each feature.
    Parameters
    ----------
    fmap: str (optional)
       The name of feature map file
    """
    trees = self.get_dump(fmap)           # dump all the trees to text
    fmap = {}                             # note: reuses the parameter name for the counts dict
    for tree in trees:                    # loop through the trees
        for line in tree.split('\n'):     # one node of the dump per line
            arr = line.split('[')
            if len(arr) == 1:             # leaf lines carry no '[...]' split condition
                continue
            fid = arr[1].split(']')[0]    # extract the split condition, e.g. 'f2<2.45'
            fid = fid.split('<')[0]       # keep only the feature name before '<'

            if fid not in fmap:           # if the feature id hasn't been seen yet
                fmap[fid] = 1             # start its count at 1
            else:
                fmap[fid] += 1            # else increment it
    return fmap                           # counts of how many times each feature was split on
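To see it end to end, here is a minimal usage sketch (assumes recent xgboost and scikit-learn installs; the dataset, parameters, and printed values are illustrative, not canonical):

import xgboost as xgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'multi:softmax', 'num_class': 3, 'max_depth': 3}
bst = xgb.train(params, dtrain, num_boost_round=10)

print(bst.get_dump()[0])  # first tree as text: lines like '0:[f2<2.45] yes=1,no=2,...'
print(bst.get_fscore())   # e.g. {'f2': 34, 'f3': 21, ...} -- split counts per feature

# xgb.plot_importance(bst) draws these same counts as the bar chart in the question.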
– T. Scharf
  • hi, thank you for your answer. I'm having trouble understanding the source code. Could you explain what exactly is happening in that function? – ishido Dec 15 '15 at 09:51
  • I've actually kind of understood it now. I went into the core file and had the line variable print when using xgb.plot_importance. So it splits each line to extract only the feature names and counts the number of times each was split on? – ishido Dec 15 '15 at 12:22
  • @ishido you got it.. added some comments.. Without seeing the text dump of the trees it's hard to say exactly what all the string operations are doing, but the larger scheme is clear, I hope – T. Scharf Dec 15 '15 at 15:07
  • 1
    FYI: It's moved now, and does more - https://github.com/dmlc/xgboost/blob/b4f952b/python-package/xgboost/core.py#L1639-L1661 - Recommend using a commit-hash rather than `master` next time… – A T Jan 24 '20 at 10:13

I found this answer correct and thorough; it shows how the feature importances are implemented:

https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
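As that thread explains, split counts ("weight") are only one notion of importance; gain-based importance instead weights each feature's splits by their loss reduction. A minimal sketch, assuming a recent xgboost where Booster.get_score accepts an importance_type argument (the relocated code mentioned in the comment above):

import xgboost as xgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
bst = xgb.train({'objective': 'multi:softmax', 'num_class': 3},
                xgb.DMatrix(X, label=y), num_boost_round=10)

# 'weight' is the split-count F score discussed above; 'gain' and 'cover'
# average each feature's loss reduction / number of covered samples per split.
for imp_type in ('weight', 'gain', 'cover'):
    print(imp_type, bst.get_score(importance_type=imp_type))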

– aerin