
The command `xgb.importance` returns a graph of feature importance measured by an F score.

What does this F score represent, and how is it calculated?

Output: [graph of feature importance]

  • The question is language-neutral, so I'm tagging it [tag:r] and [tag:python] since those are the top two languages among xgboost users. – smci Apr 26 '17 at 00:06

2 Answers


This is a metric that simply counts how many times each feature is split on. It is analogous to the Frequency metric in the R version (https://cran.r-project.org/web/packages/xgboost/xgboost.pdf).

It is about as basic a feature-importance metric as you can get: how many times was this variable split on?
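For intuition, here is a minimal, self-contained sketch of that counting idea, run against a hand-written tree dump in xgboost's text format (the dump strings and feature names below are illustrative, not real model output):

# Two tiny hand-written trees in xgboost's text-dump format.
trees = [
    "0:[f0<2.45] yes=1,no=2\n\t1:leaf=0.4\n\t2:[f1<1.75] yes=3,no=4\n"
    "\t\t3:leaf=-0.1\n\t\t4:leaf=0.3\n",
    "0:[f1<1.65] yes=1,no=2\n\t1:leaf=0.2\n\t2:leaf=-0.2\n",
]

counts = {}
for tree in trees:
    for line in tree.split('\n'):
        if '[' not in line:                   # leaf lines carry no split condition
            continue
        fid = line.split('[')[1].split(']')[0].split('<')[0]
        counts[fid] = counts.get(fid, 0) + 1  # one more split on this feature

print(counts)  # {'f0': 1, 'f1': 2} -- f1 was split on twice, f0 once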

The code for this method shows that it simply counts the occurrences of each feature across all the trees.

See [here](https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L953):

def get_fscore(self, fmap=''):
    """Get feature importance of each feature.
    Parameters
    ----------
    fmap: str (optional)
       The name of feature map file
    """
    trees = self.get_dump(fmap)           # dump all the trees to text
    fmap = {}                             # note: reuses the parameter name for the counts dict
    for tree in trees:                    # loop through the trees
        for line in tree.split('\n'):     # one node of the dump per line
            arr = line.split('[')
            if len(arr) == 1:             # leaf lines carry no '[...]' split condition
                continue
            fid = arr[1].split(']')[0]    # extract the split condition, e.g. 'f2<2.45'
            fid = fid.split('<')[0]       # keep only the feature name before '<'

            if fid not in fmap:           # if the feature id hasn't been seen yet
                fmap[fid] = 1             # start its count at 1
            else:
                fmap[fid] += 1            # else increment it
    return fmap                           # counts of how many times each feature was split on
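To see it end to end, here is a minimal usage sketch (assumes recent xgboost and scikit-learn installs; the dataset, parameters, and printed values are illustrative, not canonical):

import xgboost as xgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'multi:softmax', 'num_class': 3, 'max_depth': 3}
bst = xgb.train(params, dtrain, num_boost_round=10)

print(bst.get_dump()[0])  # first tree as text: lines like '0:[f2<2.45] yes=1,no=2,...'
print(bst.get_fscore())   # e.g. {'f2': 34, 'f3': 21, ...} -- split counts per feature

# xgb.plot_importance(bst) draws these same counts as the bar chart in the question.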
– T. Scharf
  • hi, thank you for your answer. I'm having trouble understanding the source code. Could you explain what exactly is happening in that function? – ishido Dec 15 '15 at 09:51
  • I've actually kind of understood it now. I went into the core file and had the line variable print when using xgb.plot_importance. So it splits each line to extract only the feature names and counts the number of times each was split on? – ishido Dec 15 '15 at 12:22
  • @ishido you got it.. added some comments.. Without seeing the text dump of the trees it's hard to say exactly what all the string operations are doing, but the larger scheme is clear, I hope – T. Scharf Dec 15 '15 at 15:07
  • 1
    FYI: It's moved now, and does more - https://github.com/dmlc/xgboost/blob/b4f952b/python-package/xgboost/core.py#L1639-L1661 - Recommend using a commit-hash rather than `master` next time… – A T Jan 24 '20 at 10:13

I found this answer correct and thorough; it shows how the feature importances are implemented:

https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
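As that thread explains, split counts ("weight") are only one notion of importance; gain-based importance instead weights each feature's splits by their loss reduction. A minimal sketch, assuming a recent xgboost where Booster.get_score accepts an importance_type argument (the relocated code mentioned in the comment above):

import xgboost as xgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
bst = xgb.train({'objective': 'multi:softmax', 'num_class': 3},
                xgb.DMatrix(X, label=y), num_boost_round=10)

# 'weight' is the split-count F score discussed above; 'gain' and 'cover'
# average each feature's loss reduction / number of covered samples per split.
for imp_type in ('weight', 'gain', 'cover'):
    print(imp_type, bst.get_score(importance_type=imp_type))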

– aerin