I am getting the error below when I try to run the following code.

**Code**

    importance = bst.get_fscore(fmap='xgb.fmap')
    importance = sorted(importance.items(), key=operator.itemgetter(1))

**Error**

    File "scripts/xgboost_bnp.py", line 225, in <module>
      importance = bst.get_fscore(fmap='xgb.fmap')
    File "/usr/lib/python2.7/site-packages/xgboost/core.py", line 754, in get_fscore
      trees = self.get_dump(fmap)
    File "/usr/lib/python2.7/site-packages/xgboost/core.py", line 740, in get_dump
      ctypes.byref(sarr)))
    File "/usr/lib/python2.7/site-packages/xgboost/core.py", line 92, in _check_call
      raise XGBoostError(_LIB.XGBGetLastError())
    xgboost.core.XGBoostError: can not open file "xgb.fmap"
Gagan

1 Answer

The error is raised because you are calling `get_fscore` with the optional parameter `fmap`, which tells XGBoost to resolve feature names from a feature map file called `xgb.fmap`, and that file does not exist in your file system.
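If you do want to pass `fmap`, you have to create that file yourself first. A minimal sketch (the feature names below are placeholders for your own columns) that writes one line per feature in the tab-separated `<index> <name> <type>` format XGBoost expects, with type `q` for quantitative, `i` for binary indicator, or `int` for integer features:

```python
def create_feature_map(feature_names, fmap_filename='xgb.fmap'):
    # Each line: <feature index>\t<feature name>\t<feature type>
    # 'q' marks a quantitative (continuous) feature.
    with open(fmap_filename, 'w') as f:
        for i, name in enumerate(feature_names):
            f.write('{0}\t{1}\tq\n'.format(i, name))

# Placeholder feature names; use the columns of your training data here
create_feature_map(['age', 'income', 'n_clicks'], 'xgb.fmap')
```

With `xgb.fmap` in place, your original `bst.get_fscore(fmap='xgb.fmap')` call should no longer fail. Alternatively, you can skip the file entirely, as in the function below.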

Here is a function returning sorted feature names and their importances:

    import xgboost as xgb
    import pandas as pd

    def get_xgb_feat_importances(clf):

        if isinstance(clf, xgb.XGBModel):
            # clf has been created by calling
            # xgb.XGBClassifier().fit() or xgb.XGBRegressor().fit()
            fscore = clf.get_booster().get_fscore()
        else:
            # clf has been created by calling xgb.train.
            # Thus, clf is an instance of xgb.Booster.
            fscore = clf.get_fscore()

        feat_importances = []
        for ft, score in fscore.items():
            feat_importances.append({'Feature': ft, 'Importance': score})
        feat_importances = pd.DataFrame(feat_importances)
        feat_importances = feat_importances.sort_values(
            by='Importance', ascending=False).reset_index(drop=True)
        # Divide the importances by the sum of all importances
        # to get relative importances. By using relative importances
        # the sum of all importances will equal to 1, i.e.,
        # np.sum(feat_importances['Importance']) == 1
        feat_importances['Importance'] /= feat_importances['Importance'].sum()
        # Print the most important features and their importances
        print(feat_importances.head())
        return feat_importances
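The sorting and normalization steps can be checked on a toy `fscore` dict (the feature names and split counts below are made up), without training a model:

```python
import pandas as pd

# Hypothetical raw split counts, shaped like the dict get_fscore() returns
fscore = {'f0': 50, 'f1': 30, 'f2': 20}

feat_importances = pd.DataFrame(
    [{'Feature': ft, 'Importance': score} for ft, score in fscore.items()])
feat_importances = feat_importances.sort_values(
    by='Importance', ascending=False).reset_index(drop=True)
# Normalize so the relative importances sum to 1
feat_importances['Importance'] /= feat_importances['Importance'].sum()

print(feat_importances)
```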
tuomastik
  • thanks for your answer, but this solution doesn't show the original feature names; it just returns `fxx` for each feature. Do you know how to map the real feature names to the importance scores? – LancelotHolmes May 03 '17 at 11:59
  • 1
    I guess your training data is stored in a NumPy array? Try training the model using Pandas DataFrame (with appropriate feature names set as column names) and run the above function again. If I remember correctly, XGBoost will pick up the feature names from the column names of the Pandas DataFrame. – tuomastik May 03 '17 at 15:02
  • 2
    Or if you're defining the training data via `xgboost.DMatrix()`, you can define the feature names via its `feature_names` argument. – tuomastik May 03 '17 at 15:05
  • thanks again, you're right, I didn't set the `feature_names` argument in `xgboost.DMatrix()`, and your solution works well. I changed it to write the feature importances to a file after training, to do feature selection – LancelotHolmes May 04 '17 at 10:56
  • @tuomastik you are right, if your train data is in pd.DataFrame format, XGBoost will pick up the feature names – Statham Aug 29 '17 at 07:05
  • @tuomastik feature importance of each feature should be fetched from .fmap file or feature names should be fetched from .fmap file? – Eugene Bragin Jan 24 '21 at 12:47
  • @EugeneBragin The `.fmap` file contains both feature names and their importances. For more information, see [this](https://stackoverflow.com/a/34232986/5524090) – tuomastik Jan 24 '21 at 13:50