68

I'm using xgboost to build a model and trying to find the importance of each feature using get_fscore(), but it returns {}.

My training code is:

import xgboost as xgb

dtrain = xgb.DMatrix(X, label=Y)  # X: feature matrix, Y: labels
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)

Is there any mistake in my training code? How do I get feature importance in xgboost?

modkzs
  • http://stackoverflow.com/questions/38212649/feature-importance-with-xgbclassifier – Graydyn Young Jul 19 '16 at 17:32
  • Check this [function](https://stackoverflow.com/questions/38212649/feature-importance-with-xgbclassifier/49982926#49982926) for getting a xgboost feature importance data frame. – Ioannis Nasios Apr 24 '18 at 11:24
  • You need to name the features first. For example, `bst.feature_names=['foo', 'bar', ...]`. – John Ao Oct 06 '22 at 06:41

11 Answers

60

In your code, you can get the importance of each feature as a dict:

bst.get_score(importance_type='gain')

>>{'ftr_col1': 77.21064539577829,
   'ftr_col2': 10.28690566363971,
   'ftr_col3': 24.225014841466294,
   'ftr_col4': 11.234086283060112}

Explanation: The train() API's method get_score() is defined as:

get_score(fmap='', importance_type='weight')

  • fmap (str (optional)) – The name of feature map file.
  • importance_type
    • ‘weight’ - the number of times a feature is used to split the data across all trees.
    • ‘gain’ - the average gain across all splits the feature is used in.
    • ‘cover’ - the average coverage across all splits the feature is used in.
    • ‘total_gain’ - the total gain across all splits the feature is used in.
    • ‘total_cover’ - the total coverage across all splits the feature is used in.

https://xgboost.readthedocs.io/en/latest/python/python_api.html
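For example, a minimal sketch (assuming `bst` is the trained `Booster` from the question) that prints the scores for each of the importance types listed above:

# compare the available importance types on the trained booster
for imp_type in ['weight', 'gain', 'cover', 'total_gain', 'total_cover']:
    print(imp_type, bst.get_score(importance_type=imp_type))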

MLKing
  • Why do I get the following error: AttributeError: 'XGBClassifier' object has no attribute 'get_score'? @MLKing – arash Oct 13 '22 at 14:44
  • @arash You need to use `bst.get_booster().get_score(importance_type='gain')` instead – savagedata Feb 15 '23 at 22:31
48

Get the table containing scores and feature names, and then plot it.

import pandas as pd

feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize = (20,10)) ## plot top 40 features

For example: (horizontal bar chart of the top 40 features ranked by score)

Chau Pham
33

Using the sklearn API and XGBoost >= 0.81:

clf.get_booster().get_score(importance_type="gain")

or

regr.get_booster().get_score(importance_type="gain")

For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
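A minimal end-to-end sketch of this, where the toy DataFrame and the column names (`age`, `income`) are made up for illustration:

import pandas as pd
from xgboost import XGBClassifier

# The DataFrame column names become the booster's feature names.
X = pd.DataFrame({'age': [23, 45, 31, 52, 28, 60],
                  'income': [40, 80, 55, 90, 45, 120]})
y = [0, 1, 0, 1, 0, 1]

clf = XGBClassifier(n_estimators=20)
clf.fit(X, y)

# The keys of the returned dict are the DataFrame column names.
print(clf.get_booster().get_score(importance_type="gain"))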

Sesquipedalism
  • For some reason xgboost seems to have broken the model.feature_importances_, so that is what I was looking for. Thank you. – rhedak Apr 16 '19 at 03:46
  • My experience is that "X passed to `.fit` must be a pandas.DataFrame" is still true as of 0.9; otherwise you get an empty dict. – alexandre iolov Sep 13 '19 at 09:45
16

Build the model with XGBoost first:

from xgboost import XGBClassifier, plot_importance
import numpy as np
from matplotlib import pyplot

model = XGBClassifier()
model.fit(train, label)

model.feature_importances_ is an array, so we can sort the indices in descending order:

sorted_idx = np.argsort(model.feature_importances_)[::-1]

Then print the sorted importances together with the column names (assuming the data was loaded with Pandas):

for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]]) 

Furthermore, we can plot the importances with XGBoost's built-in function:

plot_importance(model, max_num_features = 15)
pyplot.show()

Use max_num_features in plot_importance to limit the number of features shown, if you want.

Steven Hu
  • plot_importance() should be called as plot_importance(model, importance_type='gain'), otherwise the results differ from those of the 'sorted_idx' method above. The default importance_type for plot_importance is 'weight'. This is for xgboost version 1.5.0. – Ashok K Harnal Mar 09 '23 at 08:37
12

According to this post, there are 3 different ways to get feature importance from XGBoost:

  • use built-in feature importance,
  • use permutation based importance,
  • use shap based importance.

Built-in feature importance

Code example:

from xgboost import XGBRegressor
import matplotlib.pyplot as plt

# X_train, y_train and boston.feature_names come from the Boston housing data (see below)
xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
sorted_idx = xgb.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], xgb.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")

Please be aware of what type of feature importance you are using. There are several types of importance; see the docs. The scikit-learn-like API of XGBoost returns gain importance, while get_fscore returns the weight type.
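For instance, on the fitted `xgb` regressor above, the two can be compared directly (a small sketch):

# sklearn-style attribute vs. the raw split counts from the booster
print(xgb.feature_importances_)         # gain-based (normalized) importances
print(xgb.get_booster().get_fscore())   # 'weight': number of splits per feature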

Permutation based importance

from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(xgb, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")

This is my preferred way to compute the importance. However, it can fail in the case of highly collinear features, so be careful! It uses permutation_importance from scikit-learn.

SHAP based importance

import shap

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

To use the above code, you need to have the shap package installed.

I was running the example analysis on the Boston data (house price regression from scikit-learn). Below are the 3 feature importance plots:

Built-in importance: (bar chart of the built-in xgboost importances)

Permutation based importance: (bar chart of the permutation importances)

SHAP importance: (SHAP summary bar plot)

All plots are for the same model! As you see, there is a difference in the results. I prefer permutation-based importance because I have a clear picture of which feature impacts the performance of the model (if there is no high collinearity).

pplonski
11

For feature importance, try this:

Classification:

pd.DataFrame(bst.get_fscore().items(), columns=['feature','importance']).sort_values('importance', ascending=False)

Regression:

xgb.plot_importance(bst)
Roozbeh
  • Neither of these solutions currently works. For some reason the model loses the feature names and returns an empty dict. – BCR Feb 17 '17 at 02:07
  • Is it a model you just trained or are you loading a pickled model? – Roozbeh Feb 18 '17 at 04:59
9

For anyone who comes across this issue while using xgb.XGBRegressor(): the workaround I'm using is to keep the data in a pandas.DataFrame() or numpy.array() and not to convert it to a DMatrix. Also, I had to make sure the gamma parameter is not specified for the XGBRegressor.

import pandas as pd
# alg is an xgb.XGBRegressor; dtrain here is a pandas DataFrame, not a DMatrix
fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)

After fitting the regressor fit.feature_importances_ returns an array of weights which I'm assuming is in the same order as the feature columns of the pandas dataframe.

My current setup is Ubuntu 16.04, Anaconda distro, Python 3.6, xgboost 0.6, and scikit-learn 0.18.1.

BCR
9

I don't know how to get the values directly, but there is a good way to plot feature importance:

import xgboost as xgb
import matplotlib.pyplot as plt

model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
Kirill Dolmatov
7

Try this

fscore = clf.best_estimator_.booster().get_fscore()
# on newer xgboost versions: clf.best_estimator_.get_booster().get_fscore()
Alexander Farber
koalagreener
  • not sure if this is applicable for regression but this does not work either as the `clf` doesn't have a `best_estimator_` attribute and the `get_fscore()` returns an empty dict. – BCR Feb 17 '17 at 02:08
  • It's for the XGBClassifier – koalagreener Feb 17 '17 at 14:47
  • `AttributeError: 'XGBClassifier' object has no attribute 'best_estimator_'` Something is wrong here. – blkpingu Jul 01 '19 at 16:21
  • `best_estimator_` is required only if you are using something like `GridSearchCV` for parameter tuning. If you are using xgboost without this, you should just do `clf.booster().get_fscore()` – Niyaz Jul 30 '20 at 17:04
3

In case you are using XGBRegressor, try with: model.get_booster().get_score().

That returns results that you can then visualize directly with the plot_importance command.
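A small sketch of this, assuming `model` is an already fitted `xgb.XGBRegressor`:

import xgboost as xgb
import matplotlib.pyplot as plt

scores = model.get_booster().get_score()  # default importance_type is 'weight'
print(scores)

# The same model can be handed straight to plot_importance for a bar chart.
xgb.plot_importance(model)
plt.show()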

sentence
  • I am using XGBClassifier, however this is the only code that returns value for the features, I am wondering why! – arash Oct 13 '22 at 14:46
1

None of the above worked for me; this is the code I ended up with to sort features by importance.

from collections import Counter

Counter({k: v for k, v in sorted(model.get_fscore().items(), key=lambda item: item[1], reverse=True)}).most_common()

Just replace model with the name of your model and everything will be there. Of course I'm doing the same thing twice; there's no need to sort the dict before passing it to Counter, but I figured it wouldn't hurt to leave it there in case anyone hates Counters.

Wilmer E. Henao