
Using two different methods of getting feature importance from XGBoost gives me two different most important features. Which one should be believed?

Which method should be used when? I am confused.

Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns
import xgboost as xgb

df = sns.load_dataset('mpg')
df = df.drop(['name', 'origin'], axis=1)

X = df.iloc[:, 1:]   # features: cylinders ... model_year
y = df.iloc[:, 0]    # target: mpg

Numpy arrays

# fit the model on plain numpy arrays (feature names are lost)
model_xgb_numpy = xgb.XGBRegressor(n_jobs=-1, objective='reg:squarederror')
model_xgb_numpy.fit(X.to_numpy(), y.to_numpy())

plt.bar(range(len(model_xgb_numpy.feature_importances_)),
        model_xgb_numpy.feature_importances_)

[Figure: bar chart of feature_importances_ by column index; the 0th bar (cylinders) is the largest]

Pandas dataframe

# fit the model on the pandas dataframe (feature names are kept)
model_xgb_pandas = xgb.XGBRegressor(n_jobs=-1, objective='reg:squarederror')
model_xgb_pandas.fit(X, y)
axsub = xgb.plot_importance(model_xgb_pandas)

[Figure: xgb.plot_importance bar chart; model_year has the highest score]

Problem

The numpy method shows that the 0th feature, cylinders, is the most important. The pandas method shows that model_year is the most important. Which one is the CORRECT most important feature?

— BhishanPoudel

3 Answers

Answer 1 (score: 6)

There are three ways to get feature importance from XGBoost:

  • use the built-in feature importance (I prefer the gain type),
  • use permutation-based feature importance,
  • use SHAP values to compute feature importance.

In my post I wrote code examples for all three methods. Personally, I use permutation-based feature importance. In my opinion, the built-in feature importance can mark features as important after the model has overfit to the data (this is just an opinion based on my experience). SHAP explanations are fantastic, but computing them can be time-consuming (and you may need to downsample your data).
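
For illustration, here is a minimal sketch of the latter two methods, reusing X, y and the fitted model_xgb_pandas from the question's setup (it assumes scikit-learn and the shap package are available):

from sklearn.inspection import permutation_importance
import shap

# permutation importance: shuffle one column at a time and measure the
# drop in score (ideally computed on a held-out validation set)
result = permutation_importance(model_xgb_pandas, X, y, n_repeats=10, random_state=0)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.4f}")

# SHAP importance: mean absolute SHAP value per feature, shown as a bar chart
explainer = shap.TreeExplainer(model_xgb_pandas)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")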

— pplonski
Answer 2 (score: 4)

It is hard to define THE correct feature importance measure. Each has pros and cons. It is a wide topic with no golden rule as of now, and I would personally suggest reading this online book by Christoph Molnar: https://christophm.github.io/interpretable-ml-book/. The book gives an excellent overview of the different measures and algorithms.

As a rule of thumb, if you cannot use an external package, I would choose gain, as it is more representative of what one is actually interested in (one is typically not interested in the raw number of splits on a particular feature, but rather in how much those splits helped); see this question for a good summary: https://datascience.stackexchange.com/q/12318/53060. If you can use other tools, SHAP exhibits very good behaviour, and I would always choose it over the built-in XGBoost tree measures, unless computation time is strongly constrained.

As for the difference you pointed at directly in your question, the root of it is that xgb.plot_importance uses weight as the default importance type, while XGBModel itself uses gain as the default. If you configure them to use the same importance type, you will get similar distributions (up to the additional normalisation in feature_importances_ and the sorting in plot_importance).
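
As an illustration, here is a minimal sketch that aligns the two (reusing X and y from the question's setup; gain is chosen here, but weight would work equally well):

# make both APIs report the same importance type
model = xgb.XGBRegressor(n_jobs=-1, objective='reg:squarederror', importance_type='gain')
model.fit(X, y)

print(model.feature_importances_)                    # gain, normalised to sum to 1
xgb.plot_importance(model, importance_type='gain')   # gain, sorted, unnormalised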

— Mischa Lisovyi
Answer 3 (score: 2)

From the answer here, which gives a neat explanation:

feature_importances_ returns weights - what we usually think of as "importance".

plot_importance returns the number of occurrences in splits.
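
To see the raw numbers behind both, you can query the underlying Booster directly; a quick sketch, reusing model_xgb_pandas from the question:

booster = model_xgb_pandas.get_booster()
print(booster.get_score(importance_type='weight'))   # number of splits per feature
print(booster.get_score(importance_type='gain'))     # average gain of those splits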

Note: I think that the selected answer above does not actually cover the point.

— Helen