24

Preprocessing the training data (such as centering or scaling) before training an XGBoost model can lead to a loss of feature names. Most answers on SO suggest training the model in a way that preserves the feature names (such as using pd.get_dummies on DataFrame columns).

I have trained an XGBoost model on data that was preprocessed (centered and scaled with MinMaxScaler), so I am in a similar situation where the feature names are lost.

For instance:

    scaler = MinMaxScaler(feature_range=(0, 1))
    X = scaler.fit_transform(X)
    my_model_name = XGBClassifier()
    my_model_name.fit(X, Y)

where X and Y are the training data and labels respectively. The scaling above returns a 2D NumPy array, discarding the feature names from the pandas DataFrame.

Thus, when I try to use plot_importance(my_model_name), it produces the feature importance plot, but only with generic feature names such as f0, f1, f2, etc., and not the actual feature names from the original data set. Is there a way to map the original feature names onto the generated feature importance plot, so that the original names appear in the graph? Any help in this regard is highly appreciated.

mirekphd
  • Maybe this can help https://stackoverflow.com/questions/44511636/matplotlib-plot-feature-importance-with-feature-names – JChat Mar 01 '19 at 23:18

4 Answers

42

You can get the feature names with:

    model.get_booster().feature_names
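A minimal sketch of how this typically looks in practice, assuming the model was fit on a pandas DataFrame so the names were retained (the data and column names below are made up):

    import pandas as pd
    from xgboost import XGBClassifier

    # Hypothetical training data with named columns
    X = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40, 55, 80, 62]})
    Y = [0, 1, 1, 0]

    model = XGBClassifier(n_estimators=2, max_depth=2)
    model.fit(X, Y)

    # The booster keeps the DataFrame column names
    print(model.get_booster().feature_names)  # ['age', 'income']

Note that this only works if fit received a DataFrame; with a plain NumPy array the booster has no names to return (see the comments below).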

Binyamin Even
  • As you can see in my answer (and even in the question), this is not the correct answer, since you lose the original feature names when you pass a NumPy array to the fit method. – Nerxis Feb 01 '21 at 10:53
  • That is why you should pass a DataFrame and not a NumPy array. – Binyamin Even Feb 04 '21 at 16:24
  • I do not agree. Yes, probably in most cases it's the best way to go. But in other cases (e.g. even in my current project) where you have a complicated data preparation process and work with NumPy arrays (for various reasons, e.g. performance, ...), it's much easier to pass the array directly. – Nerxis Feb 05 '21 at 07:42
  • And regarding your answer, you might add your note about using a DataFrame instead of a NumPy array, because right now it does not answer the question: the user is passing a NumPy array, so `model.get_booster().feature_names` does not work for him. – Nerxis Feb 05 '21 at 07:44
  • This does not work if the model has been saved and then loaded using save_model and load_model. – Brady Gilg Jun 12 '21 at 22:53
  • FWIW - in certain cases passing a DataFrame is not an option, and then `model.get_booster().feature_names` returns `None`. Combining this with @Nerxis's reply, I managed to SET the feature names before save_model, and then they were easily available after load_model. To clarify: `model.get_booster().feature_names = orig_feature_names` worked. – roy650 Aug 08 '22 at 07:42
12

You are right that when you pass a NumPy array to the fit method of XGBoost, you lose the feature names. In such a case, calling model.get_booster().feature_names is not useful, because the returned names are of the form [f0, f1, ..., fn], and these are the names shown in the output of the plot_importance method as well.

But there are several ways to achieve what you want, provided you stored your original feature names somewhere, e.g. orig_feature_names = ['f1_name', 'f2_name', ..., 'fn_name'], or directly orig_feature_names = X.columns if X was a pandas DataFrame.

Then you should be able to:

  • change the stored feature names (model.get_booster().feature_names = orig_feature_names) and then use the plot_importance method, which should pick up the updated names and show them on the plot (see the sketch after this list)
  • or, since this method returns a matplotlib Axes, you can modify the labels using plot_importance(model).set_yticklabels(orig_feature_names) (but you have to make sure the order of your features is correct)
  • or you can take model.feature_importances_ and combine it with your original feature names yourself (i.e. plot it yourself)
  • similarly, you can also use the model.get_booster().get_score() method and combine it with your feature names
  • or you can try the Learning API with an xgboost DMatrix and specify your feature names when creating the dataset (after scaling) with train_data = xgb.DMatrix(X, label=Y, feature_names=orig_feature_names) (but I do not have much experience with this way of training since I usually use the Scikit-Learn API)
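
To make the first option concrete, here is a minimal self-contained sketch (the data and column names are made up, and it assumes a reasonably recent XGBoost and scikit-learn):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import MinMaxScaler
    from xgboost import XGBClassifier, plot_importance

    # Hypothetical training data with named columns
    X_df = pd.DataFrame({"age": [25, 32, 47, 51, 62, 23],
                         "income": [40, 55, 80, 62, 75, 30]})
    Y = np.array([0, 1, 1, 0, 1, 0])
    orig_feature_names = list(X_df.columns)  # store the names before scaling

    # Scaling returns a plain NumPy array, so the names are lost here
    X = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_df)

    model = XGBClassifier(n_estimators=5, max_depth=2)
    model.fit(X, Y)

    # Option 1: write the original names back into the underlying booster
    model.get_booster().feature_names = orig_feature_names

    # plot_importance now labels the bars with the original names instead of f0, f1, ...
    plot_importance(model)
    plt.show()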

EDIT:

Thanks to @Noob Programmer (see the comments below), there might be some "inconsistencies" depending on which feature importance method you use. These are the most important ones:

  • xgboost.plot_importance uses "weight" as the default importance type (see plot_importance)
  • model.get_booster().get_score() also uses "weight" as the default (see get_score)
  • model.feature_importances_ depends on the importance_type parameter (model.importance_type), and it seems that the result is normalized to sum to 1 (see this comment)

For more info on this topic, look at How to get feature importance.
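
For illustration, a small sketch of how the three differ, continuing the hypothetical model from the sketch above (the exact values depend on your data and XGBoost version):

    import matplotlib.pyplot as plt
    from xgboost import plot_importance

    # "weight": how many times each feature is used in a split (the default here)
    print(model.get_booster().get_score(importance_type="weight"))

    # plot_importance accepts an importance_type argument, e.g. "gain"
    plot_importance(model, importance_type="gain")
    plt.show()

    # feature_importances_ follows model.importance_type and is normalized to sum to 1
    print(model.importance_type, model.feature_importances_)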

Nerxis
  • model.feature_importances_ and plot_importance(model, importance_type="gain") don't give out the same features, so that 3rd point is not legit. Are the numbers after f, like "f1001", indices of the features in the dataframe? – Noob Programmer Nov 29 '21 at 14:34
  • @NoobProgrammer: Thanks for the comment, see the updated answer. The result should be the same; the difference is the normalization. Feel free to update the answer if you think it's not clear enough. Regarding the numbers, yes, those should be indices of the features in the dataframe (or NumPy array, or any input data). That's why you can use `model.get_booster().feature_names = orig_feature_names`. Or you could parse those indices and use them directly on the resulting dict, for example. – Nerxis Nov 30 '21 at 08:49
0

I tried the above answers, and they didn't work when loading the model after training. So, the code that worked for me is:

    model.feature_names

It returns a list of the feature names.
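
As a rough sketch of the situation this answer seems to describe, assuming the model was saved in the JSON format, which persists feature names in recent XGBoost versions (the file name is hypothetical):

    import xgboost as xgb

    # Load a previously saved model into a raw Booster
    model = xgb.Booster()
    model.load_model("my_model.json")

    # Returns the stored feature names as a list, or None if none were saved
    print(model.feature_names)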

Lara Wehbe
0

I think it is best to turn the NumPy array back into a pandas DataFrame, e.g.:

    import pandas as pd
    import matplotlib.pyplot as plt
    import xgboost as xgb
    from sklearn.preprocessing import MinMaxScaler
    from xgboost import XGBClassifier


    Y = label  # the training labels, defined elsewhere

    X_df = pd.read_csv("train.csv")
    orig_feature_names = list(X_df.columns)

    # Scale, then wrap the resulting NumPy array back into a DataFrame with the original names
    scaler = MinMaxScaler(feature_range=(0, 1))
    X_scaled_np = scaler.fit_transform(X_df)
    X_scaled_df = pd.DataFrame(X_scaled_np, columns=orig_feature_names)

    my_model_name = XGBClassifier(max_depth=2, n_estimators=2)
    my_model_name.fit(X_scaled_df, Y)

    xgb.plot_importance(my_model_name)
    plt.show()

This will show the original names.

rarry