So I was running a CatBoost model in Python, which was pretty simple, basically:

from catboost import CatBoostClassifier, Pool, cv
catboost_model = CatBoostClassifier(
    cat_features=["categorical_variable_1", "categorical_variable_2"],
    loss_function="Logloss",
    eval_metric="AUC",
    iterations=200,
)

So I wanted to get the feature importance. With XGBoost Classifier, I could prepare a dataframe with the feature importance doing something like:

import pandas as pd
from datetime import datetime

importances = xgb_model.get_fscore()

feat_list = []
date = datetime.today()
for feature, importance in importances.items():
    feat_list.append([date, feature, importance])

feat_df = pd.DataFrame(feat_list, columns=['date', 'feature', 'importance'])

Now, I wanted to do the same thing with CatBoost features. I started by doing:

catboost_model.get_feature_importance(
    Pool(X_train, y_train, cat_features=["categorical_variable_1", "categorical_variable_2"])
)

But I don't know how to move on from this (which should be very simple, but I'm lost). Can anyone give me a hand?

desertnaut
dummyds

2 Answers

In short, you can do something like:

pd.DataFrame({'feature_importance': model.get_feature_importance(train_pool), 
              'feature_names': x_val.columns}).sort_values(by=['feature_importance'], 
                                                           ascending=False)
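If you want to see the pattern in action without a trained model, here is a minimal, self-contained sketch: the `importances` list stands in for `model.get_feature_importance(train_pool)` and the feature names are made up.

```python
import pandas as pd

# Stand-ins for model.get_feature_importance(train_pool) and x_val.columns
importances = [12.5, 40.1, 7.3]
names = ['age', 'income', 'city']

# Build the frame and sort so the most important feature comes first
fi_df = pd.DataFrame({'feature_importance': importances,
                      'feature_names': names}).sort_values(by=['feature_importance'],
                                                           ascending=False)
print(fi_df['feature_names'].tolist())  # ['income', 'age', 'city']
```

As a side note, newer CatBoost versions also accept `model.get_feature_importance(prettified=True)`, which returns a sorted DataFrame directly, so you may not need to build one by hand at all.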

You can also wrap this in a plotting function (I found the explanation on Analyseup.com here):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_feature_importance(importance, names, model_type):

    # Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    # Create a DataFrame using a dictionary
    data = {'feature_names': feature_names, 'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)

    # Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)

    # Define size of bar plot
    plt.figure(figsize=(10, 8))
    # Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    # Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

and plot the feature importance from different boosting algorithms:

#plot the xgboost result
plot_feature_importance(xgb_model.feature_importances_,train.columns,'XG BOOST')

#plot the catboost result
plot_feature_importance(cb_model.get_feature_importance(),train.columns,'CATBOOST')
Areza
  • This generates thounds of plots for me. Does this create feature importance plots for every tree which was fitted? – mugdi Apr 07 '22 at 19:18

Now you already have a DataFrame:

data = pd.DataFrame({'feature_importance': model.get_feature_importance(train_pool),
                     'feature_names': x_val.columns}).sort_values(by=['feature_importance'],
                                                                  ascending=False)

I found it easier to plot using the inbuilt pandas tools; for instance, for the top 20 features:

data[:20].sort_values(by=['feature_importance'], ascending=True).plot.barh(x='feature_names', y='feature_importance')
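Equivalently, you can skip the slice-then-re-sort step with pandas' `nlargest`, which picks the top rows in one call. A small sketch with made-up data (the frame stands in for the sorted `data` above, and I use top-2 instead of top-20 for brevity):

```python
import pandas as pd

# Made-up stand-in for the sorted importance frame above
data = pd.DataFrame({'feature_names': ['income', 'age', 'city', 'zip'],
                     'feature_importance': [40.1, 12.5, 7.3, 1.2]})

# data[:2] on a descending-sorted frame and nlargest(2, ...) select the same rows
top = data.nlargest(2, 'feature_importance')
print(top['feature_names'].tolist())  # ['income', 'age']
```

From there, the same `.plot.barh(...)` call applies to `top`.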