
I am trying to run LightGBM for feature selection as below.

Initialization:

# Imports used by the snippets below
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='goss',
                           n_estimators=10000,
                           class_weight='balanced')

Then I fit the model as below:

# Fit the model twice to avoid overfitting
for i in range(2):

    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(
        train_X, train_Y, test_size=0.25, random_state=i)

    # Train using early stopping
    model.fit(train_features, train_y,
              early_stopping_rounds=100,
              eval_set=[(valid_features, valid_y)],
              eval_metric='auc',
              verbose=200)

    # Record the feature importances
    feature_importances += model.feature_importances_

But I get the below error:

Training until validation scores don't improve for 100 rounds. 
Early stopping, best iteration is: [6]  valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,) 
Ian Okeyo
  • How do you initialize feature_importances? – Florian Mutel Nov 21 '18 at 14:36
  • @FlorianMutel see the updated post – Ian Okeyo Nov 22 '18 at 06:30
  • What is features_sample? How many features do you have? I cannot reproduce your bug with the Iris data, for example. It seems you are trying to add arrays with different shapes. Either you initialized with the wrong dimensions, or some of your features become empty (all NaN) or constant when you split your data (train / valid), and LightGBM ignores them. Try looking at your splits (see the sketch below)! – Florian Mutel Nov 22 '18 at 13:34
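
A minimal sketch of the shape check suggested in the comment above (an illustration, not from the original post; it assumes train_X, train_Y and model from the question, and the older LightGBM fit signature the question uses):

import numpy as np
from sklearn.model_selection import train_test_split

# Size the accumulator from the frame the model is actually trained on,
# not from a different frame such as features_sample
feature_importances = np.zeros(train_X.shape[1])

for i in range(2):
    train_features, valid_features, train_y, valid_y = train_test_split(
        train_X, train_Y, test_size=0.25, random_state=i)

    model.fit(train_features, train_y,
              eval_set=[(valid_features, valid_y)],
              eval_metric='auc',
              early_stopping_rounds=100,
              verbose=200)

    # One importance value per training column; if this assertion fails,
    # the accumulator was sized from a frame with a different width
    assert len(model.feature_importances_) == train_X.shape[1]
    feature_importances += model.feature_importances_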

4 Answers


An example of getting feature importance in LightGBM when the model was trained with lgb.train (i.e., a Booster object):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def plotImp(model, X , num = 20, fig_size = (40, 20)):
    feature_imp = pd.DataFrame({'Value':model.feature_importance(),'Feature':X.columns})
    plt.figure(figsize=fig_size)
    sns.set(font_scale = 5)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", 
                                                        ascending=False)[0:num])
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances-01.png')
    plt.show()
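
A minimal usage sketch for plotImp with the native lgb.train API (the toy data and parameters below are illustrative assumptions, not part of the answer):

import lightgbm as lgb
import numpy as np
import pandas as pd

# Toy data purely for illustration
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(500, 10)),
                       columns=[f'feat_{i}' for i in range(10)])
y_train = rng.integers(0, 2, size=500)

booster = lgb.train({'objective': 'binary', 'verbose': -1},
                    lgb.Dataset(X_train, label=y_train),
                    num_boost_round=100)

# X_train must be a DataFrame so that plotImp can read X_train.columns
plotImp(booster, X_train, num=10)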
Miguel Trejo
rosefun

Depending on whether we trained the model with the scikit-learn API or the native lightgbm API, to get the importances we should use the feature_importances_ property or the feature_importance() function, respectively, as in this example (where model is the result of lgbm.fit() / lgbm.train(), and train_columns = x_train_df.columns):

import pandas as pd

def get_lgbm_varimp(model, train_columns, max_vars=50):
    
    if "basic.Booster" in str(model.__class__):
        # lightgbm.basic.Booster was trained directly, so using feature_importance() function 
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        # Scikit-learn API LGBMClassifier or LGBMRegressor was fitted, 
        # so using feature_importances_ property
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T

    cv_varimp_df.columns = ['feature_name', 'varimp']

    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)

    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]   

    return cv_varimp_df
    

Note that we rely on the assumption that feature importance values are ordered just like the model matrix columns were ordered during training (incl. one-hot dummy cols), see LightGBM #209.
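
A quick usage sketch with toy data (the data and model below are illustrative, not from the answer); the Booster branch works the same way for a model returned by lgb.train:

import lightgbm as lgb
import numpy as np
import pandas as pd

# Toy data purely for illustration
rng = np.random.default_rng(0)
x_train_df = pd.DataFrame(rng.normal(size=(500, 8)),
                          columns=[f'col_{i}' for i in range(8)])
y_train = rng.integers(0, 2, size=500)

clf = lgb.LGBMClassifier(n_estimators=100)
clf.fit(x_train_df, y_train)

# Top variables by importance, via the scikit-learn branch of the helper
print(get_lgbm_varimp(clf, x_train_df.columns, max_vars=5))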

mirekphd
  • +1, but Re: "feature_importance() function is no longer available in LightGBM python API" Actually it is still there, I think you meant the Scikit-learn API. – Mustafa Aydın Oct 05 '20 at 09:43
  • `feature_importance()` is still [there](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance) indeed, I kindly suggest you update the answer accordingly. – desertnaut Nov 18 '20 at 23:43
  • This generalization should automatically detect which API was used for training and choose the appropriate method to get importance. – mirekphd Nov 22 '20 at 14:22

For LightGBM version 3.1.1, extending the comment of @user3067175:

pd.DataFrame({'Value':model.feature_importance(),'Feature':features}).sort_values(by="Value",ascending=False)

where features is a list of feature names in the same order as your dataset's columns; it can be replaced by features = df_train.columns.tolist(). This should return the feature importances in the same order as the plot.

Note: If you use LGBMRegressor or LGBMClassifier, you should use

pd.DataFrame({'Value':model.feature_importances_,'Feature':features}).sort_values(by="Value",ascending=False)
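
A short usage sketch tying the pieces together (df_train, target and the model below are toy assumptions, not from the answer):

import lightgbm as lgb
import numpy as np
import pandas as pd

# Toy data purely for illustration
rng = np.random.default_rng(0)
df_train = pd.DataFrame(rng.normal(size=(500, 6)),
                        columns=[f'f{i}' for i in range(6)])
target = rng.integers(0, 2, size=500)

model = lgb.LGBMClassifier(n_estimators=50).fit(df_train, target)

features = df_train.columns.tolist()
imp = (pd.DataFrame({'Value': model.feature_importances_, 'Feature': features})
         .sort_values(by='Value', ascending=False))
print(imp)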
mirekphd
  • It's essentially a copy of my answer from 1 year earlier... arguably also less universal and less readable :) Nevertheless making it more universal and upvoting for the nice use of chaining (and for pluralism's sake). – mirekphd Oct 14 '22 at 10:54
  • One day my answer will be outdated as well, and very humble of you, thanks :=) – Mehmet Burak Sayıcı Oct 19 '22 at 14:07

If you want to examine a loaded model for which you don't have the training data, you can get the feature importances and the feature names with:

df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        'importance': model.feature_importance(),
    })
    .sort_values('importance', ascending=False)
)
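
For completeness, a sketch of the loading step itself (the file name 'lgbm_model.txt' is an assumption; any model saved with booster.save_model() works):

import lightgbm as lgb
import pandas as pd

# Load a previously saved Booster; no training data is needed
model = lgb.Booster(model_file='lgbm_model.txt')

df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        # 'gain' is often more informative than the default 'split' count
        'importance': model.feature_importance(importance_type='gain'),
    })
    .sort_values('importance', ascending=False)
)
print(df_feature_importance.head())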
Louis Yang