Using Scikit-learn to determine feature importances per class in a RF model

Question

I have a one-hot encoded dataset, and my dependent variable is also binary. The first part of my code lists the important variables for the whole model. I used the method from the Stack Overflow post "Using scikit to determine contributions of each feature to a specific class prediction", but I am unsure how to read the output. The overall feature importance ranks "Delay Related DMS With Advice" as the most important feature in my case. I would expect this variable to be important for either Class 0 or Class 1, but in the per-class output it is unimportant for both. The code in that post also shows that, when the DV is binary, the Class 0 output is the exact opposite (in sign, +/-) of the Class 1 output. In my case, the values differ between the two classes.

Here is what the plots look like:

Feature Importance - Overall Model

Feature Importance - Class 0

Feature Importance - Class 1

The second part of my code shows cumulative feature importances, but the plot suggests that none of the variables are important. Is my formula wrong, my interpretation wrong, or both?

Here is my code:

import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import scale
from sklearn.ensemble import ExtraTreesClassifier


##get_ipython().run_line_magic('matplotlib', 'inline')

file = r'RCM_Binary.csv'
data = pd.read_csv(file)
print("data loaded successfully ...")

# Define features and target
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

#split to training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)

# define classifier and fitting data
forest = ExtraTreesClassifier(random_state=1)
forest.fit(X_train, y_train)

# predict and get confusion matrix
y_pred = forest.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

#Applying 10-fold cross validation
accuracies = cross_val_score(estimator=forest, X=X_train, y=y_train, cv=10)
print("accuracy (10-fold): ", np.mean(accuracies))

# Features importances
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
feature_list = [X.columns[indices[f]] for f in range(X.shape[1])]  #names of features.
ff = np.array(feature_list)

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f) name: %s" % (f + 1, indices[f], importances[indices[f]], ff[indices[f]]))


# Plot the feature importances of the forest
plt.figure()
plt.rcParams['figure.figsize'] = [16, 6]
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), ff[indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()


## The new additions to get feature importance to classes: 

# To get the importance according to each class:
def class_feature_importance(X, Y, feature_importances):
    N, M = X.shape
    X = scale(X)

    out = {}
    for c in set(Y):
        out[c] = dict(
            zip(range(N), np.mean(X[Y==c, :], axis=0)*feature_importances)
        )

    return out

result = class_feature_importance(X, y, importances)
print (json.dumps(result,indent=4))

# Plot the feature importances of the forest

titles = ["Did not Divert", "Diverted"]
for t, i in zip(titles, range(len(result))):
    plt.figure()
    plt.rcParams['figure.figsize'] = [16, 6]
    plt.title(t)
    plt.bar(range(len(result[i])), result[i].values(),
           color="r", align="center")
    plt.xticks(range(len(result[i])), ff[list(result[i].keys())], rotation=90)
    plt.xlim([-1, len(result[i])])
    plt.show()

The second part of the code:

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

# list of x locations for plotting
x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical', color = 'r', edgecolor = 'k', linewidth = 1.2)
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');


# List of features sorted from most to least important
sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
# Cumulative importances
cumulative_importances = np.cumsum(sorted_importances)
# Make a line graph
plt.plot(x_values, cumulative_importances, 'g-')
# Draw line at 95% of importance retained
plt.hlines(y = 0.95, xmin=0, xmax=len(sorted_importances), color = 'r', linestyles = 'dashed')
# Format x ticks and labels
plt.xticks(x_values, sorted_features, rotation = 'vertical')
# Axis labels and title
plt.xlabel('Variable'); plt.ylabel('Cumulative Importance'); plt.title('Cumulative Importances');
plt.show()
# Find number of features for cumulative importance of 95%
# Add 1 because Python is zero-indexed
print('Number of features for 95% importance:', np.where(cumulative_importances > 0.95)[0][0] + 1)
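
For reference, the cumulative calculation can be sketched on made-up numbers as follows. This assumes the names and the importances are reordered with the same argsort index and that nothing is rounded before summing. In the listing above, feature_list is already sorted while importances is not, and rounding to two decimals can zero out small importances, so either may distort the curve. The arrays `names` and `importances` here are hypothetical stand-ins for the columns and a fitted model's feature_importances_.

```python
import numpy as np

# Hypothetical importances for five features (they sum to 1, as
# feature_importances_ does for a fitted forest).
names = np.array(["a", "b", "c", "d", "e"])
importances = np.array([0.05, 0.40, 0.02, 0.33, 0.20])

# Sort names and importances with the SAME index so pairs stay aligned.
order = np.argsort(importances)[::-1]
sorted_names = names[order]
sorted_importances = importances[order]

# Cumulative sum on the unrounded values.
cumulative = np.cumsum(sorted_importances)

# Number of features needed to reach 95% of the total importance.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(list(zip(sorted_names, cumulative)), n_needed)
```

With real data, `names` would be `X.columns.to_numpy()` and `importances` would be `forest.feature_importances_`.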

Welcome to StackOverflow. In order to get good answers quickly, include a [Minimal, Complete, and Verifiable Example](https://stackoverflow.com/help/mcve) in your post. In your case, there's no starting data to work from, so others can only examine your code. In that case, all of the graphing sections of your code aren't necessary, and just make it hard to figure out what's going on. If the graphs are illustrative to your point, then you'll need to provide some example data that can be loaded through your code, and which is representative of the actual data you're working with. — andrew_reece, May 06 '18 at 16:26
@andrew_reece I edited my post to reflect your points. Being a new user, I can only add links and not images. I also broke down the code so it is easier to examine the code. — James Bond, May 06 '18 at 17:05

score 1 · Answer 1 · edited Jan 02 '21 at 19:29

The question might be outdated, but just in case anyone is interested:

The class_feature_importance function you copied from your source treats rows as features and columns as samples, whereas your data, like most people's, is laid out the other way round (rows are samples, columns are features). As a result, the calculation of feature importances per class goes awry. Changing the zip call to

zip(range(M), np.mean(X[Y==c, :], axis=0)*feature_importances)

should solve it.
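
The suggested change can be sketched end to end as below. This is a minimal illustration, not the original poster's pipeline: the data comes from `make_classification` as a hypothetical stand-in for RCM_Binary.csv, and the variable names `X_demo`, `y_demo`, `n_samples`, and `n_features` are assumptions introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import scale


def class_feature_importance(X, y, feature_importances):
    """Weight each feature's importance by its mean standardized value per class."""
    X = scale(np.asarray(X, dtype=float))  # zero mean, unit variance per column
    y = np.asarray(y)
    n_samples, n_features = X.shape
    out = {}
    for c in np.unique(y):
        class_mean = X[y == c].mean(axis=0)  # one value per feature
        # Keys run over features (columns), not samples (rows).
        out[c.item()] = dict(zip(range(n_features), class_mean * feature_importances))
    return out


# Synthetic, deliberately imbalanced data (hypothetical stand-in for the real CSV).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, weights=[0.7, 0.3], random_state=0
)
forest = ExtraTreesClassifier(n_estimators=50, random_state=1).fit(X_demo, y_demo)
result = class_feature_importance(X_demo, y_demo, forest.feature_importances_)
print(result)
```

Because `scale` centers every column, the class means satisfy n0 * mean0 + n1 * mean1 = 0 for each feature. The Class 0 and Class 1 values therefore have opposite signs, but they are exact mirror images only when both classes have the same number of samples, which may explain why the two class plots in the question do not mirror each other.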

answered Nov 08 '18 at 15:20, K.Hilbert; edited Jan 02 '21 at 19:29, Victor Sergienko

score 0 · Answer 2 · edited Jan 02 '21 at 19:27

Also make sure that your y variable is not an array. If it is an array you could just use

np.mean(X[Y==c])
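
One way to read this advice is sketched below, assuming y arrives as a pandas Series (as it does after `data.iloc[:, -1]`) and is converted to a NumPy array before being used as a boolean mask. The toy frame and its column names are hypothetical. Note that `np.mean` without `axis` collapses everything to a single number, so `axis=0` is kept here on the assumption that one mean per feature is wanted.

```python
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical toy frame: two one-hot style features and a binary target.
df = pd.DataFrame(
    {
        "f0": [1, 0, 1, 0, 1, 0],
        "f1": [0, 0, 1, 1, 1, 0],
        "target": [1, 0, 1, 0, 1, 1],
    }
)

X = scale(df[["f0", "f1"]].to_numpy(dtype=float))  # NumPy array after scaling
y = df["target"].to_numpy()  # plain NumPy array, so masking is positional

mask = y == 1
per_feature_mean = X[mask].mean(axis=0)  # one mean per feature
overall_mean = X[mask].mean()  # a single scalar over all selected cells
print(per_feature_mean, overall_mean)
```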

answered Oct 16 '20 at 14:03, Ravi Singh; edited Jan 02 '21 at 19:27, Stack
