
For the code given below, I am getting different bar plots for the SHAP values.

In this example, I have a dataset of 1000 training samples with 10 classes and 500 test samples. I then use a random forest as the classifier and fit a model. When I go about generating the SHAP bar plots, I get different results in these two scenarios:

shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)


and then:

explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)


Can you explain what the difference is between the two plots, and which one should be used for feature importance?

Here is my code:

from sklearn.datasets import make_classification
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pickle
import joblib
import warnings
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

f, (ax1,ax2) = plt.subplots(nrows=1, ncols=2,figsize=(20,8))
# Generate noisy Data
X_train,y_train = make_classification(n_samples=1000, 
                          n_features=50, 
                          n_informative=9, 
                          n_redundant=0, 
                          n_repeated=0, 
                          n_classes=10, 
                          n_clusters_per_class=1,
                          class_sep=9,
                          flip_y=0.2,
                          #weights=[0.5,0.5], 
                          random_state=17)

X_test,y_test = make_classification(n_samples=500, 
                          n_features=50, 
                          n_informative=9, 
                          n_redundant=0, 
                          n_repeated=0, 
                          n_classes=10, 
                          n_clusters_per_class=1,
                          class_sep=9,
                          flip_y=0.2,
                          #weights=[0.5,0.5], 
                          random_state=17)

model = RandomForestClassifier()

parameter_space = {
    'n_estimators': [10,50,100],
    'criterion': ['gini', 'entropy'],
    'max_depth': np.linspace(10, 50, 11).astype(int),  # cast to int: max_depth must be an integer
}

clf = GridSearchCV(model, parameter_space, cv = 5, scoring = "accuracy", verbose = True) # model
my_model = clf.fit(X_train,y_train)
print(f'Best Parameters: {clf.best_params_}')

# save the model to disk
filename = 'Test-RF.sav'
pickle.dump(clf, open(filename, 'wb'))

shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)

explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)

shap.plots.bar(shap_values)

Thanks for your help and time!

Joe
  • You can't draw `X_train` and `X_test` from different distributions. This is not the way ML is supposed to work. What you're doing is akin to learning English and then going to a Chinese exam (or training your NN on cats and then trying to predict dogs). You draw a dataset once and then split it with `train_test_split` like I do in the answer. – Sergey Bushmanov Sep 13 '22 at 17:20

1 Answer


There are 2 problems with your code:

  1. It's not reproducible
  2. You seem to be missing some important concepts in the SHAP package, namely what data is used to "train" the explainer ("true to model" or "true to data" explanation) and what data is used to predict SHAP values.

As far as the first one is concerned, you may find many tutorials and even books online; a minimal illustration follows below.
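For instance, a minimal sketch of the fix suggested in the comment under the question (draw one dataset, then split it; the make_classification parameters simply echo the question's):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Draw a single dataset and split it, instead of sampling
# train and test sets independently
X, y = make_classification(n_samples=1500, n_features=50, n_informative=9,
                           n_classes=10, n_clusters_per_class=1,
                           class_sep=9, flip_y=0.2, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1000, random_state=17)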

Concerning the second:

shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)

is different to:

explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)

because:

  1. The first uses the trained trees to predict, whereas the second uses the supplied X_test dataset to calculate SHAP values.
  2. Moreover, when you say

shap.Explainer(clf.best_estimator_.predict, X_test)

I'm pretty sure it's not the whole X_test dataset that is used to train your explainer, but rather a subset of 100 datapoints from it (see the sketch just below).
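To make the background size explicit instead of relying on the default, you can pass a masker yourself. A minimal sketch, assuming shap is imported as in the question:

from shap import maskers

# Explicitly subsample the background data to at most 100 rows
# (this mirrors what shap does by default when handed raw data)
background = maskers.Independent(X_test, max_samples=100)
explainer2 = shap.Explainer(clf.best_estimator_.predict, background)
shap_values = explainer2(X_test)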

  3. Finally,
shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)

is different to

explainer2(X_test)

in that in the first case you're predicting (and averaging) over X_train, whereas in the second you're predicting (and averaging) over X_test. It's easy to confirm this by comparing the shapes, as sketched below.
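A minimal sketch of that shape check, assuming the variables from the question's code (older shap versions return a list with one array per class from shap_values for multiclass models):

import numpy as np

# one (1000, 50) array per class, computed over X_train
print(np.array(shap_values_Tree_tr).shape)  # (10, 1000, 50)

# a single (500, 50) set of values, computed over X_test
print(shap_values.shape)                    # (500, 50)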

So, how to reconcile the two? See below for a reproducible example:

1. Imports, model, and data to train explainers on:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from shap import maskers
from shap import TreeExplainer, Explainer

X, y = make_classification(1500, 10)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1000, random_state=42) 

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

background = maskers.Independent(X_train, 10) # data to train both explainers on

2. Compare explainers:

exp = TreeExplainer(clf, background)
sv = exp.shap_values(X_test)

exp2 = Explainer(clf, background)
sv2 = exp2(X_test)

np.allclose(sv[0], sv2.values[:,:,0])

True
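To tie this back to the question's bar plots: once both explainers share the same background data and the same dataset to explain, the plots agree as well. A sketch, picking class 0 from the example above:

import shap

# legacy API: bar chart of mean(|SHAP|) for class 0 over X_test
shap.summary_plot(sv[0], X_test, plot_type="bar")

# new API: the same aggregation, sliced out of the Explanation object
shap.plots.bar(sv2[:, :, 0])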

I perhaps should have stated this from the very beginning: the two are guaranteed to show the same results (if used correctly), as the Explainer class is a superset of TreeExplainer (it uses the latter when it sees a tree model).

Please ask questions if something is not clear.

Sergey Bushmanov
  • Thanks for your answer. As I have already trained the classifier with `RandomForest` [`clf = GridSearchCV(model, parameter_space, cv = 5, scoring = "accuracy", verbose = True)`] to gain insights into the model, would the correct methodology to use the `shap` values, if I wanted to see what effect class 6 has on the model, be the commands `shap_values = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train); shap.summary_plot(shap_values[6], X_train)`? – Joe Aug 14 '22 at 22:16
  • Python has 0-based numbering; so, the 6th element of an array is accessed with 5. – Sergey Bushmanov Aug 15 '22 at 03:18
  • I know this may seem like I am digressing from the original question, but when I try to get the `waterfall` plot, only if I run the commands `explainer2 = shap.Explainer(clf.best_estimator_.predict, X_train); shap_values = explainer2(X_train)` can I easily get the `waterfall` plot with `shap.plots.waterfall(shap_values_recalled[6])`. If I just use `explainer = Explainer(clf.best_estimator_); shap_values_tr1 = explainer.shap_values(X_train)`, I get an error when I run `shap.plots.waterfall(shap_values[6])`. I don't know why. – Joe Aug 15 '22 at 04:14
  • You're welcome to ask this as a question. – Sergey Bushmanov Aug 15 '22 at 04:15
  • I submitted a new question: https://stackoverflow.com/questions/73356915/i-get-an-error-when-using-shap-plots-waterfall-after-generating-the-shap-values – Joe Aug 15 '22 at 04:28
  • I just noticed that you do not use `.predict` in your examples for exp and exp2. Can you explain why not? Thanks! – Joe Aug 15 '22 at 08:41
  • SHAP `Explainer` and `TreeExplainer` take a `model` type as the first argument. You may wish to check the docs. – Sergey Bushmanov Aug 15 '22 at 09:13
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/247279/discussion-between-joe-and-sergey-bushmanov). – Joe Aug 15 '22 at 10:25