
I'm trying to create a force_plot for my Random Forest model that has two classes (1 and 2), but I am a bit confused about the parameters for the force_plot.

There are two variants of the force_plot call I can use:

shap.force_plot(explainer.expected_value[0], shap_values[0], chosen_instance, show=True, matplotlib=True)

(attached plot: expected value and SHAP values at index 0)

shap.force_plot(explainer.expected_value[1], shap_values[1], chosen_instance, show=True, matplotlib=True)

(attached plot: expected value and SHAP values at index 1)

So my questions are:

  1. When creating the force_plot, I must supply expected_value. For my model I have two expected values: [0.20826239, 0.79173761]. How do I know which one to use? My understanding is that the expected value is the average prediction of my model on the training data. Are there two values because I have both class_1 and class_2? So for class_1 the average prediction is 0.20826239, and for class_2 it is 0.79173761?

  2. The next parameter is shap_values. For my chosen instance:

        index   B    G    R    Prediction
       113833  107  119  237      2
    

I get the following SHAP values:

[array([[ 0.01705462, -0.01812987,  0.23416978]]), 
 array([[-0.01705462,  0.01812987, -0.23416978]])]

I don't quite understand why I get two sets of SHAP values. Is one for class_1 and one for class_2? I have been trying to compare the attached images, generated with each set of SHAP values and expected value, but I can't really explain what is going on in terms of the prediction.

Penguines

1 Answer


Let's try a reproducible example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from shap.maskers import Independent
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)

Then, your SHAP expected values are:

masker = Independent(data=X_train)
explainer = TreeExplainer(model, data=masker)
ev = explainer.expected_value
ev

array([0.35468973, 0.64531027])

This is what your model predicts on average, given the background dataset (fed to the explainer above):

model.predict_proba(masker.data).mean(0)

array([0.35468973, 0.64531027])

Then, if you have a datapoint of interest:

data_to_explain = X_train[[0]]
model.predict_proba(data_to_explain)  

array([[0.00470234, 0.99529766]])

You can achieve exactly the same with SHAP values:

sv = explainer.shap_values(data_to_explain)
np.array(sv).sum(2).ravel() 

array([-0.34998739,  0.34998739])

Note, they are symmetrical, because whatever increases the chances towards class 1 decreases the chances for class 0 by the same amount.
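You can verify this symmetry directly with the two arrays from the question (a minimal numpy check; the numbers are copied from the question above):

```python
import numpy as np

# Per-feature SHAP values for the chosen instance, one array per class
sv_class0 = np.array([ 0.01705462, -0.01812987,  0.23416978])
sv_class1 = np.array([-0.01705462,  0.01812987, -0.23416978])

# Each feature's push towards class 1 is the exact negative of its
# push towards class 0, so the two arrays cancel elementwise
print(np.allclose(sv_class0 + sv_class1, 0.0))  # True
```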

With the base values and SHAP values, the probabilities (the chances of the data point ending up in class 0 or class 1) are:

ev + np.array(sv).sum(2).ravel()

array([0.00470234, 0.99529766])

Note, this is the same as the model's predictions.

Sergey Bushmanov
  • Hi Sergey, thanks for the great answer. I'm still unsure whether I should use the first or second array from shap_values; does it simply depend on whether I want to show the chances towards class 0 or 1? I see they are symmetrical, but let's say I want to use the SHAP values to find similarity, would I then simply pick shap_values[0] or shap_values[1]? – Penguines Mar 27 '22 at 10:56
  • I'm not quite following. Shap values are meant to explain scores produced by models (relying on game theory approach proposed by Shapley). What do you mean "to use the shap values to find similarity"? The way I imagine this, you'd have a single similarity score and m sv's (n-datapoints x m-features array). – Sergey Bushmanov Mar 27 '22 at 13:08