5

I need to plot how each feature impacts the predicted probability for each sample from my LightGBM binary classifier, so I need the SHAP values expressed in probability rather than in raw (log-odds) space. The package does not appear to have an option for that.

The example code below is what I use to generate a dataframe of SHAP values and to draw a force_plot for the first data sample. Does anyone know how I should modify the code to change the output? I'm new to SHAP values and the shap package. Thanks a lot in advance.

import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.2)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)


explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)

# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer.expected_value[class_idx]
shap_value = shap_values[:,:,class_idx].values[row_idx]

shap.force_plot(
    base_value=expected_value,
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
    matplotlib=True,
)

# dataframe of shap values for class 1
shap_df = pd.DataFrame(shap_values[:, :, 1].values, columns=shap_values.feature_names)
Sergey Bushmanov
zesla

3 Answers

8

TL;DR:

You can plot results in probability space by passing link="logit" to the force_plot method:

import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit

shap.initjs()

data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)

explainer_raw = shap.TreeExplainer(model)
shap_values = explainer_raw(X_train)

# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer_raw.expected_value[class_idx]
shap_value = shap_values[:, :, class_idx].values[row_idx]

shap.force_plot(
    base_value=expected_value,
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
    link="logit",
)

Expected output:

[image: force plot in probability space]
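In effect, link="logit" maps the raw log-odds positions on the plot through the sigmoid, so the axis reads in probabilities. A minimal sketch of that mapping, using the class-1 base value and summed SHAP values derived further down in this answer:

```python
from scipy.special import expit

base_raw = 2.48397519              # raw (log-odds) expected value for class 1
fx_raw = 2.48397519 + 3.40619584   # base value + summed SHAP values for the row

print(expit(base_raw))  # base value on the probability axis, ≈ 0.923
print(expit(fx_raw))    # model output f(x) shown on the plot, ≈ 0.997
```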

Alternatively, you can achieve the same by explicitly specifying model_output="probability" when constructing the explainer:

explainer = shap.TreeExplainer(
    model,
    data=X_train,
    feature_perturbation="interventional",
    model_output="probability",
)
shap_values = explainer(X_train)

# force plot of first row for class 1
class_idx = 1
row_idx = 0

shap_value = shap_values.values[row_idx]
# with model_output="probability", the base values are already in probability space
expected_value = shap_values.base_values[row_idx]

shap.force_plot(
    base_value=expected_value,
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
)

Expected output:

[image: force plot in probability space]

However, to understand what's happening here, it's more instructive to trace where these figures come from:

  1. Our target proba for the point of interest:
model_proba = model.predict_proba(X_train.iloc[[row_idx]])
model_proba
# array([[0.00275887, 0.99724113]])
  2. Base case raw score from the model, given X_train as background (note: LightGBM outputs the raw score for class 1):
model.predict(X_train, raw_score=True).mean()
# 2.4839751932445577
  3. Base case raw score from SHAP (note: the two classes are symmetric):
bv = explainer_raw(X_train).base_values[0]
bv
# array([-2.48397519,  2.48397519])
  4. Raw SHAP values summed over features for the point of interest:
sv_0 = explainer_raw(X_train).values[row_idx].sum(0)
sv_0
# array([-3.40619584,  3.40619584])
  5. Proba inferred from SHAP values (via the sigmoid):
shap_proba = expit(bv + sv_0)
shap_proba
# array([0.00275887, 0.99724113])
  6. Check:
assert np.allclose(model_proba, shap_proba)

Please ask questions if something is not clear.

Side notes

Probabilities can be misleading if you're analyzing the raw effect sizes of different features, because the sigmoid is non-linear and saturates beyond a certain threshold.

Some people expect to see SHAP values in probability space as well, but this is not feasible because:

  • SHAP values are additive by construction (to be precise SHapley Additive exPlanations are average marginal contributions over all possible feature coalitions)
  • a non-linear map like the sigmoid is not additive: exp(a + b) != exp(a) + exp(b)
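A quick numeric illustration of that non-additivity (the numbers here are arbitrary): summing in the raw log-odds space and then applying the sigmoid gives a valid probability, whereas applying the sigmoid term by term does not:

```python
from scipy.special import expit

base, a, b = 0.5, 1.2, -0.7  # arbitrary base value and two raw SHAP contributions

p_total = expit(base + a + b)                    # valid probability from summed raw scores
p_termwise = expit(base) + expit(a) + expit(b)   # not a probability at all

print(p_total)     # ≈ 0.731
print(p_termwise)  # ≈ 1.723, exceeds 1
```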

You may find useful:

  1. Feature importance in a binary classification and extracting SHAP values for one of the classes only answer

  2. How to interpret base_value of GBT classifier when using SHAP? answer

Sergey Bushmanov
  • Can I ask why you summed the base_value with the SHAP values? I understand that the base value is the model's output without any features, i.e. the mean prediction. But why sum these values? Shouldn't we take the values and pass them through the logistic function to get the probability? – Amir Jalilifard Jun 02 '23 at 12:48
0

You can consider running your output values through a softmax() function. For reference, it can be defined as:

def get_softmax_probabilities(x):
    x = np.asarray(x)
    # normalize along the last axis so each row of scores sums to 1
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

and there is a scipy implementation as well:

from scipy.special import softmax

The output from softmax() will be probabilities proportional to the (relative) values in the vector x, which here would be your SHAP values.
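As a sanity check, here is a row-wise softmax on an arbitrary 2-D array of scores, computed directly and compared against the scipy implementation:

```python
import numpy as np
from scipy.special import softmax

x = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 0.5]])

# normalize each row independently so every row sums to 1
probs = np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

print(probs.sum(axis=1))                         # [1. 1.]
print(np.allclose(probs, softmax(x, axis=-1)))   # True
```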

0
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)

model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# plot
# shap.summary_plot(shap_values[class_idx], X_train, plot_type='bar')
# shap.summary_plot(shap_values[class_idx], X_train)

# shap_value = shap_values[:,:,class_idx].values[row_idx]
# shap.force_plot (base_value = expected_value,  shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# # dataframe of shap values for class 1
# shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)

# verification
def verification(index_number, class_idx):
    print('-----------------------------------')
    print('index_number: ', index_number)
    print('class_idx: ', class_idx)
    print('')

    y_base = explainer.expected_value[class_idx]
    print('y_base: ', y_base)

    player_explainer = pd.DataFrame()
    player_explainer['feature_value'] = X_train.iloc[index_number].values
    player_explainer['shap_value'] = shap_values[class_idx][index_number]
    print('verification: ')
    print('y_base + sum_of_shap_values: %.2f' % (y_base + player_explainer['shap_value'].sum()))
    print('y_true: %.2f' % (y_train[index_number]))

j = 10  # row index
verification(j, 0)
verification(j, 1)

# show: 
# X_train:  (455, 30)
# X_test:  (114, 30)
# -----------------------------------
# index_number:  10
# class_idx:  0

# y_base:  -2.391423081639827
# verification: 
# y_base + sum_of_shap_values: -9.40
# y_true: 1.00
# -----------------------------------
# index_number:  10
# class_idx:  1

# y_base:  2.391423081639827
# verification: 
# y_base + sum_of_shap_values: 9.40
# y_true: 1.00
# Class 1 has the larger raw score (9.40 vs. -9.40), which matches the true
# label, so the SHAP decomposition is consistent with the model.
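The raw scores above are log-odds; to express them as probabilities, push them through the sigmoid. A small sketch using the class-1 value from the output above:

```python
from scipy.special import expit

raw_class1 = 9.40              # y_base + sum of SHAP values for class 1 above
p_class1 = expit(raw_class1)   # sigmoid: 1 / (1 + exp(-x))
print(p_class1)                # ≈ 0.9999
```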

This achieves what you asked for and verifies the reliability of the results.

lazy