
I recently discovered this amazing library for ML interpretability. I decided to build a simple xgboost classifier using a toy dataset from sklearn and to draw a force_plot.

To understand the plot, the library documentation says:

The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue (these force plots are introduced in our Nature BME paper).

So it looks to me as if the base_value should be the same as clf.predict(X_train).mean(), which equals 0.637. However this is not the case when looking at the plot; the number is actually not even within [0, 1]. I tried taking the log in different bases (10, e, 2), assuming it would be some kind of monotonic transformation... but still no luck. How can I get to this base_value?

!pip install shap

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)

print(clf.predict(X_train).mean())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])
G. Macia

1 Answer


To get the base_value in raw score space (when link="identity") you need to unwind class labels --> probabilities --> raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid (logit) of the predicted probability:

import numpy as np

# probabilities
y = clf.predict_proba(X_train)[:,1]
# raw scores, default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))
2.065861773054686
[ True]

The relevant plot for the 0th data point in raw space:

shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")

[force plot for the 0th data point in raw space]

Should you wish to switch to sigmoid probability space (link="logit"):

from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print(expit(y_raw))
0.8875405774316522

The relevant plot for the 0th data point in probability space (the same call as in the full example below):
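
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")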

[force plot for the 0th data point in probability space]

Note that the probability base_value from shap's perspective, i.e. what they call the baseline probability if no data is available, is not what a reasonable person would define as the baseline with no independent variables (0.6373626373626373 in this case).
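
For contrast, a quick sketch (assuming the clf and X_train fitted above) comparing the value the question expected with SHAP's probability-space base value:

from scipy.special import expit, logit

# value the question expected: average predicted class label over the training set
print(clf.predict(X_train).mean())   # ~0.637

# SHAP's base value in probability space: sigmoid of the mean raw (log-odds) score
p = clf.predict_proba(X_train)[:,1]
print(expit(logit(p).mean()))        # ~0.8875, what the force plot shows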


Full reproducible example:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)

from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print("Expected raw score (before sigmoid):", y_raw)
print("Expected probability:", expit(y_raw))

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")

Output:

0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522

[force plot for the 0th data point in probability space, from the full example]
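
As a sanity check on the logit link (a small sketch, assuming the clf and X_train fitted above, not part of the original answer): with the default binomial deviance loss, the raw scores returned by decision_function map to predict_proba through the sigmoid, which is why the raw base value is the inverse sigmoid of a probability:

import numpy as np
from scipy.special import expit

# raw log-odds scores from the fitted model
raw = clf.decision_function(X_train)

# the sigmoid of the raw score recovers the class-1 probability
print(np.allclose(expit(raw), clf.predict_proba(X_train)[:,1]))   # True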

Sergey Bushmanov
  • Great! Makes more sense now. Could you add a link to the deviance loss in the reply? I would like to see the actual formula xgboost uses and why sigmoid is the inverse. – G. Macia Nov 03 '20 at 10:40
  • @G.Macia you keep on referring to xgboost while your question is about the GBT classifier in scikit-learn (I have edited your title); see the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) for the default setting `loss='deviance'` – desertnaut Nov 03 '20 at 10:59
  • Should it be `shap_values[0,:]` or `shap_values[1,:]` in the plot(s)? – desertnaut Nov 03 '20 at 12:12
  • @desertnaut My understanding `0` or `1` is the row index for the data point of interest – Sergey Bushmanov Nov 03 '20 at 12:15
  • You are right, since here you have kept only the `[:,1]` elements in `y` (i.e. probability of class 1). Regarding the `expected_value`, it is supposed to be the average prediction by the model in the underlying dataset (straightforward in regression but maybe not so much here), and not when no data is available. I agree nevertheless that this is not what most people would consider a baseline (excellent answer BTW, sorry I cannot upvote twice). – desertnaut Nov 03 '20 at 12:31