What do maskers really do in the SHAP package, and should they be fit to train or test data?

Question

I have been trying to work with the shap package. I want to determine the shap values from my logistic regression model. Contrary to the TreeExplainer, the LinearExplainer requires a so-called masker. What exactly does this masker do and what is the difference between the independent and partition maskers?

Also, I am interested in the important features from the test-set. Do I then fit the masker on the training set or the test set? Below you can see a snippet of code.

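```python
import shap
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)

# Option 1: background data taken from the training set
masker = shap.maskers.Independent(data=X_train)
# Option 2: background data taken from the test set
# masker = shap.maskers.Independent(data=X_test)

explainer = shap.LinearExplainer(model, masker=masker)
shap_val = explainer(X_test)
```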

Sergey Bushmanov · Accepted Answer · 2022-04-02T16:11:47.990

The masker class supplies the background data that your explainer is "trained" against. I.e., in:

```python
explainer = shap.LinearExplainer(model, masker=masker)
```

you're using the background data held by the masker (you can see which data is used by accessing the masker.data attribute). You may read more about "true to model" or "true to data" explanations here or here.
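To make the role of the background data concrete, here is a minimal sketch in plain Python. It assumes a linear model f(x) = w·x + b (for a logistic regression, f is the log-odds) and features treated as independent, in which case the SHAP value of feature i reduces to w_i * (x_i - mean of feature i over the background). The helper names and the toy numbers are made up for illustration; this is not shap's internal code.

```python
# Closed-form SHAP values for a linear model with independent features.
# Illustrative sketch only: helper names and numbers are hypothetical.

def column_means(rows):
    """Mean of each feature over the background rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def linear_shap(weights, bias, background, x):
    """Return (base_value, shap_values) for a single sample x."""
    means = column_means(background)
    # Base value: the model output at the average background point.
    base_value = sum(w * m for w, m in zip(weights, means)) + bias
    # Each feature's contribution is measured relative to its background mean.
    shap_values = [w * (xi - m) for w, xi, m in zip(weights, x, means)]
    return base_value, shap_values

weights, bias = [2.0, -1.0], 0.5
background_a = [[0.0, 0.0], [2.0, 4.0]]  # feature means: [1.0, 2.0]
background_b = [[3.0, 1.0], [5.0, 3.0]]  # feature means: [4.0, 2.0]
x = [3.0, 1.0]                           # the sample to explain

base_a, phi_a = linear_shap(weights, bias, background_a, x)
base_b, phi_b = linear_shap(weights, bias, background_b, x)

# Model output (log-odds for a logistic regression) for x.
prediction = sum(w * xi for w, xi in zip(weights, x)) + bias

print(base_a, phi_a)  # 0.5 [4.0, 1.0]
print(base_b, phi_b)  # 6.5 [-2.0, 1.0]
```

With either background, the base value plus the SHAP values adds up to the same model output for x, but the attribution per feature differs. That is why the choice of background data matters.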

Given the above, calculation-wise you may do either:

```python
masker = shap.maskers.Independent(data=X_train)
```

or

```python
masker = shap.maskers.Independent(data=X_test)
explainer = shap.LinearExplainer(model, masker=masker)
```

but conceptually, imo the following makes more sense:

```python
masker = shap.maskers.Independent(data=X_train)
explainer = shap.LinearExplainer(model, masker=masker)
```

This is akin to the usual train/test paradigm, where you train your model (and explainer) on the training data and then predict (and explain) your test data.

Unrelated to the question: an alternative to a masker, which samples the background data for you, is to provide the background explicitly. This allows comparing two data points: a reference point to compare against and the point of interest, like in this notebook. In this way one may find out why two seemingly similar data points were classified differently.
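As a rough illustration of the two-point comparison, assume again a linear model with independent features. If the background is a single reference row, the SHAP value of feature i becomes w_i * (x_i - reference_i), and the values sum to the difference between the two model outputs. The function name and numbers below are hypothetical and only sketch the idea; they are not taken from the linked notebook.

```python
# Explaining one point relative to a single reference point
# for a linear model. Illustrative sketch with made-up numbers.

def contrast_with_reference(weights, reference, x):
    """Per-feature contribution to f(x) - f(reference) for a linear f."""
    return [w * (xi - ri) for w, xi, ri in zip(weights, x, reference)]

def linear_output(weights, bias, row):
    """Model output w.x + b (log-odds for a logistic regression)."""
    return sum(w * v for w, v in zip(weights, row)) + bias

weights, bias = [2.0, -1.0], 0.5
reference = [1.0, 1.0]   # the point to compare against
x = [1.5, 3.0]           # the point of interest

phi = contrast_with_reference(weights, reference, x)
gap = linear_output(weights, bias, x) - linear_output(weights, bias, reference)

print(phi)  # [1.0, -2.0]
print(gap)  # -1.0
```

Here the second feature pulls the output down by more than the first pushes it up, which accounts for the whole gap between the two points.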

I wonder whether SHAP values from the same model object should differ depending on the test data, or whether one model should have only one set of SHAP values? — HC_2016, Feb 03 '23 at 16:21
@HC_2016 They will be different. You may google for "true to model" or "true to data" discussions on github or check out [this](https://arxiv.org/abs/2006.16234) article — Sergey Bushmanov, Feb 03 '23 at 20:40
