I'm trying to use LIME to explain a binary classification model that I've trained using XGBoost. I run into an error when calling the explain() function from LIME, which implies that the columns in my model (or explainer) don't match those in the new data whose predictions I'm trying to explain.
This vignette for LIME does demonstrate a version with xgboost, but it's a text problem, which is a little different from my tabular data. This question seems to hit the same error, but also for a document term matrix, which obscures the solution for my case. I've worked up a minimal example with mtcars which produces exactly the same errors I get in my own larger dataset.
library(pacman)
p_load(tidyverse)
p_load(xgboost)
p_load(Matrix)
p_load(lime)
### Prepare data with partition
df <- mtcars %>% rownames_to_column()
length <- df %>% nrow()
df_train <- df %>% select(-rowname) %>% head((length-10))
df_test <- df %>% select(-rowname) %>% tail(10)
### Transform data into matrix objects for XGBoost
train <- list(sparse.model.matrix(~., data = df_train %>% select(-vs)), (df_train$vs %>% as.factor()))
names(train) <- c("data", "label")
test <- list(sparse.model.matrix(~., data = df_test %>% select(-vs)), (df_test$vs %>% as.factor()))
names(test) <- c("data", "label")
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
### Train model
watchlist <- list(train=dtrain, test=dtest)
mod_xgb_tree <- xgb.train(data = dtrain, booster = "gbtree", eta = .1, nrounds = 15, watchlist = watchlist)
### Check prediction works
output <- predict(mod_xgb_tree, test$data) %>% tibble()
### attempt lime explanation
explainer <- df_train %>% select(-vs) %>% lime(model = mod_xgb_tree) ### works, no error or warning
explanation <- df_test %>% select(-vs) %>% explain(explainer, n_features = 4) ### error, Features stored names in `object` and `newdata` are different!
names_test <- test$data@Dimnames[[2]] ### 10 names
names_mod <- mod_xgb_tree$feature_names ### 11 names
names_explainer <- explainer$feature_type %>% enframe() %>% pull(name) ### 11 names
### see whether pre-processing helps
my_preprocess <- function(df){
  data <- df %>% select(-vs)
  label <- df$vs
  test <<- list(sparse.model.matrix( ~ ., data = data), label)
  names(test) <<- c("data", "label")
  dtest <- xgb.DMatrix(data = test$data, label=test$label)
  dtest
}
explanation <- df_test %>% explain(explainer, preprocess = my_preprocess(), n_features = 4) ### Error in feature_distribution[[i]] : subscript out of bounds
### check that the preprocessing is working ok
dtest_check <- df_test %>% my_preprocess()
output_check <- predict(mod_xgb_tree, dtest_check)
I assume the problem is that the explainer only has the names of the original predictor columns, whereas the test data in its transformed state also has an (Intercept) column. I just haven't figured out a neat way of preventing this. Any help would be much appreciated; I assume there must be a neat solution.
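To make the suspected mismatch concrete, here is a minimal sketch of where the extra column comes from. It uses base R's model.matrix() as a stand-in for sparse.model.matrix(), on the assumption that both treat the formula the same way. The names predictors, with_intercept and without_intercept are just for illustration. Whether dropping the intercept is enough to make explain() work is not something this sketch tests.

```r
# Illustration only: base R model.matrix() stands in for
# Matrix::sparse.model.matrix() (assumed to handle the formula identically).
predictors <- mtcars[, setdiff(names(mtcars), "vs")]

# `~ .` adds an "(Intercept)" column in front of the predictors.
with_intercept <- model.matrix(~ ., data = predictors)

# `~ . - 1` drops the intercept, leaving only the original predictor columns
# (all predictors in mtcars are numeric, so no dummy columns are created).
without_intercept <- model.matrix(~ . - 1, data = predictors)

# Columns present in the design matrix but not in the raw data frame.
setdiff(colnames(with_intercept), names(predictors))

# Column names of the intercept-free matrix compared with the data frame.
identical(colnames(without_intercept), names(predictors))
```

If the model were trained on a matrix built without the intercept, its feature names would at least line up with the plain predictor columns that the explainer sees.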