
I'm trying to use LIME to explain a binary classification model that I've trained using XGBoost. I run into an error when calling the explain() function from lime, which suggests that the feature names in my model (or explainer) don't match those in the new data I'm trying to explain predictions for.

This vignette for LIME does demonstrate a version with xgboost; however, it's a text problem, which is a little different from my tabular data. This question seems to hit the same error, but for a document-term matrix, which obscures the solution for my case. I've worked up a minimal example with mtcars that produces exactly the same errors I get on my own, larger dataset.

library(pacman)
p_load(tidyverse)
p_load(xgboost)
p_load(Matrix)
p_load(lime)

### Prepare data with partition
df <- mtcars %>% rownames_to_column()
length <- df %>% nrow()
df_train <- df %>% select(-rowname) %>% head((length-10))
df_test <- df %>% select(-rowname) %>% tail(10)

### Transform data into matrix objects for XGboost
train <- list(sparse.model.matrix(~., data = df_train %>% select(-vs)), (df_train$vs %>% as.factor()))
names(train) <- c("data", "label")
test <- list(sparse.model.matrix(~., data = df_test %>% select(-vs)), (df_test$vs %>% as.factor()))
names(test) <- c("data", "label")
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)


### Train model
watchlist <- list(train=dtrain, test=dtest)
mod_xgb_tree <- xgb.train(data = dtrain,  booster = "gbtree", eta = .1, nrounds = 15, watchlist = watchlist)

### Check prediction works
output <- predict(mod_xgb_tree, test$data) %>% tibble()

### attempt lime explanation
explainer <- df_train %>% select(-vs) %>% lime(model = mod_xgb_tree)  ### works, no error or warning
explanation <- df_test %>% select(-vs) %>% explain(explainer, n_features = 4) ### error: Feature names stored in `object` and `newdata` are different!

names_test <- test$data@Dimnames[[2]]  ### 10 names
names_mod <- mod_xgb_tree$feature_names ### 11 names
names_explainer <- explainer$feature_type %>% enframe() %>% pull(name) ### 11 names
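### diagnostic sketch only: compare the stored name vectors directly with base R
setdiff(names_mod, names_explainer)  ### in the model but not the explainer
setdiff(names_mod, names_test)       ### in the model but not the test matrix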


### see whether pre-processing helps
my_preprocess <- function(df){
  data <- df %>% select(-vs)
  label <- df$vs

  test <<- list(sparse.model.matrix( ~ ., data = data), label)
  names(test) <<- c("data", "label")

  dtest <- xgb.DMatrix(data = test$data, label=test$label)
  dtest
}

explanation <- df_test %>% explain(explainer, preprocess = my_preprocess(), n_features = 4) ### Error in feature_distribution[[i]] : subscript out of bounds

### check that the preprocessing is working ok
dtest_check <- df_test %>% my_preprocess()
output_check <- predict(mod_xgb_tree, dtest_check)

I assume the problem is that the explainer only has the names of the original predictor columns, whereas the test data in its transformed state also has an (Intercept) column. I just haven't figured out a neat way of preventing this from occurring, though I assume there must be one. Any help would be much appreciated.

Aidan Morrison

4 Answers


I had the same problem, except my columns weren't in alphabetical order. To fix it, I reordered the columns of df_test to match df_train, so that the column names appeared in the same order in both.

Create an index of df_test columns in the same order as df_train:

    idx <- match(colnames(df_train), colnames(df_test))

Create a new df_test using this column order:

    df_test_match <- df_test[,idx]
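
Equivalently, assuming df_test contains all of the df_train columns, indexing by name does the reordering in one step:

    df_test_match <- df_test[, colnames(df_train)]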
Rachel

If you look at the xgboost source on this page (https://rdrr.io/cran/xgboost/src/R/xgb.Booster.R), you can see where the error message "Feature names stored in `object` and `newdata` are different!" comes from.

Here is the code from this page related to the error message:

predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FALSE,
                                ntreelimit = NULL, predleaf = FALSE, predcontrib = FALSE,
                                approxcontrib = FALSE, predinteraction = FALSE,
                                reshape = FALSE, ...) {

  object <- xgb.Booster.complete(object, saveraw = FALSE)
  if (!inherits(newdata, "xgb.DMatrix"))
    newdata <- xgb.DMatrix(newdata, missing = missing)
  if (!is.null(object[["feature_names"]]) &&
      !is.null(colnames(newdata)) &&
      !identical(object[["feature_names"]], colnames(newdata)))
    stop("Feature names stored in `object` and `newdata` are different!")
  # ... rest of the function ...
}

The check is identical(object[["feature_names"]], colnames(newdata)): if the feature names stored in object (i.e. your model, built from your training set) are not identical to the column names of newdata (i.e. your test set), you get the error message.
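
Note that identical() also returns FALSE when the same names merely appear in a different order, for example:

identical(c("mpg", "cyl", "disp"), c("cyl", "mpg", "disp"))  # FALSE: same names, different order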

For more details:

train_matrix <- xgb.DMatrix(as.matrix(training %>% select(-target)), label = training$target, missing = NaN)
object <- xgb.train(data=train_matrix, params=..., nthread=2, nrounds=..., prediction = T)
newdata <- xgb.DMatrix(as.matrix(test %>% select(-target)), missing = NaN)

If you build object and newdata from your own data as in the code above, you can probably fix the issue by looking at the differences between object[["feature_names"]] and colnames(newdata). Most likely some columns are missing or don't appear in the same order.
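
For instance, a minimal sketch of that comparison, using the object and newdata built above (colnames() works on an xgb.DMatrix, as the check in the source relies on):

setdiff(object[["feature_names"]], colnames(newdata))  # expected by the model but missing from newdata
setdiff(colnames(newdata), object[["feature_names"]])  # present in newdata but unknown to the model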

guiotan

Try this on your new dataset:

    colnames(test) <- make.names(colnames(test))
    newdataset <- test %>% mutate_all(as.numeric)
    newdataset <- as.matrix(newdataset)
    nwtest <- xgb.DMatrix(newdataset)
  • I have trouble with the first line. Which object should I call `colnames()<-` on? If I do it on the dataframe `df_test` I get a bunch of NAs for colnames, and can't proceed. – Aidan Morrison Jun 22 '20 at 13:31

To prevent the (Intercept) column from showing up, you need to change your code slightly when creating the sparse matrix for your test data. Change the line:

test <- list(sparse.model.matrix( ~ ., data = data), label)

to:

test <- list(sparse.model.matrix( ~ .-1, data = data), label)
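
Presumably the training matrix needs the same -1, so that the feature names stored in the model also omit (Intercept); a sketch of the corresponding change to the training code from the question:

train <- list(sparse.model.matrix(~ . - 1, data = df_train %>% select(-vs)), (df_train$vs %>% as.factor()))
names(train) <- c("data", "label")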

Hope this helps

Aypak