
I am unable to reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/) as detailed here (Error when extracting variable importance with FeatureImp$new and H2O). Can anyone point to a workaround or other examples of using iml with h2o?


Reproducible example:

library(rsample)   # data splitting
library(ggplot2)   # allows extension of visualizations
library(dplyr)     # basic data transformation
library(h2o)       # machine learning modeling
library(iml)       # ML interpretation
library(modeldata) # attrition data


# initialize h2o session
h2o.no_progress()
h2o.init()

# classification data
data("attrition", package = "modeldata")
df <- attrition %>% 
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))

# convert to h2o object
df.h2o <- as.h2o(df)

# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15),
                         destination_frames = c("train", "valid", "test"))
names(splits) <- c("train","valid","test")

# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y) 

# elastic net model 
glm <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
  )

# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)

# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))

# 3. Create custom predict function that returns the predicted values as a
#    vector (probability of attrition in our example)
pred <- function(model, newdata)  {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}
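
As a quick sanity check (optional, assuming the glm model and features data frame above), the custom function should return a plain numeric vector of probabilities:

# should print a numeric vector of predicted probabilities, one per row of features
head(pred(glm, features))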

# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm, 
  data = features, 
  y = response, 
  predict.fun = pred,
  class = "classification"
  )

imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")

Error obtained:

Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns 
selected

traceback()

1. FeatureImp$new(predictor.glm, loss = "mse")

2. .subset2(public_bind_env, "initialize")(...)

3. private$run.prediction(private$sampler$X)

4. self$predictor$predict(data.frame(dataDesign))

5. prediction[, self$class, drop = FALSE]

6. `[.data.frame`(prediction, , self$class, drop = FALSE)

7. stop("undefined columns selected")
Zoe

1 Answer


In the iml package documentation, the class argument is described as "the class column to be returned". When you set class = "classification", iml looks for a column called "classification", which does not exist. At least on GitHub, it looks like the iml package has gone through a fair amount of development since that blog post, so I imagine some functionality is no longer backwards compatible.

After reading through the package documentation, I think you might want to try something like:

predictor.glm <- Predictor$new(
  model = glm, 
  data = features, 
  y = "Attrition",
  predict.function = pred,
  type = "prob"
  )

# check ability to predict first
check <- predictor.glm$predict(features)
print(check)
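
If that check returns sensible probabilities, a sketch of the next step (as the comment below notes, a classification loss such as "f1" works here, whereas "mse" is meant for regression):

# permutation feature importance with a classification-appropriate loss
imp.glm <- FeatureImp$new(predictor.glm, loss = "f1")
plot(imp.glm)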

Even better might be to leverage H2O's extensive functionality around machine learning interpretability.

h2o.varimp(glm) returns the variable importance for each feature.

h2o.varimp_plot(glm, 10) renders a plot of the relative importance of the most important features.

h2o.explain(glm, as.h2o(features)) is a wrapper for the explainability interface and by default provides the confusion matrix (in this case) as well as variable importance and partial dependence plots for each feature.

For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() will provide the SHAP contributions.
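
A minimal sketch of that H2O-native workflow, reusing the glm model and splits from the question (note that the frame passed to h2o.explain here is the validation frame, which still contains the Attrition column needed for the confusion matrix):

# variable importance table and plot for the fitted GLM
h2o.varimp(glm)
h2o.varimp_plot(glm, 10)

# explainability wrapper: confusion matrix, variable importance,
# and partial dependence plots
h2o.explain(glm, splits$valid)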

The H2O-3 docs might be useful for exploring these functions further.

TheFon
  • Also H2O-3 supports permutation variable importance (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/permutation-variable-importance.html) in recent versions. One thing to add to this answer is to use a different loss than "mse" for the classification task (e.g. `FeatureImp$new(predictor.glm, loss = "f1")` works for me). – Tomáš Frýda Feb 18 '22 at 13:44
  • See also answer at: https://stackoverflow.com/questions/72930868/errors-in-the-tutorial-interpreting-machine-learning-models-with-the-iml-packag – tomek Jul 28 '22 at 15:06