I get the following error when running the code in the section titled "Replication requirements" of the tutorial at https://uc-r.github.io/iml-pkg:

# classification data
df <- rsample::attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))

> Error: 'attrition' is not an exported object from 'namespace:rsample'

The problem was solved using the following code:

# data
library(modeldata)
data("attrition", package = "modeldata")
# classification data
df <- attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))

Unfortunately, I got another error when running the following code from the section titled "Global interpretation/Feature importance" (https://uc-r.github.io/iml-pkg):

# compute feature importance with specified loss metric
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
imp.rf <- FeatureImp$new(predictor.rf, loss = "mse")
imp.gbm <- FeatureImp$new(predictor.gbm, loss = "mse")

> Error in `[.data.frame`(prediction, , self$class, drop = FALSE) : undefined columns selected

> Error in `[.data.frame`(prediction, , self$class, drop = FALSE) : undefined columns selected

> Error in `[.data.frame`(prediction, , self$class, drop = FALSE) : undefined columns selected

I am using R 4.2.0 on Windows 10.

tomek
  • It is possible that the link has some typos or errors, judging by the first error (or possibly it worked in an earlier version of the package). – akrun Jul 10 '22 at 18:24
  • Just to clarify my earlier comment: if you check `?attrition` from `rsample`, there is a line stating `These data are now in the modeldata package`. So the information in the link may be old enough to result in errors. – akrun Jul 10 '22 at 18:37
  • Compared to the example at https://rdrr.io/cran/iml/man/FeatureImp.html the code seems to be correct – tomek Jul 10 '22 at 18:38
  • It is possible that minor behavior changes in the functions caused this issue. As mentioned in the comment above, this link could be old. – akrun Jul 10 '22 at 18:39
  • Yes, this is old; there is package and session info at the end of the tutorial page. – tomek Jul 10 '22 at 18:52
  • The "rsample" package allows pulling the "attrition" data from the "modeldata" package; see https://cran.r-project.org/web/packages/rsample/vignettes/Working_with_rsets.html – tomek Jul 10 '22 at 19:55
  • Did anybody find a solution for this error? Error in `[.data.frame`(prediction, , self$class, drop = FALSE) : undefined columns selected – bvowe Jul 25 '22 at 19:20

2 Answers

The arguments shown in the tutorial need to be altered slightly: instead of class = "classification", use class = 2 (per the docs), and the example works as expected:

library(rsample)   # data splitting
library(ggplot2)   # allows extension of visualizations
library(dplyr)     # basic data transformation
library(h2o)       # machine learning modeling
#install.packages("iml")
library(iml)       # ML interpretation
#install.packages("modeldata")
library(modeldata)
library(R6)

h2o.no_progress()
h2o.init()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         9 minutes 18 seconds 
#>     H2O cluster timezone:       Australia/Melbourne 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.36.0.1 
#>     H2O cluster version age:    6 months and 28 days !!! 
#>     H2O cluster name:           H2O_started_from_R_jared_mpb432 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   1.58 GB 
#>     H2O cluster total cores:    4 
#>     H2O cluster allowed cores:  4 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     H2O API Extensions:         Amazon S3, XGBoost, Algos, Infogram, AutoML, Core V3, TargetEncoder, Core V4 
#>     R Version:                  R version 4.1.3 (2022-03-10)

df <- modeldata::attrition %>% 
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>%
           factor(levels = c("1", "0")))

# convert to h2o object
df.h2o <- as.h2o(df)

# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15), destination_frames = c("train","valid","test"))
names(splits) <- c("train","valid","test")

# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y) 

# elastic net model 
glm <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
)

# random forest model
rf <- h2o.randomForest(
  x = x, 
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123
)
#> Warning in .h2o.processResponseWarnings(res): early stopping is enabled but neither score_tree_interval or score_each_iteration are defined. Early stopping will not be reproducible!.

# gradient boosting machine model
gbm <-  h2o.gbm(
  x = x, 
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123
)
#> Warning in .h2o.processResponseWarnings(res): early stopping is enabled but neither score_tree_interval or score_each_iteration are defined. Early stopping will not be reproducible!.

# model performance
h2o.auc(glm, valid = TRUE)
#> [1] 0.7870935
h2o.auc(rf, valid = TRUE)
#> [1] 0.7681021
h2o.auc(gbm, valid = TRUE)
#> [1] 0.7468242

# 1. Create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)

# 2. Create a vector with the actual responses
response <- as.vector(as.numeric(splits$valid$Attrition))

# 3. Create a custom predict function that returns the predicted values as a
#    vector (probability of attrition in this example)
pred <- function(model, newdata)  {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}

# example of prediction output
pred(glm, features) %>% head()
#> [1] 0.12243347 0.12887908 0.09674399 0.26008143 0.00672000 0.13741387

predictor.glm <- Predictor$new(
  model = glm, 
  data = features, 
  y = response, 
  predict.fun = pred,
  class = "classification"
)
predictor.glm$predict(features[1:10,])
#> Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns selected
# class = "classification" doesn't make sense; from the docs:
### The class column to be returned in case of multiclass output.
### You can either use numbers, e.g. class=2 would take the 2nd column
### from the predictions, or the column name of the predicted class,
### e.g. class="dog".
# so, in this case, 'class = 2' should work as expected

predictor.glm <- Predictor$new(
  model = glm, 
  data = features,
  y = response,
  predict.function = pred,
  class = 2
)
predictor.glm$predict(features[1:10,])
#>            p1
#> 1  0.12243347
#> 2  0.12887908
#> 3  0.09674399
#> 4  0.26008143
#> 5  0.00672000
#> 6  0.13741387
#> 7  0.47917917
#> 8  0.11775822
#> 9  0.11316964
#> 10 0.22963757

predictor.rf <- Predictor$new(
  model = rf, 
  data = features, 
  y = response, 
  predict.fun = pred,
  class = 2
)

predictor.gbm <- Predictor$new(
  model = gbm, 
  data = features, 
  y = response, 
  predict.fun = pred,
  class = 2
)

imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
imp.rf <- FeatureImp$new(predictor.rf, loss = "mse")
imp.gbm <- FeatureImp$new(predictor.gbm, loss = "mse")

p1 <- plot(imp.glm) + ggtitle("GLM")
p2 <- plot(imp.rf) + ggtitle("RF")
p3 <- plot(imp.gbm) + ggtitle("GBM")

#gridExtra::grid.arrange(p1, p2, p3, nrow = 1)
p1

p2

p3

Created on 2022-07-28 by the reprex package (v2.0.1)
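
The error message itself can be reproduced without h2o or iml. The sketch below assumes the prediction is a data frame with two probability columns named p0 and p1 (the p1 column in the output above suggests this, but the toy values and the variable names here are made up for illustration). Selecting a column called "classification" fails because no such column exists, while selecting by position works:

```r
# Toy stand-in for a two-column probability output (p0, p1); values are made up.
prediction <- data.frame(p0 = c(0.88, 0.74), p1 = c(0.12, 0.26))

# Selecting by a name that is not a column raises the same error message.
err <- tryCatch(
  prediction[, "classification", drop = FALSE],
  error = function(e) conditionMessage(e)
)
print(err)

# Selecting by position (or by an existing column name) works.
selected <- prediction[, 2, drop = FALSE]
print(selected)
```

This is only an illustration of the indexing step named in the error; it does not call iml itself.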

jared_mamrot
  • The proposed solution works fine and the code executes, but every feature seems to be important; the results differ from those presented in the tutorial. – tomek Jul 27 '22 at 20:15
  • It was difficult to see the results in the previous version of the code because they were all 'squashed together'; I've updated the code to print each plot individually and changed `class = 1` to `class = 2` to get the same orientation as in the tutorial @tomek – jared_mamrot Jul 27 '22 at 23:35
  • I found that a solution to this question had been posted earlier. See: https://stackoverflow.com/questions/69930234/overcoming-compatibility-issues-with-using-iml-from-h2o-models – tomek Jul 28 '22 at 15:03

You can calculate the variable importance for your glm model (just choosing one for the example) with the h2o package as follows:

h2o::h2o.varimp(glm)

Example output:

[image: example output]

Does this achieve what you wanted?

Note: I'm assuming you've run all the code up to that point in the link you provided, i.e. that you have created the glm model object using the code provided in the link.
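
Note that this is not the same quantity the tutorial plots. As I understand it, h2o.varimp() reports importances computed by the model itself, whereas iml's FeatureImp is permutation-based: it measures how much the prediction error grows when a feature is shuffled. Below is a minimal base-R sketch of that permutation idea, not iml's implementation. An lm fit on the built-in mtcars data stands in for the h2o models, and the names mse, permutation_importance, and feature_names are hypothetical helpers introduced only for this illustration:

```r
# Permutation feature importance, sketched in base R on the built-in mtcars data.
set.seed(123)

# Stand-in model (assumption: any model with a predict() method would do).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Mean squared error, the loss used in the question (loss = "mse").
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Error of the model on the unshuffled data.
baseline <- mse(mtcars$mpg, predict(fit, mtcars))

# Shuffle one feature, recompute the error, and report the average
# ratio of shuffled error to baseline error over several repetitions.
permutation_importance <- function(feature, n_repetitions = 20) {
  ratios <- replicate(n_repetitions, {
    shuffled <- mtcars
    shuffled[[feature]] <- sample(shuffled[[feature]])
    mse(mtcars$mpg, predict(fit, shuffled)) / baseline
  })
  mean(ratios)
}

feature_names <- c("wt", "hp", "qsec")
importance <- sort(
  vapply(feature_names, permutation_importance, numeric(1)),
  decreasing = TRUE
)
print(importance)
```

A ratio near 1 means shuffling the feature barely changes the error; larger ratios indicate features the model relies on more.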

statnet22
  • I used the randomForest and iml packages to successfully depict feature interactions for the wine dataset (not using this tutorial). But here, the question is why I cannot successfully execute the next pieces of code using the iml package (the aim of this tutorial is to calculate and visualize feature interactions with the H-statistic). The code (interact.glm <- Interaction$new(predictor.glm) %>% plot() + ggtitle("GLM")) from the section "Measuring interactions" of this tutorial generates this error again. – tomek Jul 26 '22 at 18:31
  • Interactions in H2O are documented: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/interactions.html – tomek Jul 26 '22 at 20:15