Good afternoon, all--thank you in advance for your help! I'm somewhat new to R, so my apologies if this is a trivial or otherwise inappropriate question.

TL;DR: I'm trying to determine Variable Importance (VIM) for factor variables with a random forest model built in RandomForestSRC, which doesn't appear to be a built-in feature of that package. Using both the LIME and DALEX packages, I encounter the same error: cannot coerce class 'c("rfsrc", "predict", "class")' to a data.frame. Any assistance resolving this error, or alternate approaches, would be greatly appreciated!

I have a random forest model I've built in R, using the RandomForestSRC package. The model seems to work great--training and testing went fine, I got the predicted output I needed, and the results seem in line with what I would expect. Unfortunately, one of the requirements is that I need to be able to indicate how the model arrived at its conclusions (i.e., I need to include variable importance as part of the output), for both continuous and factor variables.

This doesn't seem to be a built-in feature of the RandomForestSRC package, so I've looked into both the LIME and DALEX packages, both of which should be able to break out VIM from the existing RF model. Unfortunately, neither has native support for the RFSRC package, which means I've needed to write the prediction functions myself, as recommended by this vignette: https://uc-r.github.io/dalex

model_type.rfsrc <- function (x, ...) {
    return ('classification')
}

predict_model.rfsrc <- function (x, newdata, type, ...) {
    as.data.frame(predict(x, newdata, ...))
}

Unfortunately, in running the VIM section of the model (in both LIME and DALEX), I'm asked to pass both the predicted output and the model that created that output. In doing so, it hits an error with the above predict_model function:

Error in as.data.frame.default(predict(model, (newdata))):
cannot coerce class 'c("rfsrc", "predict", "class")' to a data.frame

And, like...of course it can't; as far as I can tell, it's trying to turn the whole object returned by predict() (class "rfsrc", "predict", "class") into a data frame. Unfortunately, while I think I understand why R is giving me that error, that's about as far as I've been able to figure out on my own.
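To make the failure concrete, here is a minimal sketch that reproduces the message without the package. Going by the class vector in the error, the object being coerced appears to be what predict() returns. The `mock_pred` object below is a hypothetical stand-in for it, and the sketch assumes the class probabilities sit in a matrix element named `predicted`. The adjusted wrapper at the end rests on the same assumption and is untested against a real forest.

```r
# Hypothetical stand-in for the object predict() returns on an rfsrc
# classification forest: a list with the class vector from the error message.
# Assumption: class probabilities are stored in a matrix called `predicted`.
mock_pred <- structure(
  list(predicted = matrix(c(0.9, 0.2, 0.1, 0.8), nrow = 2,
                          dimnames = list(NULL, c("no", "yes")))),
  class = c("rfsrc", "predict", "class")
)

# Coercing the whole object falls through to as.data.frame.default(), which
# stops with an error; capture the message instead of halting.
whole <- tryCatch(as.data.frame(mock_pred),
                  error = function(e) conditionMessage(e))

# Coercing only the probability matrix works: one row per observation,
# one column per class.
probs <- as.data.frame(mock_pred$predicted)

# Wrapper adjusted on the same assumption: pull out `predicted` before
# converting. Defined here only; it is not called in this sketch.
predict_model.rfsrc <- function(x, newdata, type, ...) {
  pred <- predict(x, newdata, ...)
  as.data.frame(pred$predicted)
}
```

If that assumption holds for the real prediction object, calling str() on the result of predict() should show the `predicted` matrix along with whatever else is stored there.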

Additionally, I'm using the RandomForestSRC package for two reasons: it doesn't put a limit on the number of factor levels, and it can handle imbalanced data. I'm working with medical data, so both of these are necessary (e.g., there are ~100,000 different medical codes that can be encoded in a single data variable, and the ratio of "people-who-don't-have-this-condition" vs "people-who-do-have-this-condition" is frequently 100 to 1). If anyone has any suggestions for alternative packages that handle these issues, though, and have built-in VIM functionality (or integrate with DALEX / LIME), that would be fantastic as well.

Thank you all very much for your help!

Sinval