
I am pretty new to machine learning, and I've stumbled upon an issue and can't seem to find a solution no matter how hard I google.

I have performed a multiclass classification procedure using the randomForest algorithm and found a model that offers adequate prediction of my test sample. I then used varImpPlot() to determine which predictors are most important in determining the class assignments.

My problem: I would like to know why those predictors are most important. Specifically, I would like to be able to report that cases that fall into Class X hold Characteristics A (e.g., are male), B (e.g., are older), and C (e.g., have high IQ), while cases that fall into Class Y hold Characteristics D (female), E (younger), and F (low IQ), and so on for the rest of my classes.

I know that standard binary logistic regression allows you to say that cases with high values on Characteristic A are more likely to fall into Class X, for example. So, I was hoping for something conceptually similar, but for a multiclass random forest classification model.

Can this be done with random forest models? If so, is there a function in randomForest or in caret (or even elsewhere) that can take me beyond the varImpPlot() and varImp() table?

Thanks!

deschampst
  • What you are looking for is the **relative importance of variables**. The output of `varImpPlot()` is the overall variable importance. – Seymour Apr 09 '18 at 21:28
  • Try checking: https://stackoverflow.com/questions/29637145/gbm-r-function-get-variable-importance-separately-for-each-class and https://stackoverflow.com/questions/47609200/how-to-get-different-variable-importance-for-each-class-in-a-binary-h2o-gbm-in-r?noredirect=1&lq=1 Please keep us updated, because this is an important topic for which it is difficult to find an answer. – Seymour Apr 09 '18 at 21:31
  • One possible approximation of the relative importance for each class is to build N one-vs-all models, where N is the number of classes to predict. However, I see this more as a workaround than a truly robust solution to the problem you are facing. – Seymour Apr 09 '18 at 21:34 (a sketch of this workaround appears below)
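A minimal sketch of the one-vs-all workaround described in the last comment, using the randomForest package and the iris data as a stand-in for the asker's own data (the loop and object names are illustrative assumptions, not a canonical recipe):

library(randomForest)

data(iris)
classes <- levels(iris$Species)

# Fit one "this class vs. the rest" forest per class and keep its
# permutation importance (mean decrease in accuracy).
per_class_importance <- lapply(classes, function(cl) {
  y <- factor(ifelse(iris$Species == cl, cl, "rest"))
  rf <- randomForest(x = iris[, 1:4], y = y, importance = TRUE)
  importance(rf, type = 1)
})
names(per_class_importance) <- classes

# Each list element now shows which predictors matter most for separating
# that class from all the others.
per_class_importance

# Note: fitting the original multiclass forest with importance = TRUE and
# calling importance(rf) also returns one mean-decrease-in-accuracy column
# per class, which may already be close to what is being asked for.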

1 Answer


There is a package named ExplainPrediction that promises explanations for random forest models. Here's the top of its DESCRIPTION file; the URL listed there links to an extensive citation list:

Package: ExplainPrediction
Title: Explanation of Predictions for Classification and Regression Models
Version: 1.3.0
Date: 2017-12-27
Author: Marko Robnik-Sikonja
Maintainer: Marko Robnik-Sikonja <marko.robnik@fri.uni-lj.si>
Description: Generates explanations for classification and regression models and visualizes them.
 Explanations are generated for individual predictions as well as for models as a whole. Two explanation methods
 are included, EXPLAIN and IME. The EXPLAIN method is fast but might miss explanations expressed redundantly
 in the model. The IME method is slower as it samples from all feature subsets.
 For the EXPLAIN method see Robnik-Sikonja and Kononenko (2008) <doi:10.1109/TKDE.2007.190734>, 
 and the IME method is described in Strumbelj and Kononenko (2010, JMLR, vol. 11:1-18).
 All models in package 'CORElearn' are natively supported, for other prediction models a wrapper function is provided 
 and illustrated for models from packages 'randomForest', 'nnet', and 'e1071'.
License: GPL-3
URL: http://lkm.fri.uni-lj.si/rmarko/software/
Imports: CORElearn (>= 1.52.0), semiArtificial (>= 2.2.5)
Suggests: nnet, e1071, randomForest
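
A rough usage sketch, assuming explainVis() is the package's entry point and that it accepts a fitted model plus training and test data; the exact arguments, and whether randomForest models need the wrapper mentioned in the DESCRIPTION, should be checked against ?explainVis:

library(ExplainPrediction)
library(randomForest)

data(iris)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Generate EXPLAIN-method explanations at the model and instance level;
# the argument names used here are assumptions based on the package docs.
explainVis(rf, iris, iris, method = "EXPLAIN", visLevel = "both")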

Also:

Package: DALEX
Title: Descriptive mAchine Learning EXplanations
Version: 0.1.1
Authors@R: person("Przemyslaw", "Biecek", email = "przemyslaw.biecek@gmail.com", role = c("aut", "cre"))
Description: Machine Learning (ML) models are widely used and have various applications in classification 
  or regression. Models created with boosting, bagging, stacking or similar techniques are often
  used due to their high performance, but such black-box models usually lack interpretability.
  'DALEX' package contains various explainers that help to understand the link between input variables and model output.
  The single_variable() explainer extracts conditional response of a model as a function of a single selected variable.
  It is a wrapper over packages 'pdp' and 'ALEPlot'.
  The single_prediction() explainer attributes parts of a model prediction to particular variables used in the model.
  It is a wrapper over 'breakDown' package.
  The variable_dropout() explainer assesses variable importance based on consecutive permutations.
  All these explainers can be plotted with generic plot() function and compared across different models.
Depends: R (>= 3.0)
License: GPL
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1.9000
Imports: pdp, ggplot2, ALEPlot, breakDown
Suggests: gbm, randomForest, xgboost
URL: https://pbiecek.github.io/DALEX/
BugReports: https://github.com/pbiecek/DALEX/issues
NeedsCompilation: no
Packaged: 2018-02-28 01:44:36 UTC; pbiecek
Author: Przemyslaw Biecek [aut, cre]
Maintainer: Przemyslaw Biecek <przemyslaw.biecek@gmail.com>
Repository: CRAN
Date/Publication: 2018-02-28 16:36:14 UTC
Built: R 3.4.3; ; 2018-04-03 03:04:04 UTC; unix
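
And a rough DALEX sketch built around the explainers named in the DESCRIPTION above. For a multiclass forest, one workable pattern (an assumption on my part, not something the package mandates) is to build one explainer per class, using the predicted probability of that class as the model output:

library(DALEX)
library(randomForest)

data(iris)
rf <- randomForest(Species ~ ., data = iris)

# Explainer for one class; repeat for each class of interest.
expl_setosa <- explain(rf,
                       data  = iris[, 1:4],
                       y     = as.numeric(iris$Species == "setosa"),
                       predict_function = function(m, x) predict(m, x, type = "prob")[, "setosa"],
                       label = "rf: setosa vs rest")

# Model-level: permutation-based variable importance for this class.
plot(variable_dropout(expl_setosa))

# Model-level: how the predicted probability changes along one variable.
plot(single_variable(expl_setosa, variable = "Petal.Length", type = "pdp"))

# Instance-level: which variables pushed one particular case toward this class.
plot(single_prediction(expl_setosa, observation = iris[1, 1:4]))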
IRTFM