
I'm looking for advice on creating classification trees where each split is based on multiple variables. A bit of background: I'm helping design a vegetation classification system, and we're hoping to use a classification and regression tree algorithm both to classify new vegetation data and to create (or at least help create) visual keys that can be used in publications. The data are laid out as a community matrix: tree species as columns, observations as rows, and the first column a factor giving the class. I'll also add that I'm very new to this type of analysis, and while I've tried to read about it as much as possible, it's quite likely that I've missed some simple but important aspects. My apologies.

Now the problem: R has excellent packages and great documentation for classification with univariate splits (e.g. rpart, partykit, C5.0). However, I would ideally like to be able to create classification trees where each split is based on multiple criteria - so instead of each split having one condition (e.g. "Percent cover of Species A > 6.67"), it would have several (Percent cover of Species A > 6.67 AND Percent cover of Species B < 4.2). I've had a lot of trouble finding packages that can do multivariate splits and build trees. This answer: https://stats.stackexchange.com/questions/4356/does-rpart-use-multivariate-splits-by-default has been very useful, and I've tried all the packages suggested there for multivariate splitting. prim does do multivariate splits, but doesn't seem to build trees; the partDSA package seems closest to what I'm looking for, but it also only creates trees with one criterion per split; the optpart package also doesn't seem able to build classification trees. If anyone has advice on how I could go about making a classification tree based on a multivariate partitioning method, that would be super appreciated.
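For reference, this is the kind of univariate-split tree I can already build with rpart; `veg_data` and its column names are hypothetical placeholders for my community matrix (first column `class`, remaining columns percent cover per species):

```r
library(rpart)

# veg_data: first column "class" is a factor of vegetation types,
# remaining columns are percent cover of each tree species
fit <- rpart(class ~ ., data = veg_data, method = "class")

# Each node of the printed tree tests a single species threshold,
# e.g. "SpeciesA >= 6.67" -- one variable per split, which is
# exactly the limitation I'd like to get around
print(fit)
plot(fit)
text(fit, use.n = TRUE)
```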

Also, this is my first question, and I am very open to suggestions about how to ask questions. I didn't feel that providing an example would be helpful in this case, but if necessary I easily can.
Many Thanks!

Kiri Daust
    I think random forests should probably be sufficient for your use case. You don't need splits on multiple criteria to build a meaningful/accurate model using decision trees. Look into the `randomForest` package. – Tim Biegeleisen Jan 17 '18 at 01:49
  • While there is a lot of research on multivariate trees (see ref), I haven't seen any published R or Python implementations. It's a pretty big assumption that you need multivariate DTs for your problem, and given the lack of package support for such an approach it's almost guaranteed your challenges are better solved using existing libraries. Even RF has some tricks for making them more interpretable, if that is your issue. Ref: http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/survey/node11.html – Cybernetic Jan 17 '18 at 02:36
  • What you are talking about defeats the purpose of a decision tree. They are designed to reduce the most entropy in a single division, which makes deciding which variable to split on simple and straightforward, reducing the complexity of the model. It is meant to make predicting from them quicker and more definitive. – sconfluentus Jan 17 '18 at 04:30
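A minimal sketch of the random-forest route suggested in the comments above; as before, `veg_data` is a hypothetical stand-in for the asker's community matrix with a factor column `class`:

```r
library(randomForest)

# class ~ . : predict the vegetation class from all species columns
rf <- randomForest(class ~ ., data = veg_data,
                   ntree = 500, importance = TRUE)

print(rf)       # out-of-bag error rate and confusion matrix
varImpPlot(rf)  # which species contribute most to the classification
```

A forest trades the single publishable tree for an ensemble, so it classifies well but cannot be drawn as one visual key; variable-importance plots are the usual interpretability substitute.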

0 Answers