I'm constrained by the memory footprint/size of my Random Forest model, so I would prefer the number of trees to be as low as possible, and the trees as shallow as possible, while minimizing any impact on performance. Rather than setting up hyperparameter tuning to optimize for this, I am wondering whether I can just build one large Random Forest composed of many deep trees. From this, can I then estimate the performance of hypothetical smaller models enclosed within it, and save myself the time of hyperparameter tuning? (Again, I'm only looking to tune those parameters that generally just need to be "big enough" for the data/problem.)
For example, if I build a model with 1500 trees, could I extract 500 of them and predict from just that subset to estimate the performance of using only 500 trees? If I do this repeatedly, each time evaluating on a holdout set, I figure this should give an estimate of the performance of building a model with 500 trees, unless I'm missing something. Something like the sketch below is what I have in mind. I should be able to do this similarly with maximum tree depth or minimum node size, correct?
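Concretely, this minimal sketch is what I'm picturing (simulated data just to show the mechanics; if I read the docs right, `predict.ranger()` has a `num.trees` argument that restricts prediction to the first `num.trees` trees of the fitted forest):

```r
library(ranger)

## Simulated stand-in data; 'y' is numeric, so this is regression.
set.seed(1)
dat     <- data.frame(y = rnorm(800), x1 = rnorm(800), x2 = rnorm(800))
train   <- dat[1:600, ]
holdout <- dat[601:800, ]

## Fit one large forest once.
rf_big <- ranger(y ~ ., data = train, num.trees = 1500)

## Predict on the holdout set using only the first 500 trees;
## since the trees are grown independently, the first 500 should
## behave like a forest that was fit with num.trees = 500.
p500 <- predict(rf_big, data = holdout, num.trees = 500)$predictions
sqrt(mean((p500 - holdout$y)^2))  # holdout RMSE of the "500-tree" model
```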
How would I do this in R on a ranger model?
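For the repeated-subsets version, the closest thing I've found is `predict.all = TRUE`, which for regression appears to return an n-by-num.trees matrix of per-tree predictions that I can average over random column subsets myself. A sketch under that assumption, reusing `rf_big` and `holdout` from above:

```r
## Per-tree holdout predictions: one column per tree (1500 here).
all_preds <- predict(rf_big, data = holdout, predict.all = TRUE)$predictions

## Average random 500-tree subsets to mimic refitting many
## independent 500-tree forests, scoring each on the holdout set.
rmse_500 <- replicate(25, {
  trees <- sample(ncol(all_preds), 500)
  sqrt(mean((rowMeans(all_preds[, trees]) - holdout$y)^2))
})
mean(rmse_500)  # estimated holdout RMSE of a 500-tree model
```

(I don't immediately see how the same post-hoc trick would work for max.depth or min.node.size, since those change how the trees are grown in the first place, so that part especially needs verification.)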
(Would appreciate any examples; parsnip examples would be a bonus. Guidance/verification that this is a reasonable way to avoid hyperparameter tuning of Random Forest models, for those hyperparameters that simply need to be "big"/"deep" enough, would also be helpful.)
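In case it helps, here is as far as I've gotten on the parsnip side. I'm assuming `extract_fit_engine()` hands back the raw ranger object, so the same `predict()` trick would apply:

```r
library(parsnip)

## Same idea via parsnip: fit once with many trees ...
spec <- rand_forest(trees = 1500) |>
  set_engine("ranger") |>
  set_mode("regression")

pfit <- fit(spec, y ~ ., data = train)

## ... then drop down to the underlying ranger object and
## subset trees at prediction time as before.
rf_engine <- extract_fit_engine(pfit)
p500 <- predict(rf_engine, data = holdout, num.trees = 500)$predictions
```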