
I have an sklearn random forest regressor. It is very heavy, 1.6 GB, and takes a very long time to predict values.

I want to prune it to make it lighter. As far as I know, pruning is not implemented for decision trees and forests. I can't implement it myself, since the tree code is written in C and I don't know that language.

Does anyone know the solution?

hvedrung
  • I think that you should limit the size of the trees (max leaf nodes, max depth, min samples split...) – mcane Jul 24 '15 at 13:08
  • 3
    http://stackoverflow.com/questions/7830255/suggestions-for-speeding-up-random-forests – invoketheshell Jul 24 '15 at 13:47
  • invoketheshell, thank you for the link. The main idea there is to use the parallelized state of the forest so that all CPU cores are used. That is already done in my case. – hvedrung Jul 24 '15 at 14:53

2 Answers


Limiting the size of the trees can be a solution for you. Try constraining the trees in the forest (max leaf nodes, max depth, min samples split...).
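
A minimal sketch of what that could look like when retraining with scikit-learn's RandomForestRegressor (the parameter values below are illustrative placeholders, not tuned settings):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the real training set.
X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

# Illustrative limits only: smaller, shallower trees give a much smaller
# model on disk and faster predictions, usually at some cost in accuracy.
rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=20,           # cap the depth of each tree
    max_leaf_nodes=10000,   # cap the number of leaves per tree
    min_samples_split=10,   # require more samples before a node may split
    min_samples_leaf=5,     # require more samples in every leaf
    n_jobs=-1,              # use all CPU cores for fitting and prediction
)
rf.fit(X, y)
```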

mcane
  • This means rebuilding the regressor. It was a lengthy process to select the parameters, so I want to modify the existing regressor if possible. – hvedrung Jul 24 '15 at 14:42
  • 1
    Random forest classifier (in theory) needs to run all tree classifiers and their voting produces the final decision. @invoketheshell suggested you to parallelise the problem and this is the only option if you do not like to touch the classifier (and prune the trees) at all. Throw more hardware into the problem it could save your time ;). – mcane Jul 27 '15 at 05:50

You could try ensemble pruning. This boils down to removing from your random forest a number of the decision trees that make it up.

If you remove trees at random, the expected outcome is that the performance of the ensemble will gradually deteriorate as more trees are removed. However, you can do something more clever: remove the trees whose predictions are most highly correlated with the predictions of the rest of the ensemble, and which therefore do not significantly change the output of the whole ensemble.
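
One hedged sketch of that correlation-based idea for a fitted scikit-learn forest (the function name, X_val and n_drop are hypothetical; overwriting estimators_ and n_estimators on a fitted forest is not an official API, so check that the pruned forest still behaves as expected for your scikit-learn version):

```python
import numpy as np

def prune_correlated_trees(forest, X_val, n_drop):
    """Drop the n_drop trees whose predictions correlate most strongly
    with the prediction of the rest of the ensemble on X_val."""
    # Predictions of every individual tree on the validation set.
    preds = np.array([tree.predict(X_val) for tree in forest.estimators_])
    total = preds.sum(axis=0)

    # Correlation of each tree with the ensemble built from the other trees.
    scores = []
    for i, p in enumerate(preds):
        rest_mean = (total - p) / (len(preds) - 1)
        scores.append(np.corrcoef(p, rest_mean)[0, 1])

    # Keep the least redundant trees, drop the n_drop most correlated ones.
    keep = sorted(np.argsort(scores)[:len(preds) - n_drop])
    forest.estimators_ = [forest.estimators_[i] for i in keep]
    forest.n_estimators = len(forest.estimators_)  # keep the attribute consistent
    return forest
```

Since only whole trees are removed, the pickled model shrinks roughly in proportion to the number of dropped trees, without any retraining.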

Alternatively, you can train a linear model that takes as inputs the outputs of the individual trees in the ensemble, and include an L1 penalty in the training to enforce sparse weights. The trees whose weights end up at zero or very close to it are candidates for removal from the ensemble with little impact on accuracy.
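
Because the question is about a regressor, a hedged sketch of that L1 idea using scikit-learn's Lasso (forest, X_val and y_val are assumed to already exist; the alpha value is only a placeholder that would need tuning):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Per-tree predictions on a validation set form the feature matrix.
tree_preds = np.column_stack([tree.predict(X_val) for tree in forest.estimators_])

# The L1 penalty drives the weights of redundant trees towards zero.
lasso = Lasso(alpha=0.01, positive=True)
lasso.fit(tree_preds, y_val)

# Trees with (near) zero weight are candidates for removal.
removable = set(np.where(np.abs(lasso.coef_) < 1e-6)[0])
forest.estimators_ = [t for i, t in enumerate(forest.estimators_) if i not in removable]
forest.n_estimators = len(forest.estimators_)
```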

albarji
  • I've been reviewing the literature and it looks like this approach can improve accuracy and speed. Why isn't it implemented by default? – Andrew Brēza Aug 19 '22 at 02:40