Parallelizing random forests

Question

After some searching and asking around, I've found many packages that would let me use all the cores of my server, and many packages that implement random forests.

I'm quite new to this, and I'm getting lost among all the ways to parallelize the training of my random forest. Could you give some advice on reasons to use or avoid each of them, or on specific combinations (with or without caret) that have proven themselves?

Packages for parallelization:

- doParallel
- doSNOW
- doSMP (discontinued?)
- doMC

(And what about mclapply?)

Packages for random forest:

[caret + some of the following]

- rf
- parRF
- randomForest
- ranger
- Rborist
- parallelRandomForest (crashes my RStudio session...)

Thanks

So does this mean you decided that you need a very large number of trees? – Tim Biegeleisen, May 13 '16 at 15:01
Thanks to your advice (and some feature engineering), I've managed to reduce both the number of features used and the training time. But unfortunately, it seems I still need many trees, yes. (But I might be doing some things wrong, I'm still exploring.) – François M., May 13 '16 at 15:12
General advice: This question is a bit broad, so it might not attract many answers. It would be better to focus on, for example, just the parallel computing R packages, and better yet to ask about a single package with random forests. – Tim Biegeleisen, May 13 '16 at 15:15
I know, I even expected it to be downvoted. The thing is, I've found so many parallelization packages, random forest packages, and combinations of the two that I'm getting lost as to which combination fits my needs. – François M., May 13 '16 at 15:53

score 3 · Accepted Answer · edited May 23 '17 at 12:24


There are a few answers on SO, such as "parallel execution of random forest in R" and "Suggestions for speeding up Random Forests", that I would take a look at.

Those posts are helpful, but they are a bit older. The ranger package is an especially fast implementation of random forest, so if you are new to this, it might be the easiest way to speed up your model training. The ranger authors' paper discusses the tradeoffs of some of the available packages: which package gives you the best performance depends on your data size and number of features.
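To make the ranger suggestion concrete, here is a minimal sketch of a multithreaded fit. The `iris` data, the tree count, and the "all cores but one" rule are illustrative assumptions, not recommendations from the paper; swap in your own data and settings.

```r
# Minimal sketch: train a random forest with ranger on several threads.
# Assumptions (illustrative only): iris data, 1000 trees, all cores but one.
library(ranger)

# detectCores() can return NA on some systems, so fall back to 1.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L
n_threads <- max(1L, n_cores - 1L)

set.seed(42)
fit <- ranger(
  Species ~ .,
  data        = iris,
  num.trees   = 1000,
  num.threads = n_threads  # ranger does the multithreading itself
)

# Out-of-bag prediction error of the fitted forest
print(fit$prediction.error)

# Predictions on a data frame (here the training data, just to show the call)
preds <- predict(fit, data = iris)$predictions
```

Since ranger runs multithreaded on its own through `num.threads`, this call does not need a `foreach` backend to be registered.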


answered May 13 '16 at 15:29

Tchotchke


Thanks. Regarding the first link, will `.multicombine=TRUE` work with `caret` + `ranger`? If so, how can I pass it through `train()`? – François M. May 13 '16 at 15:59
Regarding your second link: if I use `caret` + `allowParallel = TRUE` in `train()`, I shouldn't use the `foreach` syntax, right? Do I still have to call `registerDoParallel(makeCluster(detectCores()))` (from `doParallel`, for instance) beforehand? Or, on the contrary, will that cause a problem? – François M. May 13 '16 at 16:00

The 'ranger' package is a really cool tool for speeding up random forest calculations. I checked it out recently. – Andrii Dec 12 '17 at 17:42
