
I'm working with a very large set of data: about 120,000 rows and 34 columns. As you can well imagine, when using the R package randomForest, the program takes quite a number of hours to run, even on a powerful Windows server.

Although I am no expert in randomForest, I have a question about the proper use of the combine() function.

I got conflicting answers when I researched this question online. Some say that you can only use combine() on forests trained on the same set of data. Others say you can simply use combine() regardless.

What I'd like (hope, dream) to do is break the 120,000 rows into 6 data frames of 20,000 rows each, run randomForest on each of the 6 data frames, and then use the combine() function to merge the 6 resulting forests. Is that possible? (A sketch of what I have in mind is below.)
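For concreteness, here is a minimal sketch of what I mean; `big_df` and the response column `y` are placeholder names:

```r
library(randomForest)

# Split the 120,000 rows into 6 data frames of 20,000 rows each
chunks <- split(big_df, rep(1:6, each = 20000))

# Train a separate forest on each chunk
# (y is the response; ntree kept small per chunk for speed)
forests <- lapply(chunks, function(d)
  randomForest(y ~ ., data = d, ntree = 100))

# Merge the six forests into one -- is this a valid use of combine()?
rf_all <- do.call(combine, forests)
```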

Any help in this matter would be greatly appreciated.

  • Training using sub-forests is a good idea. I don't know about the `combine` function, but I know distributedR has a [distributed randomForest](https://github.com/vertica/DistributedR/tree/master/algorithms/HPdclassifier) implementation that could be a solution to your problem. – Tad Dallas Sep 19 '15 at 15:36
  • The combine() function might cause you trouble, as you describe. I would think the easiest workaround is to not use the combine function: just train some forests, place them in a list, and aggregate votes across all forests. Or, even better, try setting sampsize=5000 and training on the entire data set. Then only 5,000 samples are chosen for each tree and it should run quite fast. (Both options are sketched after these comments.) – Soren Havelund Welling Sep 20 '15 at 20:16
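A minimal sketch of both suggestions from the comment above, assuming a classification task; `big_df`, `y`, and `test_df` are placeholder names:

```r
library(randomForest)

## Option A: train several forests and aggregate their votes manually
chunks  <- split(big_df, rep(1:6, each = 20000))
forests <- lapply(chunks, function(d)
  randomForest(y ~ ., data = d, ntree = 100))

# Sum raw class-vote counts across forests, then pick the winning class
votes <- lapply(forests, function(f)
  predict(f, newdata = test_df, type = "vote", norm.votes = FALSE))
total <- Reduce(`+`, votes)
pred  <- factor(colnames(total)[max.col(total)], levels = colnames(total))

## Option B: one forest on the full data, but each tree sees only 5,000 rows
rf <- randomForest(y ~ ., data = big_df, sampsize = 5000, ntree = 500)
```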

1 Answer


A couple of hours seems like a lot of time. Are you sure you are running on an optimized machine? Perhaps you could experiment on Linux and AWS EC2. Also check out ranger, which came out a couple of weeks ago: http://arxiv.org/abs/1508.04409 and https://cran.r-project.org/web/packages/ranger/index.html
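A minimal ranger sketch, again with the placeholder names `big_df`, `y`, and `test_df`:

```r
library(ranger)

# ranger is a fast C++ reimplementation of random forests;
# num.threads controls how many cores it uses
rf <- ranger(y ~ ., data = big_df, num.trees = 500, num.threads = 8)

pred <- predict(rf, data = test_df)$predictions
```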

Also check out parallel execution of random forest in R. A common pattern grows sub-forests in parallel on the same data and merges them with combine(), as sketched below.
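A minimal sketch of that pattern using foreach and doParallel, with the same placeholder names as above; note that every worker trains on the full data set, which is the use of combine() that is clearly supported:

```r
library(randomForest)
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)

# Grow 500 trees as 4 forests of 125 trees each, in parallel,
# then merge them into a single forest with randomForest::combine
rf <- foreach(ntree = rep(125, 4), .combine = combine,
              .packages = "randomForest") %dopar%
  randomForest(y ~ ., data = big_df, ntree = ntree)
```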

– ECII