
I have to run a regression with randomForest in R. My problem is that my data frame is huge: I have 12 variables and more than 400k entries. When I try to fit a randomForest regression (the code is at the bottom), the system takes many hours to process the data: after 5 or 6 hours of computation I am forced to stop the operation without any output. Can someone suggest how I can make it faster? Thanks

library(caret)
library(randomForest)

dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)


trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]

output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh  +  dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
Lupanoide

3 Answers


I don't think parallelizing on a single PC (2-4 cores) is the answer. There is plenty of lower-hanging fruit to pick.

1) RF models increase in complexity with the number of training samples. The average tree depth would be something like log(480,000/5)/log(2) = 16.5 intermediate nodes. In the vast majority of examples, 2000-10000 samples per tree are fine. If you are competing to win on Kaggle, a small bit of extra performance really matters, as winner takes all. In practice, you probably don't need that.

2) Don't clone your data set in your R code; try to keep only one copy of your data set (passing by reference is of course fine). It's not a big problem for this data set, as it is not that big (~38 MB) even for R.

3) Don't use the formula interface with the randomForest algorithm for large data sets. It will make an extra copy of the data set. But again, memory is not that much of a problem here.

4) Use a faster RF implementation: extraTrees, ranger or Rborist are available for R. extraTrees is not exactly an RF algorithm, but it is pretty close.

5) Avoid categorical features with more than 10 categories. RF can handle up to 32, but it becomes super slow because every one of the ~2^32 possible splits has to be evaluated. extraTrees and Rborist handle more categories by testing only some randomly selected splits (which works fine). Another solution, as in Python's sklearn, is to assign every category a unique integer and handle the feature as numeric. You can convert your categorical features with as.numeric before running randomForest to do the same trick.
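A minimal sketch of that integer-encoding trick (the data frame df and factor column station are hypothetical names, purely for illustration):

# map each category to an integer so randomForest treats the column as numeric
df$station <- as.numeric(as.factor(df$station))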

6) For much bigger data, split the data set into random blocks and train a few (~10) trees on each. Combine the forests or keep them separate. This will slightly increase the tree correlation. There are some nice cluster implementations for training like this, but they won't be necessary for data sets below 1-100 GB, depending on tree complexity etc.
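A minimal sketch of the block idea with plain randomForest, assuming a predictor frame X and response y like the ones built in the code below (the block count and tree counts are arbitrary); randomForest::combine() merges forests grown on the same predictors:

library(randomForest)
set.seed(1)                                            # reproducible block assignment
blocks <- split(sample(nrow(X)), rep(1:10, length.out = nrow(X)))
forests <- lapply(blocks, function(idx) randomForest(X[idx, ], y[idx], ntree = 10))
big_forest <- do.call(randomForest::combine, forests)  # one forest with 100 trees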

#below I use solutions 1-3) and get a run time of a few minutes

library(randomForest)
#simulate data 
dataset <- data.frame(replicate(12,rnorm(400000)))
dataset$Clip_pm25 = dataset[,1]+dataset[,2]^2+dataset[,4]*dataset[,3]
#dati <- data.frame(dataset) #no need to keep an extra copy of the data set in memory
#attach(dati) #if you attach dati you don't need to write dati$Clip_pm25, just Clip_pm25
#but avoid the formula interface for randomForest with large data sets because it costs extra memory and time 

#split data in X and y manually
y = dataset$Clip_pm25
X = dataset[,names(dataset) != "Clip_pm25"]
rm(dataset);gc()

object.size(X) #38Mb, no problemo

#if you were using formula interface
#output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh  +  dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
#output.forest <- randomForest(dati$Clip_pm25 ~ ., ntree=250) # use dot to indicate all variables

#start small, and scale up slowly
rf = randomForest(X,y,sampsize=1000,ntree=5) #runtime ~15 seconds
print(rf) #~67% explained var

#you probably really don't need to exceed 5000-10000 samples per tree; you could grow 2000 trees to sample most of the training set
rf = randomForest(X,y,sampsize=5000,ntree=500) # runtime ~5 minutes
print(rf) #~87% explained var


#regarding parallel
#here you could implement some parallel looping
#.... but is it really worth it for a 2-4x speedup?
#coding parallel on single PC is fun but rarely worth the effort

#If you work at some company or university with a decent computer cluster,
#then you can spawn the process across 20-80-200 nodes and get a ~10-60-150x speedup
#I can recommend the BatchJobs package
Soren Havelund Welling

Since you are using caret, you could use method = "parRF". This is an implementation of a parallel random forest.

For example:

library(caret)
library(randomForest)
library(doParallel)

cores <- 3
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)

dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)


trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]

# 3-fold cross-validation.
my_control <- trainControl(method = "cv", number = 3 )

my_forest <- train(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh  +  Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1,
                   data = trainSet,
                   method = "parRF",
                   ntree = 250,
                   trControl=my_control)
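
Once trained, the fitted object can be used like any other caret model; a hedged usage sketch, with testSet taken from the split above:

predictions <- predict(my_forest, newdata = testSet)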

Here is a foreach implementation as well:

foreach_forest <- foreach(ntree=rep(250, cores), 
                          .combine=combine, 
                          .multicombine=TRUE, 
                          .packages="randomForest") %dopar%
   randomForest(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh  +  Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1, 
                   data = trainSet, ntree=ntree)

# don't forget to stop the cluster
stopCluster(cl)

Remember I didn't set any seeds. You might want to consider this as well. And here is a link to a randomforest package that also runs in parallel. But I have not tested this.
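
A minimal seeding sketch, assuming the cluster cl created above (fully reproducible parallel results may additionally require a parallel-aware RNG helper such as the doRNG package):

set.seed(42)                          # seed for the master R session
clusterSetRNGStream(cl, iseed = 42)   # independent RNG streams on the workers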

phiver
  • Thanks for the answers, but this code returns an error when I try to compute my_forest: Error in summary.connection(connection) : invalid connection – Lupanoide Jan 10 '16 at 15:29
  • The one downside to R's parallel implementation is that it has to start a new instance and copy the data to each. This can lead to memory issues quickly if using commodity hardware. – Zelazny7 Jan 10 '16 at 15:39
  • @phiver I get an error with the new code when the foreach loop is computed: Error in unserialize(socklist[[n]]) : error reading from connection. – Lupanoide Jan 10 '16 at 15:58
  • @Lupanoide, try restarting your r session. see also [this post](http://stackoverflow.com/questions/24327137/error-in-unserializesocklistn-error-reading-from-connection-on-unix) and [this one](http://stackoverflow.com/questions/25097729/un-register-a-doparallel-cluster) – phiver Jan 10 '16 at 16:04
  • thanks phiver, it has been running for 5 minutes now. I hope it will finish soon, I'll update you – Lupanoide Jan 10 '16 at 16:22

The other two answers are good. Another option is to use more recent packages that are purpose-built for high-dimensional / high-volume data sets. They run their code in lower-level languages (C++ and/or Java) and in certain cases use parallelization.

I'd recommend taking a look into these three:

- ranger (written in C++)
- randomForestSRC (written in C++)
- h2o (Java; needs Java version 8 or higher)

Also, some additional reading to help you decide which package to choose: https://arxiv.org/pdf/1508.04409.pdf

Page 8 shows benchmarks of ranger against randomForest as data size grows - ranger is WAY faster because its runtime grows linearly rather than non-linearly with rising tree/sample/split/feature sizes.
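
A minimal ranger sketch on the question's data, assuming trainSet from the question (num.threads = 4 is just an example value):

library(ranger)
rf_ranger <- ranger(Clip_pm25 ~ ., data = trainSet, num.trees = 250, num.threads = 4)
rf_ranger$prediction.error  # OOB mean squared error for regression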

Good Luck!

Rish