
I have a dataset consisting of 20 features and roughly 300,000 observations. I'm using caret to train models with doParallel and four cores. Even training on 10% of my data takes well over eight hours for the methods I've tried (rf, nnet, adabag, svmPoly). I'm resampling with bootstrapping 3 times and my tuneLength is 5. Is there anything I can do to speed up this agonizingly slow process? Someone suggested that using the underlying library directly could speed up the process by as much as 10x, but before I go down that route I'd like to make sure there is no other alternative.
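
For reference, a minimal sketch of the kind of setup described above (the data frame name train_df, the outcome column name, and the exact control arguments are assumptions, not the original code):

    library(caret)
    library(doParallel)

    # register 4 workers, as described in the question
    cl <- makePSOCKcluster(4)
    registerDoParallel(cl)

    # bootstrap resampling 3 times; tuneLength 5 tries 5 values per tuning parameter
    ctrl <- trainControl(method = "boot", number = 3)

    fit <- train(outcome ~ ., data = train_df,   # train_df: hypothetical 30K-row sample
                 method = "rf",
                 trControl = ctrl,
                 tuneLength = 5)

    stopCluster(cl)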

Alexander David
  • To ask the obvious: Would it be possible for you to work with a subset of the 300K observations? You could show that a 30K subset behaves the same way as the full 300K set. – Tim Biegeleisen Oct 02 '15 at 01:57
  • Hi Tim, sorry if I wasn't clear. 8 hours was for training 30k observations (10%). Training on 1% takes a reasonable amount of time, but is not very predictive. Here's a question for you: my outcome is a binary factor ('Yes'/'No') but 'Yes' only occurs in about 20% of my total dataset. Do you think providing a test set with a more even split (say 50/50 'Yes'/'No') might allow me to train on a smaller sample size? – Alexander David Oct 02 '15 at 02:11
  • It's really that slow on 30k x 20? That's really surprising. How much RAM are you working with? – devmacrile Oct 02 '15 at 02:11
  • I've got 4GB. Random forests seem to take an incredible amount of time. – Alexander David Oct 02 '15 at 02:14
  • 1
    I don't think you should be statistically altering your input data set. I would be surprised if you can't take a smaller subset than what you have. That being said, 300K observations is not that large in the grand scheme of things. – Tim Biegeleisen Oct 02 '15 at 02:21
  • 1
    I agree, 300K is not that large which is what makes the performance so upsetting. – Alexander David Oct 02 '15 at 02:24

3 Answers


@phiver hits the nail on the head but, for this situation, there are a few things to suggest:

  • make sure that you are not exhausting your system memory by using parallel processing. You are making X extra copies of the data in memory when using X workers.
  • with a class imbalance, additional sampling can help; down-sampling the majority class might improve performance and take less time (see the sketch after this list).
  • use different libraries: ranger instead of randomForest, xgboost or C5.0 instead of gbm. Bear in mind that ensemble methods fit a ton of constituent models and are bound to take a while to fit.
  • the package has a racing-type algorithm for tuning parameters in less time
  • the development version on github has random search methods for the models with a lot of tuning parameters.
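
A minimal sketch combining two of these suggestions, down-sampling inside resampling plus ranger as the random forest engine. This assumes a caret version recent enough to have the sampling argument in trainControl and the "ranger" method; the predictors/outcome objects are placeholders, not the asker's data:

    library(caret)

    # down-sample the majority class within each resample and use ranger
    # instead of randomForest
    ctrl <- trainControl(method = "boot", number = 3,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary,
                         sampling = "down")

    fit <- train(x = predictors, y = outcome,   # placeholders for your data
                 method = "ranger",
                 metric = "ROC",
                 trControl = ctrl,
                 tuneLength = 5)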

Max

topepo

What people forget when comparing the underlying model versus using caret is that caret has a lot of extra stuff going on.

Take your random forest as an example: bootstrap with number = 3, and tuneLength = 5. So you resample 3 times, and because of the tuneLength you try 5 candidate values for mtry. In total you run 15 random forests and compare them to get the best one for the final model, versus only 1 if you use the basic random forest model.

Also, you are running in parallel on 4 cores, and randomForest needs all the observations available, so all your training observations will be in memory 4 times over. That probably leaves little memory for training the model.

My advice is to start scaling down to see if you can speed things up, like setting the bootstrap number to 1 and the tuneLength back to the default of 3. Or even set the trainControl method to "none", just to get an idea of how fast the model is with minimal settings and no resampling (a sketch of this appears below).
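
A minimal sketch of that kind of timing check (the predictors/outcome objects and the mtry value are placeholders, not from the question):

    library(caret)

    # no resampling at all: fit a single random forest with one fixed tuning
    # value, just to see how long one fit takes on this data
    ctrl_none <- trainControl(method = "none")

    system.time(
      fit_one <- train(x = predictors, y = outcome,        # placeholders for your data
                       method = "rf",
                       trControl = ctrl_none,
                       tuneGrid = data.frame(mtry = 4))    # "none" needs exactly one tuning row
    )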

phiver

Great inputs by @phiver and @topepo. I will try to summarize and add a few more points that I gathered from the bit of searching through SO posts that I did for a similar problem:

  • Yes, parallel processing costs more memory, and with less memory available it can take more time. With 8 cores and 64GB of RAM, a rule of thumb is to use 5-6 workers at most.
  • @topepo's page on caret pre-processing here is fantastic. It is step-by-step instructive and helps replace manual pre-processing work such as creating dummy variables, removing multi-collinear/linear-combination variables, and applying transformations.
  • One of the reasons randomForest and other models become really slow is the number of levels in categorical variables. It is advisable either to collapse factor levels or to convert to an ordinal/numeric representation if possible.
  • Try using the tuneGrid feature in caret to the fullest for the ensemble models. Start with the smallest values of mtry/ntree on a sample of the data and see how much the accuracy improves.
  • I found this SO page very useful, where parRF is the primary suggestion. I didn't see much improvement in my dataset by replacing rf with parRF, but you can try it out. The other suggestions there are to use data.table instead of data frames and to use the predictor/response interface (x = X, y = Y) instead of the formula interface (Y ~ .). It greatly improves speed, believe me. One caveat: the predictor/response interface also seems to somehow improve predictive accuracy and changes the variable importance table from the factor-wise breakup you get when using the formula. (A sketch of these last two points follows this list.)
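
For the last two points, a sketch of a small explicit tuneGrid combined with the predictor/response (non-formula) interface; the data frame name, column names, and grid values are made up for illustration:

    library(caret)

    # non-formula interface: pass predictors and response separately instead of Y ~ .
    X <- train_df[, setdiff(names(train_df), "outcome")]   # train_df / outcome are placeholders
    Y <- train_df$outcome

    # small, explicit grid instead of letting tuneLength expand one
    grid <- expand.grid(mtry = c(2, 4, 6))

    ctrl <- trainControl(method = "boot", number = 3)

    fit <- train(x = X, y = Y,
                 method = "rf",
                 trControl = ctrl,
                 tuneGrid = grid,
                 ntree = 100)   # start with fewer trees; increase if accuracy needs it
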
KarthikS