
I have a data set of US home prices that spans 50 states. I want to fit one GBM per state, in parallel, and use the cv.folds argument of the gbm package in R to run 3-fold cross-validation and pick the best n.trees value for each model.

My Code:

library(gbm)
library(plyr)
library(doMC)
doMC::registerDoMC(cores = detectCores())

gbms = dlply(.data = df, .variables = "State", .fun = function(df_temp) {
    gbm(log(Price) ~ ., 
        data = df_temp[, c(features, outcome)],
        distribution = "gaussian",
        n.trees = 5000,
        shrinkage = 0.001,
        interaction.depth = 3,
        n.minobsinnode = 10,
        bag.fraction = 0.5,
        train.fraction = 0.8,
        cv.folds = 3, # if I turn this to 0, the code runs fine
        keep.data = FALSE
        )
    }, .parallel = TRUE
  )

The above code returns the following error:

Error in do.ply(i) : task 1 failed - "cannot open the connection"

However, if I change cv.folds = 3 to cv.folds = 0, the code runs fine and I get my 50 GBMs, but without cross-validation I have no CV-based estimate of the best n.trees for each model.

Note that with .parallel = FALSE the code also works, but it takes a very long time because everything runs on a single core. I got exactly the same error when I built the models with foreach instead of dlply.

How can I fix this? Your help would be greatly appreciated.

  • If you're on Windows, note that doMC requires Linux. Use doParallel instead. – Hong Ooi May 21 '17 at 03:56
  • Maybe this [Q&A on SO](http://stackoverflow.com/questions/40426115/how-to-use-domc-under-windows-or-alternative-parallel-processing-implementation) is of interest to you. – KoenV May 21 '17 at 06:22
  • I'm running my script on an EC2 instance with the Amazon Linux AMI, so it's not a Windows problem. – Tony Kassab May 21 '17 at 07:00
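
One thing that may be worth trying, offered as an unconfirmed guess rather than a diagnosis: when cv.folds is greater than 1, gbm() can start its own worker cluster to run the folds, and launching that from inside every doMC worker could plausibly exhaust the available connections. The sketch below passes n.cores = 1 to gbm() so each fork runs its folds serially, then reads the CV-optimal tree count with gbm.perf(). The data frame, the Sqft and Beds columns, the fit_state helper, and the smaller n.trees and larger shrinkage are all hypothetical stand-ins chosen so the example runs quickly.

```r
library(gbm)
library(plyr)
library(doMC)

# One forked worker per core, as in the question.
registerDoMC(cores = parallel::detectCores())

# Synthetic stand-in for the real data (hypothetical columns).
set.seed(1)
df <- data.frame(
  State = rep(c("CA", "TX"), each = 300),
  Sqft  = runif(600, 500, 4000),
  Beds  = sample(1:5, 600, replace = TRUE)
)
df$Price <- exp(11 + 0.0003 * df$Sqft + 0.05 * df$Beds + rnorm(600, sd = 0.1))

# Hypothetical helper: fit one state's model and record its best n.trees.
fit_state <- function(df_temp) {
  fit <- gbm(log(Price) ~ Sqft + Beds,
             data = df_temp,
             distribution = "gaussian",
             n.trees = 200,          # small values so the sketch runs fast
             shrinkage = 0.05,
             interaction.depth = 3,
             n.minobsinnode = 10,
             bag.fraction = 0.5,
             cv.folds = 3,
             keep.data = FALSE,
             n.cores = 1)            # run the CV folds serially in this worker
  # Iteration with the lowest cross-validated error.
  best <- gbm.perf(fit, method = "cv", plot.it = FALSE)
  list(model = fit, best_trees = best)
}

# Parallelism comes only from the outer per-state loop.
gbms <- dlply(df, "State", fit_state, .parallel = TRUE)
```

Each element of gbms then holds the fitted model and its tree count, for example gbms[["CA"]]$best_trees, which can be passed as n.trees to predict(). If the error persists with n.cores = 1, the cause lies elsewhere and this guess does not apply.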
