I have a data set of home prices in the US. The data span 50 different states. I want to build one GBM per state, in parallel. I also want to take advantage of the cv.folds argument in the gbm package in R: I want to run a 3-fold CV to get the best n.trees value.
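To make the goal concrete, here is a minimal sketch of the per-model step on a single simulated data set, with no parallelism involved. The data frame `df_state`, its columns `sqft` and `beds`, and all numeric values are made up for illustration and are not from my real data; the smaller `n.trees` and larger `shrinkage` are only there to keep the example fast. It uses `gbm.perf()` with `method = "cv"` to read off the iteration with the lowest cross-validated error.

```r
library(gbm)

set.seed(1)

# Simulated stand-in for one state's rows (hypothetical columns and values)
n <- 500
df_state <- data.frame(
  sqft = runif(n, 500, 4000),
  beds = sample(1:5, n, replace = TRUE)
)
df_state$Price <- exp(11 + 0.0003 * df_state$sqft +
                        0.05 * df_state$beds + rnorm(n, sd = 0.2))

# Single model with 3-fold CV; settings are scaled down from the real run
fit <- gbm(log(Price) ~ .,
           data = df_state,
           distribution = "gaussian",
           n.trees = 500,
           shrinkage = 0.05,
           interaction.depth = 3,
           n.minobsinnode = 10,
           bag.fraction = 0.5,
           cv.folds = 3,
           keep.data = FALSE)

# Iteration with the lowest cross-validated error
best_n_trees <- gbm.perf(fit, method = "cv", plot.it = FALSE)
print(best_n_trees)
```

This is the value I would like to obtain for each of the 50 per-state models.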
My code:

library(gbm)
library(plyr)
library(doMC)

doMC::registerDoMC(cores = detectCores())

gbms = dlply(.data = df, .variables = "State", .fun = function(df_temp) {
  gbm(log(Price) ~ .,
      data = df_temp[, c(features, outcome)],
      distribution = "gaussian",
      n.trees = 5000,
      shrinkage = 0.001,
      interaction.depth = 3,
      n.minobsinnode = 10,
      bag.fraction = 0.5,
      train.fraction = 0.8,
      cv.folds = 3, # if I turn this to 0, the code runs fine
      keep.data = FALSE
  )
}, .parallel = TRUE)
The above code returns the following error:
Error in do.ply(i) : task 1 failed - "cannot open the connection"
However, if I change cv.folds = 3 to cv.folds = 0, the code runs fine and I get my 50 GBMs, but they are not optimized against n.trees.
Note that if I set .parallel = FALSE, the code works fine, but it takes a very long time because it runs on a single core. I also got the exact same error when I tried building the models with foreach.
How can I fix this? Your help would be greatly appreciated.