I'm using glmnet to fit some models and am cross-validating for lambda. I'm using cv.glmnet by default (since it does complete cross-validation of lambda internally), but below I focus on the first step of that function, which is the one causing issues.
First, the data setup. I don't have a reproducible example and can't share the raw data, but dim(smat) is roughly 4.7M rows by 50 columns, about half of which are dense. I tried a simplistic approach to reproducing the issue with completely random columns, to no avail.
# data setup (censored)
library(data.table)
DT = fread(...)
n_cv = 10L
# assign cross-validation group to an ID (instead of to a row)
IDs = DT[ , .(rand_id = runif(1L)), keyby = ID]
IDs[order(rand_id), cv_grp := .I %% n_cv + 1L]
DT[IDs, cv_grp := i.cv_grp, on = 'ID']
# key by cv_grp to facilitate subsetting different training sets
setkey(DT, cv_grp)
# assign row number as column to facilitate subsetting model matrix
DT[ , rowN := .I]
library(glmnet)
library(Matrix)
# y is 0/1 (actually TRUE/FALSE)
model = y ~ ...
smat = sparse.model.matrix(model, data = DT)
# this is what's done internally to 0-1 data to create
# an n x 2 matrix with FALSE in the 1st and TRUE in the 2nd column
ymat = diag(2L)[factor(DT$y), ]
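For illustration, here is what that indexing does on a toy logical vector (toy_y is made up, not my data). factor() orders the levels FALSE < TRUE, and a factor used as a row index is treated as its integer codes, so each FALSE picks row 1 of the identity matrix and each TRUE picks row 2:

```r
# toy response standing in for DT$y (hypothetical values)
toy_y = c(TRUE, FALSE, FALSE, TRUE)
# factor codes are 2, 1, 1, 2; they index rows of the 2 x 2 identity matrix
toy_ymat = diag(2L)[factor(toy_y), ]
toy_ymat
#      [,1] [,2]
# [1,]    0    1
# [2,]    1    0
# [3,]    1    0
# [4,]    0    1
```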
The following is a tailored version of what cv.glmnet does before passing to cv.lognet:
train_models = lapply(seq_len(n_cv), function(i) {
  # not-join on the key: every row outside fold i
  train_idx = DT[!.(i), rowN]
  glmnet(smat[train_idx, , drop = FALSE], ymat[train_idx, ],
         alpha = 1, family = 'binomial')
})
This appears to work fine, but is quite slow. If we replace it with the equivalent version for parallel = TRUE:
library(doMC)
registerDoMC(detectCores())
train_models_par = foreach(i = seq_len(n_cv),
                           .packages = c("glmnet", "data.table")) %dopar% {
  train_idx = DT[!.(i), rowN]
  glmnet(smat[train_idx, , drop = FALSE], ymat[train_idx, ],
         alpha = 1, family = 'binomial')
}
The glmnet call fails silently on some nodes (compare any(sapply(train_models, is.null)), which is FALSE for the serial version):
sapply(train_models_par, is.null)
# [1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Which task fails is inconsistent (so it's not a problem with, e.g., cv_grp = 2 per se). I've tried capturing the output of glmnet and checking is.null, to no avail. I've also added the .verbose = TRUE flag to foreach, and nothing suspicious emerges. Note that the data.table syntax is ancillary: the default behavior of cv.glmnet (which also results in similar failures) relies on which = foldid == i to split training and test sets.
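For reference, that default split looks roughly like the following on toy data. This is only a sketch: toy_foldid and toy_x are made up, and the real code passes each training subset on to glmnet rather than just collecting it.

```r
# toy fold assignment and design matrix (hypothetical values)
toy_foldid = c(1L, 2L, 3L, 1L, 2L, 3L)
toy_x = matrix(seq_len(12L), nrow = 6L)

train_sets = lapply(seq_len(max(toy_foldid)), function(i) {
  which = toy_foldid == i          # TRUE for the held-out fold
  toy_x[!which, , drop = FALSE]    # training rows for fold i
})
```

Each element of train_sets holds the rows outside one fold, which is the same set of rows that DT[!.(i), rowN] selects in my version.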
How can I debug this problem? Why might the task fail when parallelized, but not serially, and how can I catch when the task has failed (so that I can try-and-retry, for example)?
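To make the last part concrete, this is roughly the kind of try-and-retry wrapper I have in mind. It is only a sketch: retry_failed and its task argument are hypothetical names, and it assumes a failed task shows up either as NULL (as above) or as an error object (which is what foreach returns when called with .errorhandling = "pass"). It detects failures after the fact, but says nothing about why they happen.

```r
# Hypothetical helper (not part of glmnet or foreach): rerun failed tasks serially.
# Assumes a failed task is either NULL or an error / try-error object.
retry_failed = function(results, task, max_tries = 3L) {
  is_failed = function(x) {
    is.null(x) || inherits(x, "error") || inherits(x, "try-error")
  }
  for (attempt in seq_len(max_tries)) {
    failed = which(vapply(results, is_failed, logical(1L)))
    if (!length(failed)) break
    message("attempt ", attempt, ": rerunning task(s) ", toString(failed))
    # tryCatch keeps a failing rerun from aborting the whole loop
    results[failed] = lapply(failed, function(i) {
      tryCatch(task(i), error = function(e) e)
    })
  }
  results
}

# toy check: task 2 "failed" in parallel and is recomputed serially
toy_results = list("fit1", NULL, "fit3")
fixed = retry_failed(toy_results, task = function(i) paste0("fit", i))
```

With the real data, task would be the body of the foreach loop, i.e. a function of i that calls glmnet on the i-th training set.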
Current info about environment:
sessionInfo()
# R version 3.4.3 (2017-11-30)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 16.04.3 LTS
#
# Matrix products: default
# BLAS: /usr/lib/libblas/libblas.so.3.6.0
# LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8
# [2] LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8
# [4] LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8
# [6] LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8
# [8] LC_NAME=C
# [9] LC_ADDRESS=C
# [10] LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8
# [12] LC_IDENTIFICATION=C
#
# attached base packages:
# [1] parallel stats graphics grDevices utils
# [6] datasets methods base
#
# other attached packages:
# [1] ggplot2_2.2.1 doMC_1.3.5
# [3] iterators_1.0.8 glmnet_2.0-13
# [5] foreach_1.4.3 Matrix_1.2-12
# [7] data.table_1.10.5
#
# loaded via a namespace (and not attached):
# [1] Rcpp_0.12.14 lattice_0.20-35
# [3] codetools_0.2-15 plyr_1.8.3
# [5] grid_3.4.3 gtable_0.1.2
# [7] scales_0.5.0 rlang_0.1.4
# [9] lazyeval_0.2.1 tools_3.4.3
# [11] munsell_0.4.2 yaml_2.1.13
# [13] compiler_3.4.3 colorspace_1.2-4
# [15] tibble_1.3.4
system('free -m')
#               total        used        free      shared  buff/cache   available
# Mem:          30147        1786       25087           1        3273       28059
# Swap:             0           0           0
detectCores()
# [1] 16
system('lscpu | grep "Model name"')
# Model name: Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz