
I'm using glmnet to fit some models and am cross-validating for lambda. I'm using cv.glmnet by default (since it does complete cross-validation of lambda internally), but below I focus on the first step of that function, which is the one causing issues.

First, the data setup. I haven't made a reproducible example and can't share the raw data, but dim(smat) is roughly 4.7M rows by 50 columns, about half of which are dense. I tried a simplistic approach to reproducing the issue with completely random columns, to no avail.

# data setup (censored)
library(data.table)
DT = fread(...)
n_cv = 10L

# assign cross-validation group to an ID (instead of to a row)
IDs = DT[ , .(rand_id = runif(1L)), keyby = ID]
IDs[order(rand_id), cv_grp := .I %% n_cv + 1L]
DT[IDs, cv_grp := i.cv_grp, on = 'ID']

# key by cv_grp to facilitate subsetting different training sets
setkey(DT, cv_grp)
# assign row number as column to facilitate subsetting model matrix
DT[ , rowN := .I]

library(glmnet)
library(Matrix)

# y is 0/1 (actually TRUE/FALSE)
model = y ~ ...
smat = sparse.model.matrix(model, data = DT)
# this is what's done internally to 0-1 data to create
#   an n x 2 matrix with FALSE in the 1st and TRUE in the 2nd column
ymat = diag(2L)[factor(DT$y), ]
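
To make the ymat construction concrete, here is a tiny sketch on a made-up logical vector (y_toy is a hypothetical stand-in, not the real response). Indexing the 2 x 2 identity matrix by a factor uses the factor's integer codes, so FALSE picks row 1 and TRUE picks row 2:

```r
# y_toy is an illustrative stand-in for DT$y (assumption, not real data)
y_toy = c(TRUE, FALSE, TRUE)

# factor(y_toy) has levels FALSE, TRUE with integer codes 1, 2;
# indexing the identity matrix by those codes yields one indicator row per obs
ymat_toy = diag(2L)[factor(y_toy), ]

# column 1 flags FALSE, column 2 flags TRUE
print(ymat_toy)
```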

The following is a tailored version of what cv.glmnet does before passing to cv.lognet:

train_models = lapply(seq_len(n_cv), function(i) {
  train_idx = DT[!.(i), rowN]
  glmnet(smat[train_idx, , drop = FALSE], ymat[train_idx, ],
         alpha = 1, family = 'binomial')
})

This appears to work fine, but is quite slow. If we replace this by the equivalent version for parallel = TRUE:

library(doMC)
registerDoMC(detectCores())
train_models_par = foreach(i = seq_len(n_cv), .packages = c("glmnet", "data.table")) %dopar% {
  train_idx = DT[!.(i), rowN]
  glmnet(smat[train_idx, , drop = FALSE], ymat[train_idx, ],
         alpha = 1, family = 'binomial')
}

The glmnet call fails silently for some tasks, leaving NULL in the corresponding slots of the result (whereas any(sapply(train_models, is.null)) is FALSE for the serial version):

sapply(train_models_par, is.null)
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Which task fails is inconsistent (so it's not a problem with, e.g., cv_grp = 2 per se). I've tried capturing the output of glmnet and checking is.null to no avail. I've also added the .verbose = TRUE flag to foreach and nothing suspicious emerges. Note that the data.table syntax is ancillary, as the default behavior of cv.glmnet (which also results in similar failures) relies on using which = foldid == i to split training and test sets.
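
For reference, the foldid-based split amounts to a logical mask per fold. A minimal sketch on toy data (x_toy, foldid and the sizes are made up for illustration; this is not the actual cv.glmnet source):

```r
# Toy stand-ins (assumptions): 12 rows, 3 folds of equal size
set.seed(1)
n_folds = 3L
foldid  = sample(rep(seq_len(n_folds), length.out = 12L))
x_toy   = matrix(rnorm(12L * 2L), ncol = 2L)

# For each fold i, hold out the rows with foldid == i and train on the rest
splits = lapply(seq_len(n_folds), function(i) {
  which = foldid == i
  list(train = x_toy[!which, , drop = FALSE],
       test  = x_toy[which, , drop = FALSE])
})
```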

How can I debug this problem? Why might the task fail when parallelized, but not serially, and how can I catch when the task has failed (so that I can try-and-retry, for example)?
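
The try-and-retry idea could look roughly like the sketch below. It assumes only that a failed task shows up as a NULL slot in the result list; retry_nulls and its arguments are hypothetical names, and it does nothing to explain why the slot is NULL in the first place:

```r
# Hypothetical helper: re-run, serially, every task whose slot came back NULL.
# `results` is the list returned by the parallel loop; `task` is a function of
# the task index that returns the fitted model.
retry_nulls = function(results, task, max_tries = 3L) {
  for (attempt in seq_len(max_tries)) {
    failed = which(vapply(results, is.null, logical(1L)))
    if (!length(failed)) break
    # assigning a list via [ ] keeps the slot even if the retry is NULL again
    results[failed] = lapply(failed, task)
  }
  results
}

# Toy usage: slots 2 and 4 "failed" and get recomputed
toy   = list("a", NULL, "c", NULL)
fixed = retry_nulls(toy, function(i) paste0("refit_", i))
```

With the real data, task would wrap the same body as the loop above (building train_idx for fold i and calling glmnet), so that NULL slots in train_models_par get refit one at a time.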

Current info about environment:

sessionInfo()
# R version 3.4.3 (2017-11-30)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 16.04.3 LTS
# 
# Matrix products: default
# BLAS: /usr/lib/libblas/libblas.so.3.6.0
# LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8      
#  [2] LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8       
#  [4] LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8   
#  [6] LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8      
#  [8] LC_NAME=C                 
#  [9] LC_ADDRESS=C              
# [10] LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8
# [12] LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] parallel  stats     graphics  grDevices utils    
# [6] datasets  methods   base     
# 
# other attached packages:
# [1] ggplot2_2.2.1     doMC_1.3.5       
# [3] iterators_1.0.8   glmnet_2.0-13    
# [5] foreach_1.4.3     Matrix_1.2-12    
# [7] data.table_1.10.5
# 
# loaded via a namespace (and not attached):
#  [1] Rcpp_0.12.14     lattice_0.20-35 
#  [3] codetools_0.2-15 plyr_1.8.3      
#  [5] grid_3.4.3       gtable_0.1.2    
#  [7] scales_0.5.0     rlang_0.1.4     
#  [9] lazyeval_0.2.1   tools_3.4.3     
# [11] munsell_0.4.2    yaml_2.1.13     
# [13] compiler_3.4.3   colorspace_1.2-4
# [15] tibble_1.3.4   

system('free -m')
#               total        used        free      shared  buff/cache   available
# Mem:          30147        1786       25087           1        3273       28059
# Swap:             0           0           0

detectCores()
# [1] 16

system('lscpu | grep "Model name"')
# Model name:            Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
MichaelChirico
  • what is your rsession info and system specs? Otherwise maybe just use `try`/`purrr::safely` to see if you can return an error that way? – zacdav Mar 03 '18 at 03:18
  • @zacdav added a bunch of info about my instance. Could you elaborate / point to some example of using `purrr`? As noted, it's not an _error_ that's emerging, and simply trying to catch the output with `is.null` doesn't seem to work either. – MichaelChirico Mar 03 '18 at 04:57
  • did you find any solution for this @MichaelChirico? – abhiieor Jan 07 '20 at 13:53
  • @abhiieor I'm afraid not, sorry – MichaelChirico Jan 07 '20 at 13:57

0 Answers