
I am trying to use parallel computation to compute percentile bootstrap 95% confidence intervals for least absolute deviations regression parameters, as explained in this article. However, I am not working with a single data frame, but rather with a multiply imputed data set (mids) object obtained with the mice package for multiple imputation. This is where the problem lies.

I would like to use the mids object (or a list of multiply imputed data sets) in a foreach loop, perform the bootstrapping, and pool the results. I managed to get results based on just one single data set by converting the mids object into a list and then using one single element of that list. Nonetheless, I would like to use all data sets at once.

A reproducible example:

library(foreach)
library(doParallel)
cores_2_use <- detectCores() - 1

cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)

library(mice)
imp_merged <-
  foreach(no = 1:cores_2_use, 
          .combine = ibind, 
          .export = "nhanes",
          .packages = "mice") %dopar%
  {
    mice(nhanes, m = 30, printFlag = FALSE)
  }
stopCluster(cl)

And here is what I have tried:

library(quantreg)
library(mitml)
library(miceadds)
library(splines)

cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)

boot.1 <- foreach(i = 1:100,
                  .combine = rbind,
                  .packages = c('quantreg', 'mice', 'mitml', 'splines')) %dopar% {
                    
                    longlist <- miceadds::mids2datlist(imp_merged)
                    boot_dat <- longlist[[6]][sample(1:nrow(longlist[[6]]), replace = TRUE), ]
                    ## This is now based only on the 6th element of longlist
                    ## I would like to use the whole mids/longlist object (330 data sets on my PC)
                    
                    fit1 <- rq(chl ~ ns(bmi, df = 2, B = c(21, 33)) +
                                 hyp + age, tau = 0.5,
                               data = boot_dat)
                    fit1$coef
                  }
stopCluster(cl)

boot.1.df <- as.data.frame(boot.1)
boot.1.pooled <- do.call(cbind, boot.1.df)
boot.1.ci <- apply(boot.1.pooled, 2, quantile, probs = c(0.025, 0.975))
t(boot.1.ci)

I converted the mids object into a list of multiply imputed data sets with `longlist <- miceadds::mids2datlist(imp_merged)` and performed the sampling based on one single element (i.e., imputed data set) of that list through `boot_dat <- longlist[[6]][sample(1:nrow(longlist[[6]]), replace = TRUE), ]`. I would like to use the whole mids object or all elements of `longlist`.

Any help will be much appreciated!

Dion Groothof
  • Is this really an issue with parallelization? It seems you have an issue with what is inside the foreach loop. – F. Privé Jul 01 '20 at 05:27
  • You are right, thank you. I edited the post accordingly. – Dion Groothof Jul 01 '20 at 13:01
  • When you say you want to use the whole mids object, you want the pooled results across all 30 imputation samples, right? (btw, since you set `m = 30`, is the "330 data sets" a typo?) You could apply the bootstrap algorithm to all 30 imputations, obtain the bootstrap standard error of the statistic you are interested in and use `mice::pool.scalar` to combine between- and within-imputations variance. You may have to transform your statistics before pooling (and back-transform the pooled statistic afterwards), see [Stef van Buuren's ebook](https://stefvanbuuren.name/fimd/sec-pooling.html). – benimwolfspelz Jul 01 '20 at 14:48
  • @benimwolfspelz I'm not sure it is a typo. Just that there are 11 cores? – F. Privé Jul 01 '20 at 14:55
  • @F.Privé I see where that number could come from. But this would mean that Dion made 330 imputation samples, which is an extreme number given that [experts recommend](https://stefvanbuuren.name/fimd/sec-howmany.html) 20-100 (or 5-10 in the past). – benimwolfspelz Jul 01 '20 at 15:06
  • @benimwolfspelz Thanks for the suggestion. I am well aware of the recommended number of imputation samples. The code stems from [this answer](https://stackoverflow.com/a/27087791/10852113), which I just used to create a reproducible example. – Dion Groothof Jul 01 '20 at 15:21

1 Answer


One possible way is to simply combine the data sets into one big data set and sample from it directly.

longlist_ <- longlist[[1]]
for (j in 2:length(longlist)) {
  longlist_ <- rbind(longlist_, longlist[[j]])
}
## draw a bootstrap sample of a single data set's size from the combined rows
boot_dat <- longlist_[sample(nrow(longlist_), nrow(longlist[[1]]), replace = TRUE), ]
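The accumulation loop can also be collapsed into a single `do.call(rbind, ...)` call. A minimal sketch, using a toy list of equally shaped data frames as a stand-in for `longlist` (so it runs without mice):

```r
## toy stand-in for longlist: three equally shaped imputed data frames
longlist <- replicate(3, data.frame(a = 1:4, b = 5:8), simplify = FALSE)

longlist_ <- do.call(rbind, longlist)  # stack all imputed data sets
## bootstrap sample of a single data set's size, drawn from the combined rows
boot_dat <- longlist_[sample(nrow(longlist_), nrow(longlist[[1]]), replace = TRUE), ]
```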

Another way is to randomly choose a data set, then randomly choose a row within it, and repeat until the bootstrap sample is complete.

boot_dat <- NULL
for (j in seq(nrow(longlist[[1]]))) {
  d <- sample(length(longlist), 1)     # pick an imputed data set at random
  r <- sample(nrow(longlist[[d]]), 1)  # pick a row within that data set
  boot_dat <- rbind(boot_dat, longlist[[d]][r, ])
}
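Growing `boot_dat` row by row with `rbind` is slow; the same sampling can be vectorized. A sketch under the assumption that all elements of `longlist` are equally sized data frames (toy data below stands in for the real `longlist`):

```r
set.seed(1)
## toy stand-in for longlist: five equally sized imputed data frames
longlist <- replicate(5, data.frame(x = rnorm(25), y = rnorm(25)),
                      simplify = FALSE)

n  <- nrow(longlist[[1]])
ds <- sample(length(longlist), n, replace = TRUE)  # data set index for each bootstrap row
rw <- sample(n, n, replace = TRUE)                 # row index within that data set
boot_dat <- do.call(rbind,
                    Map(function(d, r) longlist[[d]][r, , drop = FALSE], ds, rw))
```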

Note that to avoid a "singular design matrix" error in rq, a small amount of noise can be added:

boot_dat[,'hyp'] = boot_dat[,'hyp'] + runif(nrow(boot_dat), -1e-10, 1e-10)
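To see why this helps, here is a small self-contained illustration with a toy vector standing in for the `hyp` column: the uniform noise breaks exact ties among the covariate values while perturbing each value by at most 1e-10.

```r
set.seed(42)
hyp <- rep(c(1, 2), length.out = 10)                # toy binary covariate with many ties
hyp_jit <- hyp + runif(length(hyp), -1e-10, 1e-10)  # same jitter as above

anyDuplicated(hyp_jit)   # 0: all values are now distinct
max(abs(hyp_jit - hyp))  # perturbation is at most 1e-10
```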
Ryan SY Kwan