
I am trying to use parallel computation to compute percentile bootstrap 95% confidence intervals for least absolute deviations regression parameters, as explained in this article. However, I am not working with a single data frame, but rather with a multiply imputed data set (mids) object obtained with the mice package for multiple imputation. This is where the problem lies.

I would like to use the mids object (or a list of multiply imputed data sets) in a foreach loop, perform the bootstrapping, and pool the results. I managed to get results based on just one single data set by converting the mids object into a list and then using one single element of that list. Nonetheless, I would like to use all data sets at once.

A reproducible example:

library(foreach)
library(doParallel)
cores_2_use <- detectCores() - 1

cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)

library(mice)
imp_merged <-
  foreach(no = 1:cores_2_use, 
          .combine = ibind, 
          .export = "nhanes",
          .packages = "mice") %dopar%
  {
    mice(nhanes, m = 30, printFlag = FALSE)
  }
stopCluster(cl)

And here is what I have tried:

library(quantreg)
library(mitml)
library(miceadds)
library(splines)

cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)

boot.1 <- foreach(i = 1:100,
                  .combine = rbind,
                  .packages = c('quantreg', 'mice', 'mitml', 'splines')) %dopar% {
                    
                    longlist <- miceadds::mids2datlist(imp_merged)
                    boot_dat <- longlist[[6]][sample(1:nrow(longlist[[6]]), replace = TRUE), ]
                    ## This is now based only on the 6th element of longlist
                    ## I would like to use the whole mids/longlist object (330 data sets on my PC)
                    
                    fit1 <- rq(chl ~ ns(bmi, df = 2, B = c(21, 33)) +
                                 hyp + age, tau = 0.5,
                               data = boot_dat)
                    fit1$coef
                  }
stopCluster(cl)

boot.1.df <- as.data.frame(boot.1)
boot.1.pooled <- do.call(cbind, boot.1.df)
boot.1.ci <- apply(boot.1.pooled, 2, quantile, probs = c(0.025, 0.975))
t(boot.1.ci)

I converted the mids object into a list of multiply imputed data sets with `longlist <- miceadds::mids2datlist(imp_merged)` and performed the sampling based on one single element (i.e., imputed data set) of that list through `boot_dat <- longlist[[6]][sample(1:nrow(longlist[[6]]), replace = TRUE), ]`. I would like to use the whole mids object or all elements of `longlist`.

Any help will be much appreciated!

Dion Groothof
  • Is this really an issue with parallelization? It seems you have an issue with what is inside the foreach loop. – F. Privé Jul 01 '20 at 05:27
  • You are right, thank you. I edited the post accordingly. – Dion Groothof Jul 01 '20 at 13:01
  • When you say you want to use the whole mids object, you want the pooled results across all 30 imputation samples, right? (btw, since you set `m = 30`, is the "330 data sets" a typo?) You could apply the bootstrap algorithm to all 30 imputations, obtain the bootstrap standard error of the statistic you are interested in and use `mice::pool.scalar` to combine between- and within-imputations variance. You may have to transform your statistics before pooling (and back-transform the pooled statistic afterwards), see [Stef van Buuren's ebook](https://stefvanbuuren.name/fimd/sec-pooling.html). – benimwolfspelz Jul 01 '20 at 14:48
  • @benimwolfspelz I'm not sure it is a typo. Just that there are 11 cores? – F. Privé Jul 01 '20 at 14:55
  • @F.Privé I see where that number could come from. But this would mean that Dion made 330 imputation samples, which is an extreme number given that [experts recommend](https://stefvanbuuren.name/fimd/sec-howmany.html) 20-100 (or 5-10 in the past). – benimwolfspelz Jul 01 '20 at 15:06
  • @benimwolfspelz Thanks for the suggestion. I am well aware of the recommended number of imputation samples. The code stems from [this answer](https://stackoverflow.com/a/27087791/10852113), which I just used to create a reproducible example. – Dion Groothof Jul 01 '20 at 15:21

1 Answer


One possible way is to simply combine the data sets into one big data set and sample from it directly.

longlist_ <- longlist[[1]]
for (j in 2:length(longlist)) {
  longlist_ <- rbind(longlist_, longlist[[j]])
}
## draw a bootstrap sample of a single data set's size from the combined rows
boot_dat <- longlist_[sample(nrow(longlist_), nrow(longlist[[1]]), replace = TRUE), ]
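The accumulation loop can also be collapsed into a single `do.call(rbind, ...)` call. A minimal sketch, using a toy list of equally shaped data frames as a stand-in for `longlist` (so it runs without mice):

```r
## toy stand-in for longlist: three equally shaped imputed data frames
longlist <- replicate(3, data.frame(a = 1:4, b = 5:8), simplify = FALSE)

longlist_ <- do.call(rbind, longlist)  # stack all imputed data sets
## bootstrap sample of a single data set's size, drawn from the combined rows
boot_dat <- longlist_[sample(nrow(longlist_), nrow(longlist[[1]]), replace = TRUE), ]
```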

Another way is to randomly choose a data set, then randomly choose a row within it, and repeat until the bootstrap sample is complete.

boot_dat <- NULL
for (j in seq(nrow(longlist[[1]]))) {
  d <- sample(length(longlist), 1)     # pick an imputed data set at random
  r <- sample(nrow(longlist[[d]]), 1)  # pick a row within that data set
  boot_dat <- rbind(boot_dat, longlist[[d]][r, ])
}
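Growing `boot_dat` row by row with `rbind` is slow; the same sampling can be vectorized. A sketch under the assumption that all elements of `longlist` are equally sized data frames (toy data below stands in for the real `longlist`):

```r
set.seed(1)
## toy stand-in for longlist: five equally sized imputed data frames
longlist <- replicate(5, data.frame(x = rnorm(25), y = rnorm(25)),
                      simplify = FALSE)

n  <- nrow(longlist[[1]])
ds <- sample(length(longlist), n, replace = TRUE)  # data set index for each bootstrap row
rw <- sample(n, n, replace = TRUE)                 # row index within that data set
boot_dat <- do.call(rbind,
                    Map(function(d, r) longlist[[d]][r, , drop = FALSE], ds, rw))
```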

Note that to avoid a "singular design matrix" error in rq, a small amount of noise can be added:

boot_dat[,'hyp'] = boot_dat[,'hyp'] + runif(nrow(boot_dat), -1e-10, 1e-10)
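To see why this helps, here is a small self-contained illustration with a toy vector standing in for the `hyp` column: the uniform noise breaks exact ties among the covariate values while perturbing each value by at most 1e-10.

```r
set.seed(42)
hyp <- rep(c(1, 2), length.out = 10)                # toy binary covariate with many ties
hyp_jit <- hyp + runif(length(hyp), -1e-10, 1e-10)  # same jitter as above

anyDuplicated(hyp_jit)   # 0: all values are now distinct
max(abs(hyp_jit - hyp))  # perturbation is at most 1e-10
```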
Ryan SY Kwan