I want to run 150 multiple imputations using mice in R. However, in order to save some computing time, I would like to split the process into parallel streams (as suggested by Stef van Buuren in "Flexible Imputation of Missing Data").

My question is: how do I do that?

I can imagine 2 options:

opt.1:

imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)

and then combine the imputations afterwards by using complete and as.mids

opt.2:

imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)

by adding VAL_1to150, i.e. a different seed value for each run; otherwise it seems to me (I may be wrong) that if they all run on the same dataset with the same seed, you will get the same result 150 times.
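
In loop form, opt.2 would look roughly like this (just a sketch, assuming quicktry is my predictor matrix as above), although it still leaves the combining step open:

# One single-imputation run per seed value
imps <- lapply(1:150, function(s)
  mice(data, m = 1, pred = quicktry, maxit = 15, seed = s))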

Are there any other options?

Thanks

Jaap
  • I think the reason you haven't had any answers is that your question is too broad, and not specific enough. There are many resources on the topic of parallel processing using R. Try to build some code that implements your options, then post a more specific question if you run into trouble. – Andrie Jun 05 '14 at 13:24
  • Dividing per se doesn't save computing time. You would need to look into parallelization packages such as `parallel`, `snow` or `multicore`. However, learning how these work will probably cost more time than what you save with your imputation. – SimonG Aug 30 '14 at 12:21
  • Are you still interested in the answer? – Aleksandr Blekh Oct 02 '14 at 18:30
  • @SimonG: It's not that difficult in terms of implementation, I'd say the most challenging part is to know what to parallelize and how in terms of application data. I recently implemented exactly what Emanuela was trying to do and I'm extremely satisfied by the result. You can read more in my answer here: http://stackoverflow.com/a/26154261/2872891. – Aleksandr Blekh Oct 02 '14 at 18:37
  • Yes, Aleksandr, I'm still interested! I will definitively try the procedure you suggest. – emanuela.struffolino Oct 03 '14 at 06:35

2 Answers

So the main problem is combining the imputations, and as I see it there are three options: using ibind, using complete as described in the question, or trying to keep the mids structure intact yourself. I strongly suggest the ibind solution. The others are left in the answer for those curious.

Get parallel results

Before doing anything we need to get the parallel mice imputations. The parallel part is rather simple: all we need to do is use the parallel package and make sure that we set the seed using clusterSetRNGStream:

library(parallel)
# Using all cores can slow down the computer
# significantly, I therefore try to leave one
# core alone in order to be able to do something 
# else during the time the code runs
cores_2_use <- detectCores() - 1

cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
clusterExport(cl, "nhanes")
clusterEvalQ(cl, library(mice))
imp_pars <- 
  parLapply(cl = cl, X = 1:cores_2_use, fun = function(no){
    mice(nhanes, m = 30, printFlag = FALSE)
  })
stopCluster(cl)

The above will yield cores_2_use * 30 imputed datasets.
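
As a quick sanity check (a small sketch, just inspecting the returned list):

# imp_pars is a list of mids objects, one per core, each with m = 30
length(imp_pars)                   # cores_2_use
sapply(imp_pars, function(x) x$m)  # 30 for every element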

Using ibind

As @AleksandrBlekh suggested, mice::ibind is probably the best and most straightforward solution:

imp_merged <- imp_pars[[1]]
for (n in 2:length(imp_pars)){
  imp_merged <- 
    ibind(imp_merged,
          imp_pars[[n]])
}
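
Since ibind combines two mids objects at a time, the loop can equivalently be written as a one-liner by folding the list:

# Equivalent one-liner: fold ibind over the list of mids objects
imp_merged <- Reduce(mice::ibind, imp_pars)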

Using foreach with ibind

Perhaps the simplest alternative is to use foreach:

library(foreach)
library(doParallel)
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)

library(mice)
imp_merged <-
  foreach(no = 1:cores_2_use, 
          .combine = ibind, 
          .export = "nhanes",
          .packages = "mice") %dopar%
{
  mice(nhanes, m = 30, printFlag = FALSE)
}
stopCluster(cl)
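
Either way, the merged result should behave like an ordinary mids object, so the usual pooling workflow applies, e.g.:

# imp_merged pools like any ordinary mids object
imp_merged$m  # should equal cores_2_use * 30
summary(pool(with(data = imp_merged,
                  exp = lm(bmi ~ age + hyp + chl))))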

Using complete

Extracting the full datasets using complete(..., action = "long"), rbind-ing these, and then converting back with as.mids may also work well, but it generates a slimmer object than the other two approaches:

merged_df <- nhanes
merged_df <- 
  cbind(data.frame(.imp = 0,
                   .id = 1:nrow(nhanes)),
        merged_df)
for (n in 1:length(imp_pars)){
  tmp <- complete(imp_pars[[n]], action = "long")
  tmp$.imp <- as.numeric(tmp$.imp) + max(merged_df$.imp)
  merged_df <- 
    rbind(merged_df,
          tmp)
}

imp_merged <- 
  as.mids(merged_df)

# Compare the most important columns (est and se) between the two approaches
cbind(summary(pool(with(data=imp_merged,
                        exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
      summary(pool(with(data=mice(nhanes, 
                                  m = 60, 
                                  printFlag = FALSE),
                        exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])

Gives the output:

                    est         se         est         se
(Intercept) 20.41921496 3.85943925 20.33952967 3.79002725
age         -3.56928102 1.35801557 -3.65568620 1.27603817
hyp          1.63952970 2.05618895  1.60216683 2.17650536
chl          0.05396451 0.02278867  0.05525561 0.02087995

Keeping a correct mids-object

My alternative approach below shows how to merge imputation objects and retain the full functionality behind the mids object. Since the ibind solution supersedes it, I've left this in only for anyone interested in exploring how to merge complex lists.

I've looked into mice's mids object and there are a few steps that you have to take in order to get at least a similar mids object after running in parallel. If we examine the mids object and compare two objects with two different setups we get:

library(mice)
imp <- list()
imp <- c(imp,
         list(mice(nhanes, m = 40)))
imp <- c(imp,
         list(mice(nhanes, m = 20)))

sapply(names(imp[[1]]),
       function(n)
         try(all(useful::compare.list(imp[[1]][[n]], 
                                      imp[[2]][[n]]))))

Here you can see that the call, m, imp, chainMean, and chainVar components differ between the two runs. Of these, imp is without doubt the most important, but it seems wise to update the other components as well. We will therefore start by building a mice merger function:

mergeMice <- function (imp) {
  merged_imp <- NULL
  for (n in 1:length(imp)){
    if (is.null(merged_imp)){
      merged_imp <- imp[[n]]
    }else{
      counter <- merged_imp$m
      # Update counter
      merged_imp$m <- 
        merged_imp$m + imp[[n]]$m
      # Rename chains
      dimnames(imp[[n]]$chainMean)[[3]] <-
        sprintf("Chain %d", (counter + 1):merged_imp$m)
      dimnames(imp[[n]]$chainVar)[[3]] <-
        sprintf("Chain %d", (counter + 1):merged_imp$m)
      # Merge chains
      merged_imp$chainMean <- 
        abind::abind(merged_imp$chainMean, 
                     imp[[n]]$chainMean)
      merged_imp$chainVar <- 
        abind::abind(merged_imp$chainVar, 
                     imp[[n]]$chainVar)
      for (nn in names(merged_imp$imp)){
        # Non-imputed variables are not in the
        # data.frame format but are null
        if (!is.null(imp[[n]]$imp[[nn]])){
          colnames(imp[[n]]$imp[[nn]]) <- 
            (counter + 1):merged_imp$m
          merged_imp$imp[[nn]] <- 
            cbind(merged_imp$imp[[nn]],
                  imp[[n]]$imp[[nn]])
        }
      }
    }
  }
  # TODO: The function should update the $call parameter
  return(merged_imp)
}

We can now simply merge the two above generated imputations through:

merged_imp <- mergeMice(imp)
merged_imp_pars <- mergeMice(imp_pars)

Now it seems that we get the right output:

# Compare the three alternatives
cbind(
  summary(pool(with(data=merged_imp,
                    exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
  summary(pool(with(data=merged_imp_pars,
                    exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
  summary(pool(with(data=mice(nhanes,
                              m = merged_imp$m,
                              printFlag = FALSE),
                    exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])

Gives:

                    est         se         est        se
(Intercept) 20.16057550 3.74819873 20.31814393 3.7346252
age         -3.67906629 1.19873118 -3.64395716 1.1476377
hyp          1.72637216 2.01171565  1.71063127 1.9936347
chl          0.05590999 0.02350609  0.05476829 0.0213819
                    est         se
(Intercept) 20.14271905 3.60702992
age         -3.78345532 1.21550474
hyp          1.77361005 2.11415290
chl          0.05648672 0.02046868

Ok, that's it. Have fun.

Max Gordon
  • Just FYI: I think that you could have simplified things a lot by using `mice`'s `ibind()` instead of your `mergeMice()`: http://www.inside-r.org/packages/cran/mice/docs/ibind. – Aleksandr Blekh Nov 29 '14 at 02:40
  • Thanks @AleksandrBlekh - I've added this as the recommended solution. Annoying that I didn't find it at a first glance. – Max Gordon Nov 29 '14 at 09:23
  • My pleasure! I, too, didn't notice it right away - somebody pointed me to this approach. BTW, when adding the reference, you've made a typo in the function name throughout the answer (with the exception of the code block) - it's `ibind`. – Aleksandr Blekh Nov 29 '14 at 09:33
  • This is a great answer -- I both upvoted it and saved it to Evernote -- but I am having trouble with the first solution. I have a 45000x64 data set with about 5% missingness. Running the code from the first part of the solution took about 20 minutes on a 4 core 2.5GHz laptop with m = 1, as a test. Running the same code on a server with 24 cores of equal speed took twice as long. Where am I going wrong? I tried it with m = 30 but after 2 hours I had to kill it because I didn't have enough time to see how long it would take to finish. – Hack-R Mar 16 '15 at 20:17
  • @Hack-R: In my experience this is due to lack of memory. If you run into the max memory a Windows machine will start swapping (a virtual halt) while a Linux machine crashes. I recently wrote a blog post about parallelization: http://gforge.se/2015/02/how-to-go-parallel-in-r-basics-tips/ I would recommend that you try and reduce the number of columns in your dataset (if possible), alternatively save the imputed datasets on file, return the file name, do a `rm(imp_data); gc();` and then load the files after the imputation. – Max Gordon Mar 16 '15 at 20:26
  • Great answer Max, I went through your blog as well, which is how I figured out what was going wrong when I ran the above example on my data. There are three columns to be imputed. Ideally, each could be assigned to one core, but that isn't happening. Instead, all the imputations are being performed by one core while the other two "sit" idle. The whole process is taking over 14 hours. Any idea why this is happening? –  Jun 29 '16 at 13:53
  • @Aayush you can't impute a single column. Mice forces sequential imputation where you generate multiple random datasets that allow you to estimate the error. Thus you have to spread the datasets among the cores and not the columns. – Max Gordon Jun 29 '16 at 14:10
  • Oh, right! Wonder why there isn't any significant improvement in performance then. –  Jun 29 '16 at 14:18
  • Thanks a lot for this detailed explanation. I am using the first parallel version and use mice::ibind to get imp_merged. I am going to save the merged imputed data to CSV, but when I output imp_merged$data to the CSV I found that there are still null values. I think I exported the wrong data, but I don't know which one is the merged imputed data. – user1285419 Mar 11 '21 at 07:47
  • there is now a wrapper in `mice` for parallel imputation: `parlmice`: https://rdrr.io/cran/mice/man/parlmice.html – denis Sep 15 '22 at 13:27

mice 3.15.0 and later has a new futuremice() function that provides parallel processing. It replaces the older parlmice() approach.
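
A minimal sketch (argument names per ?futuremice; adjust n.core to your machine):

library(mice)
# 150 imputations spread over 4 cores with a reproducible seed
imp <- futuremice(nhanes, m = 150, n.core = 4, parallelseed = 123)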

Stef van Buuren