
I'm aware that the Amelia R package provides some support for parallel multiple imputation (MI). However, preliminary analysis of my study's data revealed that the data is not multivariate normal, so, unfortunately, I can't use Amelia. Consequently, I've switched to the mice R package for MI, as it can handle data that is not multivariate normal.

Since the MI process via mice is very slow (I'm currently using an AWS m3.large 2-core instance), I've started wondering whether it's possible to parallelize the procedure to save processing time. Based on my review of the mice documentation, the corresponding JSS paper, and the package's source code, it appears that mice doesn't currently support parallel operation. This is unfortunate, because IMHO the MICE algorithm is naturally parallel, so a parallel implementation should be relatively easy and would yield significant savings in both time and resources.

Question: Has anyone tried to parallelize MI in the mice package, either externally (via R's parallel facilities) or internally (by modifying the source code), and what were the results, if any? Thank you!

slamballais
Aleksandr Blekh
  • Try the R `Hmisc` package `aregImpute` function, which is faster. To start out, specify that all effects of continuous variables are linear (which is what `mice` assumes). – Frank Harrell Sep 30 '14 at 11:35
  • @FrankHarrell: I appreciate your advice, Dr. Harrell! I took a quick look at the suggested function's description in the `Hmisc` documentation. The description seems a little overwhelming to me (probably due to my limited knowledge of the subject matter), plus I noticed that `mice` now has a `fastpmm` method, which, I guess, is equivalent to the corresponding one in `aregImpute()`. Even if I don't end up using `aregImpute()`, I'm sure I will use some other functions from your amazing `Hmisc` package. Best wishes, Alex. – Aleksandr Blekh Oct 01 '14 at 03:23

1 Answer


Recently, I tried to parallelize multiple imputation (MI) via the mice package externally, that is, by using R's multiprocessing facilities, in particular the parallel package, which ships with the base R distribution. Basically, the solution is to use the mclapply() function to distribute a pre-calculated share of the total number of needed imputations to each core and then combine the resulting imputed data into a single object. Performance-wise, the results of this approach exceeded my most optimistic expectations: the processing time decreased from 1.5 hours to under 7 minutes(!), and that was on only two cores. I had also removed one multilevel factor, but that shouldn't have much effect. Regardless, the result is unbelievable!
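For reference, a minimal sketch of the idea is below. It is not the exact script I ran; `dat`, `m_total`, the seed values, and the example model at the end are placeholders to adapt to your own data.

```r
## Minimal sketch: split the imputations across cores, then recombine.
## Placeholders: `dat` is your data frame; adjust m_total, n_cores, and seeds.
library(mice)
library(parallel)

n_cores    <- 2                   # e.g., an m3.large instance exposes 2 cores
m_total    <- 20                  # total number of imputed data sets wanted
m_per_core <- m_total / n_cores   # each worker's share of the imputations

## Each worker runs an independent mice() call producing its share of the
## imputations; distinct seeds keep the random streams from coinciding.
## Note: mclapply() relies on forking, so it parallelizes on Linux/macOS only.
imp_list <- mclapply(seq_len(n_cores), function(i) {
  mice(dat, m = m_per_core, printFlag = FALSE, seed = 1000 + i)
}, mc.cores = n_cores)

## Combine the per-worker mids objects into a single mids object.
imp_all <- Reduce(ibind, imp_list)

## imp_all behaves like an ordinary mice() result, e.g.:
## fit <- with(imp_all, lm(y ~ x)); pool(fit)
```

mice::ibind() combines two mids objects imputation-wise, so Reduce() folds the per-core results into one object that can be analyzed with with() and pool() as usual.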

Aleksandr Blekh
  • These results sound impressive, but unless you have about 12 cores it is a little [too good to be true](http://en.wikipedia.org/wiki/Parallel_computing#Amdahl.27s_law_and_Gustafson.27s_law). There is some overhead associated with parallelization, so gains of this size are most likely due to some other contributing factor. Anyway, I've written a short how-to for this in a related question: http://stackoverflow.com/questions/24040280/parallel-computation-of-multiple-imputation-by-using-mice-r-package – Max Gordon Nov 23 '14 at 10:39
  • @MaxGordon: Thank you for the comment. The *results* that I've shared indeed **are true**. I hope you're not accusing me of being untruthful, which 1) is not my style and 2) I have no reason to be. Whether that **particular** speedup was a result of using `mclapply()` alone or of that plus some *unaccounted-for factors* is unclear to me at this point; currently, I have more urgent tasks to take care of. Having said that, your references look interesting and I will definitely review them in detail when I have more time. – Aleksandr Blekh Nov 23 '14 at 18:39
  • Sorry, of course you are truthful. My point was simply that most people should not expect that kind of speedup with the common 4-8-core setup. My guess is that some other factor changed between the two runs; I'm not sure why mclapply would do that by itself. – Max Gordon Nov 23 '14 at 20:16
  • @MaxGordon: No problem and thank you! I agree with you on this. FYI, that speedup occurred on an `m3.large` Amazon EC2 instance, which has only 2 cores. By the way, I took a quick look at your interesting blog and nice `gmisc` package. Two functions currently look especially appealing to me: `htmlTable` and `getDescriptionsStatsBy`. I might even try them at some point. Best wishes, Alex. – Aleksandr Blekh Nov 24 '14 at 01:24