
I am trying to permute a matrix by columns using R. However, it is taking ages (the matrix is 68k x 32k integers).

I would like to do it in parallel (since each column is permuted independently). How can I achieve this? It sounds like an embarrassingly parallel for loop in R, but I couldn't find a solution.

Currently, my function is the following:

permMTX <- function(x) {
    nr <- nrow(x)
    nc <- ncol(x)
    # I'd like to parallelize this for, since each
    # column can be permuted independently
    for (i in 1:nc) {
        x[,i] <- x[sample(nr),i]
    }
    x
} 
gc5
  • Is `x` really meant to be a `matrix`? You named the function `permDF`, so I just wanted to make sure it's not a `data.frame` – catastrophic-failure Feb 16 '18 at 17:43
  • Yep, I'll correct the name of the function. Thanks. – gc5 Feb 16 '18 at 17:44
  • Different take: after doing `x_p <- permMTX(x)`, what calculations/operations are you going to do on `x_p`? – HenrikB Feb 17 '18 at 20:49
  • @HenrikB a lot of operations, actually - mainly PCA I'd say, why? – gc5 Feb 17 '18 at 22:04
  • Each time you permute, you create a new 8 GiB matrix. If you permute B times and B is large, that's a lot of memory allocations (and deallocation work for the garbage collector). If you calculate a summary statistic at the end, it's sometimes possible to rewrite the suite of operations ("postpone permutation") so that you can avoid creating B matrices. It's not simple, not obvious, but in some cases it's possible. – HenrikB Feb 17 '18 at 22:34
  • More to think about: to make sure you get truly random samples when doing parallel processing, you have to use parallel-safe random number generation (RNG). R provides `RNGkind("L'Ecuyer-CMRG")` and `parallel::nextRNGStream` etc. for this. Unfortunately, that will introduce additional overhead on top of whatever overhead you have from the parallel orchestration itself. – HenrikB Feb 17 '18 at 22:42
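A minimal sketch of the parallel-safe RNG setup HenrikB describes above (assuming a cluster `cl` created with `parallel::makeCluster()`; the seed 42 is just an example):

library(parallel)

cl <- makeCluster(2)
RNGkind("L'Ecuyer-CMRG")             # RNG designed for parallel streams
clusterSetRNGStream(cl, iseed = 42)  # give each worker its own independent stream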

2 Answers


Solution

First, I'd make use of vectorization, which should make it more efficient.

permMTX = function(x) apply(x, 2L, sample)

Then we can use the parallel package to parallelize that function:

library(parallel)

parPermMTX = function(x, cluster) parApply(cl = cluster, X = x, MARGIN = 2L, FUN = sample)

Usage

With parallel you have to register a cluster before usage. Here's an example:

cl = makeCluster(detectCores(logical = FALSE))
parPermMTX(diag(5), cl)
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    0    1    0    0    0
#[2,]    0    0    0    0    0
#[3,]    0    0    0    0    0
#[4,]    1    0    0    1    1
#[5,]    0    0    1    0    0

Because parallel works by spawning multiple R processes, you have to make sure you have enough memory to fit multiple copies of your data as well.
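As a back-of-the-envelope check (my arithmetic, not from the original answer), a 68k x 32k integer matrix occupies roughly 8 GiB, and each worker may need its own copy:

68e3 * 32e3 * 4 / 2^30  # integers take 4 bytes in R; ~8.1 GiB per copy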

I think it's also recommended to export the data to the worker processes; you can do that by simply calling

clusterExport(cl, varlist = "exampleData")

While it does run in parallel on my end, it's no faster than simply using apply. I couldn't test with data at the same scale as yours, though, so I can't be sure it'll help.

This is due to the fact that sample is already heavily optimized, so the overhead of spawning processes outweighs the cost of simply calling sample. See Why is the parallel package slower than just using apply?

On my system, sampling 68E3 integers 32E3 times takes roughly 40 seconds:

microbenchmark(sample(68E3), times = 32E3)
#Unit: milliseconds
#          expr      min       lq     mean   median       uq      max neval
# sample(68000) 1.132273 1.192923 1.290838 1.227912 1.286229 7.880191 32000
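That estimate is simply the mean time per call multiplied by the number of columns:

1.290838e-3 * 32e3  # ~41 seconds in total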

Perhaps you're running out of memory and swapping to disk, which is really slow.


2nd Solution

So, what if we batched many sequential calls to sample into each single process? This is what I tried here:

parPermMTX2 = function(x, cluster) {
    # split the columns into two halves and permute each half in its own process
    half = floor(ncol(x) / 2)
    halves = list(x[, seq(half)],
                  x[, half + seq(ncol(x) - half)])
    do.call(cbind, parLapply(cl = cluster, X = halves, fun = permMTX))
}

We divide x into two halves, call permMTX on each, then recombine with cbind.
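The same idea generalizes to one chunk of columns per worker. Here is a sketch of that variant (my own, not tested at your scale), using `parallel::clusterSplit()` to divide the column indices:

parPermMTX3 = function(x, cluster) {
    # one consecutive chunk of column indices per worker
    chunks = clusterSplit(cluster, seq_len(ncol(x)))
    parts = parLapply(cl = cluster, X = chunks,
                      fun = function(ind, m) apply(m[, ind, drop = FALSE], 2L, sample),
                      m = x)
    # chunks are consecutive, so cbind restores the original column order
    do.call(cbind, parts)
}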

Sadly, I couldn't achieve better performance this way either. So, while I answered your question, I'm not sure it's of any help at all.

catastrophic-failure
  • Very interesting answer. I think I am going to test with vectorized function if my memory fails to keep the copies of the matrix. Thanks – gc5 Feb 16 '18 at 19:24
  • @gc5 If you found another answer more useful than mine you can unaccept mine and accept theirs, just letting you know :) – catastrophic-failure Feb 18 '18 at 11:54
  • In this case I used your suggestion since it's the simplest and reasonably fast. Thanks. – gc5 Feb 18 '18 at 18:05

Disclaimer: I'm the author of package bigstatsr.

You can use shared memory (matrices stored on disk) and do this:

# devtools::install_github("privefl/bigstatsr")
library(bigstatsr)

# matrix on disk
mat <- FBM(68e3, 32e2, backingfile = "test")
# initialize each column with 1:nrow(mat)
system.time(
  big_apply(mat, a.FUN = function(X, ind) {
    print(min(ind))
    X[, ind] <- rep(rows_along(X), length(ind))
    NULL
  }, a.combine = 'c')
) # 15 sec

# permute each column, in parallel
system.time(
  big_apply(mat, a.FUN = function(X, ind) {
    print(min(ind))
    X[, ind] <- apply(X[, ind], 2, sample)
    NULL
  }, a.combine = 'c', ncores = nb_cores())
) # 27 sec

This takes 27 seconds on a tenth of the data (3,200 columns) and 378 seconds on the whole dataset (on a laptop with only 2 physical cores and 8 GB of RAM).
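Since the comments mention PCA as the downstream step, note that an FBM supports standard matrix-style subsetting, so you can pull a block back into RAM as an ordinary matrix when needed (a small assumed example below); bigstatsr also provides `big_randomSVD()` for PCA directly on the on-disk matrix:

sub <- mat[, 1:100]  # ordinary in-memory matrix holding the first 100 permuted columns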

F. Privé
  • Sounds really cool, I'll have to try this. Just one thing, though: you're the author of the package, right? It's common etiquette to include a statement or disclaimer saying that, like referring to the package as _"my package"_ or simply writing _"(which is my creation)"_, or anything else to the same effect. No big deal, just thought I'd let you know. [It's better to disclose such information to avoid being mistaken for spam](https://meta.stackoverflow.com/questions/285851/user-promoting-his-software-in-relevant-questions-without-disclosure). – catastrophic-failure Feb 16 '18 at 23:31
  • Nice solution. In this case I don't need shared memory but this approach is really useful. Thanks – gc5 Feb 17 '18 at 22:02