
Suppose that I have a 20 X 5 matrix and I would like to select subsets of its rows and do some computation with them. Further suppose that each sub-matrix is 7 X 5. I could of course do

ncomb <- combn(20, 7)

which gives me all possible combinations of 7 row indices, and I can use these to obtain sub-matrices. But even with a small 20 X 5 matrix, there are already 77520 possible combinations. So I would instead like to randomly sample some of the combinations, e.g., 5000 of them.
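For illustration, here is how one column of `ncomb` maps to a 7 X 5 sub-matrix (a minimal sketch; the matrix `M` below is just a made-up example):

M <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)  # example 20 X 5 matrix
ncomb <- combn(20, 7)
sub1 <- M[ncomb[, 1], ]                          # rows given by the first combination
dim(sub1)                                        # 7 5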

One possibility is the following:

ncomb <- combn(20, 7)
ncombsub <- ncomb[, sample(77520, 5000)]

In other words, I obtain all possible combinations and then randomly select only 5000 of them. But computing all possible combinations first would become problematic with a larger matrix, say 100 X 7.
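For a sense of scale, the number of possible combinations can be checked directly with `choose()`:

choose(20, 7)    # 77520
choose(100, 7)   # 16007560800, roughly 16 billion combinations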

So I wonder if there is a way to get subsets of combinations without first obtaining all possible combinations.

Alex
  • Yes, I think this is possible by modifying `combn` or writing your own function (which might be easier). Coming up with an algorithm for this and implementing it shouldn't be hard. – Roland Aug 17 '13 at 19:03
  • You may want to see the related post [here](http://stackoverflow.com/questions/4493287/generating-a-very-large-matrix-of-string-combinations-using-combn-and-bigmemor) – Metrics Aug 17 '13 at 19:09
  • @Roland I ended up modifying `combn()` as you suggested. Works well. – Alex Aug 18 '13 at 17:55

2 Answers


Your approach:

op <- function(){
    ncomb <- combn(20, 7)
    ncombsub <- ncomb[, sample(choose(20,7), 5000)]
    return(ncombsub)
}

A different strategy is to simply sample seven row indices from the original matrix 5000 times, replacing any duplicate samples with new ones until 5000 unique row combinations are found:

me <- function(){
  rowsample <- replicate(5000,sort(sample(1:20,7,FALSE)),simplify=FALSE)
  while(length(unique(rowsample))<5000){
     rowsample <- unique(rowsample)
     rowsample <- c(rowsample,
                    replicate(5000-length(rowsample),
                              sort(sample(1:20,7,FALSE)),simplify=FALSE))
  }
  return(do.call(cbind,rowsample))
}

This should be more efficient because it prevents you from having to calculate all of the combinations first, which will get costly as the matrix gets larger.

And yet, some benchmarking reveals that this is not the case, at least at this problem size:

library(microbenchmark)
microbenchmark(op(),me())

Unit: milliseconds
 expr      min       lq   median      uq      max neval
 op() 184.5998 201.9861 206.3408 241.430 299.9245   100
 me() 411.7213 422.9740 429.4767 474.047 490.3177   100
Thomas
  • A couple of issues. For your code to work, I think you need to also sort each column before the while loop, that is, sorting each sample of indices. Otherwise, `unique()` won't work. A second issue I think is that the argument `MARGIN` of `unique()` needs to be set to `2` (it's `1` by default). Also instead of `length(unique(rowsample))`, it needs to be `ncol(unique(rowsample))`. Since `length` gives you the total number of elements contained in a `matrix`, not the number of columns (in my case, each column is a sample, so 5000 columns are 5000 samples of indices). – Alex Aug 17 '13 at 22:48
  • @Alex Made some changes (was thinking of `replicate` returning a list, not a matrix). Turns out it's not as efficient as your original solution. And, if you allow `replicate` to simplify to a matrix, it is even slower. – Thomas Aug 18 '13 at 07:27
  • I ended up modifying the original `combn()` function, and byte-compiling it. It works fine. But thanks for this solution anyway, I think your strategy can be useful for some other things I am working on. – Alex Aug 18 '13 at 18:14

I ended up doing what @Roland suggested: I modified `combn()` and byte-compiled the code:

combn_sub <- function (x, m, nset = 5000, seed=123, simplify = TRUE, ...) {
    stopifnot(length(m) == 1L)
    if (m < 0) 
        stop("m < 0", domain = NA)
    if (is.numeric(x) && length(x) == 1L && x > 0 && trunc(x) == 
        x) 
        x <- seq_len(x)
    n <- length(x)
    if (n < m) 
        stop("n < m", domain = NA)
    m <- as.integer(m)
    e <- 0
    h <- m
    a <- seq_len(m)
    len.r <- length(r <-  x[a] )
    count <- as.integer(round(choose(n, m)))
    if( count < nset ) nset <- count
    dim.use <- c(m, nset)       

    ##-----MOD 1: Change the output matrix size--------------
    out <- matrix(r, nrow = len.r, ncol = nset) 

    if (m > 0) {
        i <- 2L
        nmmp1 <- n - m + 1L

        ##----MOD 2: Select a subset of indices
        set.seed(seed)
        samp <- sort(c(1, sample( 2:count, nset - 1 )))  

        ##----MOD 3: Start a counter.
        counter <- 2L    

        while (a[1L] != nmmp1 ) {
            if (e < n - h) {
                h <- 1L
                e <- a[m]
                j <- 1L
            }
            else {
                e <- a[m - h]
                h <- h + 1L
                j <- 1L:h
            }
            a[m - h + j] <- e + j

            #-----MOD 4: Whenever the counter matches an index in samp, 
            #a combination of row indices is produced and stored in the matrix `out`
            if(samp[i] == counter){ 
                out[, i] <- x[a]
                if( i == nset ) break
                i <- i + 1L
            }
            #-----Increase the counter by 1 for each iteration of the while-loop
            counter <- counter + 1L
        }
    }
    array(out, dim.use)
}

library("compiler")
combn_sub <- cmpfun(combn_sub)
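
A quick usage sketch for the 20-row case from the question, just checking the output shape of the function defined above:

ncombsub <- combn_sub(20, 7, nset = 5000)
dim(ncombsub)    # 7 5000: each column holds one sampled combination of row indices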
Alex