I want to sample the output of combn(x,N) randomly without replacement. Is it possible without generating calling combn first?
I don't think so, not with the current state of 32-bit integers (and bit64
, even as good as it is, doesn't catch everything).
Case in point: in order to be able to arbitrarily index the set returned by combn(10000,4)
, you probably start by determining something as straight-forward/simple as "is the first of my four numbers a '1'". Knowing that the first j
iterations of a combination generator will start with 1 (e.g., 1,2,3,4
, 1,2,3,5
, ..., 1,2,3,10000
), you think "all I need to do is check my desired index against this first set of 1s and iterate" (looking for 2s, 3s, etc). Unfortunately, with 10k and N=5
, the first 4.162501e+14
rows start with "1". (This happens to be choose(10000-1,5-1)
, which is not a coincidence.) You then have to do this again and again, and the count just goes higher.
This is well out of 32-bit integer-space. N=6
escalates with 8.320840e+17
as the first set of 1s.
To perform "random access" on this space is rather insane, something where even (I suspect) native 64bit calculations will run out of space fairly quickly.
If you cannot reduce your data size
I believe your most practical route will be to use @alistaire's suggested code in his comment:
set.seed(42)
x <- seq(1, 10000)
N <- 4
M <- 10
out <- list()
while(length(out) < M) {
out <- c(out,
unique(replicate(M - length(out), sort(sample(x, N)), simplify = FALSE)))
}
str(out)
# List of 10
# $ : int [1:4] 2861 8302 9149 9370
# $ : int [1:4] 1347 5191 6418 7365
# $ : int [1:4] 4577 6570 7050 7189
# $ : int [1:4] 2555 4623 9347 9398
# $ : int [1:4] 1175 4750 5602 9783
# $ : int [1:4] 1387 9041 9464 9887
# $ : int [1:4] 825 3902 5142 9055
# $ : int [1:4] 4470 7375 8109 8360
# $ : int [1:4] 40 3882 6852 8327
# $ : int [1:4] 74 2077 6116 9065
Or a slight adaptation (can be ~30% faster with much larger N
,M
):
set.seed(42)
N <- 8
M <- 100
out <- list()
while (length(out) < M) {
out2 <- split(apply(matrix(sample(lenx, size = M*N, replace = TRUE),
nrow = M, ncol = N),
1, sort),
rep(1:M, each = N))
out <- c(out, out2[ !duplicated(out2) ])
}
If you know that M*N < length(x)
, then you can use replace=FALSE
instead and you should be guaranteed a single pass through the while
loop.
Even if you can significantly reduce your sample size
I've written a function that provides random access to combinations. However, the more I test it, the more I see that when it starts breaking down, it will do so without ensuring complete uniqueness of the indices and without a guarantee of erring when this happens. (So I'm not posting it. I can provide it offline if somebody is really curious. I did something similar with a lazy expand.grid
, but that was mathematically much simpler/tractable; and even then I haven't tested it with sets this large. Since you are looking for combinations, not permutations, I don't think it fits here.)
Bottom line: R may not be the place for this, unfortunately.