Here's a simple implementation of Fisher-Yates that takes advantage of the fact that at first the unsampled values form long sequences, so can be compactly encoded. It stores the differences using run-length encoding, only expanding during sampling. Some ideas for efficiency improvements follow:
onDemand <- function(n) {
# Store the remainder of the deck as differences, starting from
# zero, i.e. initially encode deck <- c(0,1,2, ..., n) as
# rle(diff(deck))
# To do a sample, choose an element after the 0 at random,
# swap it with the last entry, and return it.
remaining <- structure(list(lengths = n, values = 1),
class = "rle")
encode <- function(seq) {
rle(diff(c(0, seq)))
}
decode <- function(enc) {
cumsum(inverse.rle(enc))
}
function(m = 1) {
result <- numeric(m)
remaining <- decode(remaining)
nleft <- length(remaining)
for (i in seq_len(m)) {
if (!nleft)
result[i] <- NA
else {
swap <- sample(nleft, 1)
result[i] <- remaining[swap]
remaining[swap] <- remaining[nleft]
nleft <- nleft - 1
}
}
length(remaining) <- nleft
remaining <<- encode(remaining)
result
}
}
Some notes:
If n
is huge (e.g. a million), the RLE will be pretty small for the first few hundred or thousand samples, but the expanded vector will be big. This could be avoided by writing methods to index directly into the encoded vector without expanding it. It's fairly easy to write methods to extract values, but replacing them is messy, so I didn't bother.
After a lot of samples have been taken, it would probably be more efficient just to store the remaining values without encoding them.
Here is a typical run:
> nextval <- onDemand(1000000)
> nextval(100)
[1] 370610 973737 503494 916342 932407 222542 152900 783549
[9] 249178 138066 626285 611692 805466 406702 630680 11850
[17] 29150 19859 516327 513589 900781 923846 620311 886004
[25] 293838 362498 451400 61116 272106 990026 78768 501649
[33] 442166 867620 533579 679138 350663 840548 820948 586161
[41] 5540 399160 583113 298526 382205 920895 25499 450975
[49] 17561 18395 679743 719144 25850 421673 974477 495473
[57] 681210 773482 175615 71834 163729 441219 992938 722924
[65] 374084 769210 759145 923529 11192 752293 953230 96349
[73] 988377 672156 658830 394943 715904 762062 403089 848479
[81] 962312 303000 680417 760521 515682 237871 823706 119516
[89] 978289 985208 437114 620376 940255 399345 221688 59345
[97] 29765 400142 142375 911747
> environment(nextval)$remaining
Run Length Encoding
lengths: int [1:301] 5539 1 1 5650 1 1 656 1 1 5709 ...
values : num [1:301] 1 994421 -994419 1 988741 -988739 1 988136 -988134 1 ...