
Imagine a standard permute function that takes an integer N and returns a vector of the first N natural numbers in a random permutation. If you only need k (<= N) of them, but don't know k beforehand, do you still have to perform an O(N) generation of the full permutation? Is there a better algorithm than:

```python
for x in permute(N):
    if f(x):
        break
```

I'm imagining an API such as:

```
p = permuter(N)
for x = p.next():
    if f(x):
        break
```

where the initialization is O(1) (including memory allocation).

Dijkstra
  • So the question is "how to choose k random numbers from 1-N without repetition" ? – Joni Jun 01 '17 at 15:48
  • In k steps of O(1) each. – Dijkstra Jun 01 '17 at 15:52
  • Is amortized O(1) acceptable? – Joni Jun 01 '17 at 15:54
  • If that's the best possible :) – Dijkstra Jun 01 '17 at 15:59
  • [Generating m distinct random numbers in the range \[0..n-1\]](https://stackoverflow.com/questions/6947612/generating-m-distinct-random-numbers-in-the-range-0-n-1). Also [Unique (non-repeating) random numbers in O(1)?](https://stackoverflow.com/questions/196017/unique-non-repeating-random-numbers-in-o1) – Bernhard Barker Jun 01 '17 at 16:26
  • You can just do `k` iterations of Fisher-Yates shuffle for `O(k)` with `O(n)` preprocessing (filling the initial array), but given that you have to output `k` elements as the result, I don't think you can do better than that. – beaker Jun 01 '17 at 16:35
  • If k << n then it is much more performant to just generate a random number in the possible range and add it to a hash set, on collision you just generate a new random number. – maraca Jun 01 '17 at 18:28

1 Answer


This question is often viewed as a choice between two competing algorithms:

  • Strategy FY: A variation on the Fisher-Yates shuffle where one shuffle step is performed for each desired number, and

  • Strategy HT: Keep all generated numbers in a hash table. At each step, random numbers are produced until a number which is not in the hash table is found.

The choice is made depending on the relationship between k and N: if k is sufficiently large, strategy FY is used; otherwise, strategy HT. The argument is that if k is small relative to N, maintaining an array of size N wastes space and incurs a large initialization cost. On the other hand, as k approaches N, more and more random numbers need to be discarded, and towards the end producing new values becomes extremely slow.
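For concreteness, the two strategies might be sketched as follows (the function names and the fixed-k signatures are mine; in the question, k is not known in advance):

```python
import random

def sample_fy(N, k):
    """Strategy FY: k steps of a Fisher-Yates shuffle over a full array.
    Initialization is O(N); each sample afterwards is O(1)."""
    a = list(range(1, N + 1))
    out = []
    for i in range(k):
        j = random.randint(i, N - 1)   # index into the unsampled suffix
        a[i], a[j] = a[j], a[i]        # swap the chosen value into place
        out.append(a[i])
    return out

def sample_ht(N, k):
    """Strategy HT: rejection sampling into a hash set (assumes k <= N).
    No initialization cost, but retries pile up as k approaches N."""
    seen = set()
    out = []
    while len(out) < k:
        x = random.randint(1, N)
        if x not in seen:              # collision: draw again
            seen.add(x)
            out.append(x)
    return out
```

Both return k distinct values from [1, N]; the difference is purely in where the cost lands.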

Of course, you might not know in advance the number of samples which will be requested. In that case, you might pessimistically opt for FY, or optimistically opt for HT, and hope for the best.

In fact, there is no real need for trade-off, because the FY algorithm can be implemented efficiently with a hash table. There is no need to initialize an array of N integers. Instead, the hash-table is used to store only the elements of the array whose values do not correspond with their indices.

(The following description uses 1-based indexing; that seemed to be what the question was looking for. Hopefully it is not full of off-by-one errors. So it generates numbers in the range [1, N]. From here on, I use k for the number of samples which have been requested to date, rather than the number which will eventually be requested.)

At each point in the incremental FY algorithm a single index r is chosen at random from the range [k, N]. Then the values at indices k and r are swapped, after which k is incremented for the next iteration.

As an efficiency point, note that we don't really need to do the swap: we simply yield the value at r and then set the value at r to be the value at k. We'll never again look at the value at index k so there is no point updating it.
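On a concrete (non-virtual) array, one such step might be sketched like this (names are mine; 1-based values stored at 0-based positions):

```python
import random

def fy_step(v, k, N):
    """One incremental Fisher-Yates step with the half-swap optimization:
    position k is consumed and never read again, so only position r
    needs to be written back."""
    r = random.randint(k, N)      # random index in [k, N]
    sample = v[r - 1]             # yield the value at r ...
    v[r - 1] = v[k - 1]           # ... and move the value at k into r
    return sample
```

Calling `fy_step` for k = 1, 2, ..., N yields a full random permutation of v.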

Initially, we simulate the array with a hash table. To look up the value at index i in the (virtual) array, we see if i is present in the hash table: if so, that's the value at index i. Otherwise the value at index i is i itself. We start with an empty hash table (which saves initialization costs), which represents an array whose value at every index is the index itself.
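The virtual-array lookup can be written in one line (a sketch; `diffs` is the hash table):

```python
def virtual_get(diffs, i):
    """Index i of the virtual array [1..N]: any index absent from the
    hash table still holds its own value."""
    return diffs.get(i, i)

# An empty table represents the identity array:
diffs = {}
assert virtual_get(diffs, 7) == 7
diffs[7] = 3          # record a displaced value
assert virtual_get(diffs, 7) == 3
```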

To do the FY iteration, for each sample index k we generate a random index r as above, yield the value at that index, and then set the value at index r to the value at index k. That's exactly the procedure described above for FY, except for the way we look up values.

This requires exactly two hash-table lookups, one insertion (at an already looked-up index, which in theory can be done more quickly), and one random number generation for each iteration. That's one more lookup than strategy HT's best case, but we have a bit of a saving because we never need to loop to produce a value. (There is another small potential saving when we rehash because we can drop any keys smaller than the current value of k.)

As the algorithm proceeds, the hash table will grow; a standard exponential rehashing strategy is used. At some point, the hash table will reach the size of a vector of N-k integers. (Because of hash table overhead, this point will be reached at a value of k much less than N, but even if there were no overhead this threshold would be reached at N/2.) At that point, instead of rehashing, the hash is used to create the tail of the now non-virtual array, a procedure which takes less time than a rehash and never needs to be repeated; remaining samples will be selected using the standard incremental FY algorithm.

This solution is slightly slower than FY if k eventually reaches the threshold point, and slightly slower than HT if k never grows large enough for random numbers to be rejected. But it is not much slower in either case, and it never suffers a pathological slowdown when k has an awkward value.

In case that was not clear, here is a rough Python implementation:

```python
from random import randint

def sampler(N):
    k = 1
    # First phase: simulate the array with a hash table
    diffs = {}

    # Only do this until the hash table is smallish (see note below)
    while k < N // 4:
        r = randint(k, N)
        yield diffs.get(r, r)
        diffs[r] = diffs.get(k, k)
        k += 1
    # Second phase: create the vector, ignoring keys less than k
    vbase = k
    v = list(range(vbase, N + 1))
    for i, s in diffs.items():
        if i >= vbase:
            v[i - vbase] = s
    del diffs
    # Now we can generate samples until we hit N
    while k <= N:
        r = randint(k, N)
        rv = v[r - vbase]
        v[r - vbase] = v[k - vbase]
        yield rv
        k += 1
```

Note: N // 4 is probably pessimistic; computing the correct value would require knowing too much about hash-table implementation. If I really cared about speed, I'd write my own hash table implementation in a compiled language, and then I'd know :)
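A short driver (restating the sampler so the snippet runs on its own) shows the lazy, early-exit usage the question asked for:

```python
from itertools import islice
from random import randint

def sampler(N):
    # Same generator as above, restated here for a self-contained demo.
    k = 1
    diffs = {}
    while k < N // 4:
        r = randint(k, N)
        yield diffs.get(r, r)
        diffs[r] = diffs.get(k, k)
        k += 1
    vbase = k
    v = list(range(vbase, N + 1))
    for i, s in diffs.items():
        if i >= vbase:
            v[i - vbase] = s
    while k <= N:
        r = randint(k, N)
        rv = v[r - vbase]
        v[r - vbase] = v[k - vbase]
        yield rv
        k += 1

# Stop after only 5 of 1000 samples; the remaining 995 are never computed.
first_five = list(islice(sampler(1000), 5))
assert len(set(first_five)) == 5
assert all(1 <= x <= 1000 for x in first_five)

# Exhausting the generator yields a full permutation of [1, N].
assert sorted(sampler(50)) == list(range(1, 51))
```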

rici