
I need to select k elements uniformly at random from the range 0 to n-1. n can be up to 10^9, and k can range from 1 to n-1. I can do this in O(n) time by shuffling an array containing the values 0 to n-1 and taking its first k elements. But when k is small this method is both time and memory inefficient. Is there an O(k) solution to this problem?

Note : Selected k numbers must be distinct.

I'm thinking of a solution. I can think of two approaches. Let R be the set to be returned.

  1. Select a random value in the range and add it to R if it is not already there. Repeat until |R| = k. This process takes an expected sum(n/i) for n+1-k <= i <= n time and O(k) space.
  2. Insert the values 0 to n-1 into an array, shuffle it, and take the first k elements. This process takes O(n+k) time and space.

So for a given k I can select the preferable method in O(k) time.
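Approach 1 might be sketched in Python like this (a rough sketch with names of my own choosing; `random.randrange` supplies the uniform draw):

```python
import random

def sample_by_rejection(n, k):
    """Approach 1: draw uniform values in [0, n) and reject duplicates.

    The set guarantees distinctness; the expected number of draws is
    close to k when k is much smaller than n. Uses O(k) space.
    """
    selected = set()
    while len(selected) < k:
        selected.add(random.randrange(n))  # duplicates are simply ignored
    return list(selected)
```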

crysoberil
  • Look here, it might help: http://stackoverflow.com/questions/3722430/most-efficient-way-of-randomly-choosing-a-set-of-distinct-integers – MLavrentyev May 31 '15 at 15:30

1 Answer


Modified Fisher-Yates algorithm

The shuffle solution can be improved, since you only have to shuffle the first k elements of the array. But that's still O(n) because the naïve shuffle implementation requires an array of size n, which needs to be initialized to the n values from 0 to n-1.

Initialize value[n] to {0..n-1}
For i from 0 to k-1:
  swap(value[i], value[random_in_range(i, n)])
Result is value[0..k-1]
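As a rough Python rendering of that pseudocode (still O(n) time and space, since the whole array must be materialized):

```python
import random

def partial_fisher_yates(n, k):
    """Shuffle only the first k slots of [0..n-1]; slots beyond k never
    enter the result, so the loop can stop after k swaps."""
    value = list(range(n))            # O(n) initialization dominates
    for i in range(k):
        j = random.randrange(i, n)    # random_in_range(i, n)
        value[i], value[j] = value[j], value[i]
    return value[:k]
```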

To improve on that, we can use a kind of virtual array, consisting of two parts:

  1. value: An array of the first k elements, which will be the resulting selection. This is initialized to {0..k-1}

  2. rest: A sparse data structure (a hashtable, for example) with a capacity of k entries, containing all the remaining entries of the virtual array which are not simply their own index. Initially empty.

Now we can define the function which swaps elements i and j from the value array (Note: i<k, as is guaranteed by the above algorithm):

# To swap elements i and j
If j < k:
  # Both elements to be swapped are in the selection
  tmp = value[i]; value[i] = value[j]; value[j] = tmp
Else If j in rest:
  # Element j has been swapped before
  tmp = value[i]; value[i] = rest[j]; rest[j] = tmp
Else:
  # The value at j is still j, we now add it to the virtual array
  rest[j] = value[i]; value[i] = j

That uses O(k) space and time, for any value of k ≤ n.
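Putting the two parts together, a Python sketch of the virtual-array algorithm (function and variable names beyond `value` and `rest` are mine; `random.randrange` stands in for `random_in_range`):

```python
import random

def sample_sparse_fisher_yates(n, k):
    """O(k)-time/space selection of k distinct values from [0, n).

    value holds the first k slots of the virtual array (the selection);
    rest records only those other slots that no longer equal their index.
    """
    value = list(range(k))   # initialized to {0..k-1}
    rest = {}                # sparse remainder of the virtual array
    for i in range(k):
        j = random.randrange(i, n)
        if j < k:
            # Both elements to be swapped are in the selection
            value[i], value[j] = value[j], value[i]
        elif j in rest:
            # Element j has been swapped before
            value[i], rest[j] = rest[j], value[i]
        else:
            # The value at j is still j; record it in the virtual array
            rest[j] = value[i]
            value[i] = j
    return value
```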

Three-algorithm strategy

A simpler solution using O(k) memory is to just keep a hashtable of all the selected values, and generate values until the hashtable contains k values, rejecting duplicates.

For small k, the probability of a randomly-selected element being a duplicate is insignificant, and the naïve hashtable is certainly the simplest solution. On the other hand, if k is a significant fraction of n, the hash table is mostly wasted space, since a simple bit vector of size n would be sufficient to record which values have been seen. And for very large k, the rejection algorithm will take too much time as the sample fills up, and the full vector needed for the shuffle is not much more space than the vector that would be used to hold the sample.

As a result, the pragmatic solution is probably to use whichever of the three solutions uses less space and time: For values of k sufficiently large that the n-bit bit vector is smaller than a hash table with k entries, but not so large that the probability of rejection is significant (say, n/64 ≤ k ≤ n/4), use the bit vector. For smaller values of k, use the simple hashtable algorithm, and for values of k close to n, use a Fisher-Yates shuffle on the full n-element vector (but limited to k steps).

Since we choose the O(n) strategies only in cases where k>cn for some constant c, the composite algorithm is still O(k) time and space because with that constraint, n is in O(k).
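The dispatch itself is trivial; a sketch using the thresholds suggested above (which are pragmatic guesses, not tuned constants):

```python
def choose_strategy(n, k):
    """Pick one of the three strategies by the rough thresholds above."""
    if k < n // 64:
        return "hashtable-rejection"    # duplicates rare; O(k) space
    elif k <= n // 4:
        return "bitvector-rejection"    # n-bit vector beats a k-entry hash table
    else:
        return "fisher-yates-shuffle"   # full n-element vector, limited to k steps
```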

rici
  • Modified Fisher-Yates algorithm seems perfect. Thanks. – crysoberil May 31 '15 at 15:47
  • @crysoberil: There is another approach using the famous (but slightly controversial) uninitialized array trick, which is O(k) time and O(n) space, but involves accessing the uninitialized values of the n-element array. Unfortunately, that doesn't work in either Java or standard-compliant C++; in the first case, because it is not possible to fail to initialize an array, and in the second case because the standard insists that uninitialized values are unstable (!) and hence reading them is UB. So I didn't add it to the list. – rici May 31 '15 at 15:51
  • This is better: http://repository.cmu.edu/cgi/viewcontent.cgi?article=3483&context=compsci (Saxe algorithm) – Dave Jun 01 '15 at 02:10
  • @DaveGalvin: That's an excellent algorithm and probably worthy of being an answer, but it's not 100% clear to me that it is what is being asked for. The Saxe algorithm generates a sorted sequence. To produce a random permutation of the sorted sequence, you'd need to shuffle afterwards, and for all the simplicity of the Saxe algorithm, for k close to n, shuffling iota(n) is easier. (But if what is being looked for is an order-independent sample, then Saxe is probably the way to go.) – rici Jun 01 '15 at 04:03
  • If anyone is looking for the paper referred to in the above comments (whose link is now useless), I believe it is *Generating Random Numbers in Sorted Order* by Jon Louis Bentley and James B. Saxe. A currently working link is https://kilthub.cmu.edu/articles/Generating_sorted_lists_of_random_numbers/6605957/1. The ACM, which originally published the article forty years ago, will sell it to you for an exorbitant sum, which I regard as completely unconscionable. But those are the times we are living through. – rici Aug 17 '20 at 20:08