21

I want to choose k elements uniformly at random out of a possible n without choosing the same number twice. There are two trivial approaches to this.

  1. Make a list of all n possibilities. Shuffle them (you don't need to shuffle all n numbers just k of them by performing the first k steps of Fisher Yates). Choose the first k. This approach takes O(k) time (assuming allocating an array of size n takes O(1) time) and O(n) space. This is a problem if k is very small relative to n.
  2. Store a set of seen elements. Choose a number at random from [0, n-1]. While the element is in the set then choose a new number. This approach takes O(k) space. The run-time is a little more complicated to analyze. If k = theta(n) then the run-time is O(k*lg(k))=O(n*lg(n)) because it is the coupon collector's problem. If k is small relative to n then it takes slightly more than O(k) because of the probability (albeit low) of choosing the same number twice. This is better than the above solution in terms of space but worse in terms of run-time.

My question:

is there an O(k) time, O(k) space algorithm for all k and n?

Benjy Kessler
  • 7,356
  • 6
  • 41
  • 69
  • I don't see how you can avoid `O(n)` space if you include the space occupied by the input array. You have `n` elements, and the space needed to store it will grow with `n`. – Asad Saeeduddin Apr 25 '15 at 18:03
  • You don't need to store the input array. Just the number n and the k element selected. – Benjy Kessler Apr 25 '15 at 18:04
  • So, given numbers n and k, how can your subroutine choose k elements uniformly at random from a set of n elements that it isn't given? – Asad Saeeduddin Apr 25 '15 at 18:10
  • Easy, the function rand() chooses a number in [0, RAND_MAX] without storing an array of size RAND_MAX. – Benjy Kessler Apr 25 '15 at 18:11
  • I guess I'm not understanding how that helps you choose elements from a set. Choose 5 elements from the set {1,2,3,25,107,94}. The minimum space required by a program that solves this problem grows with the size of this arbitrary set I've given you. If I grow the set, you need more space, no? I mean, how could you conceivably solve the problem without storing the set itself? – Asad Saeeduddin Apr 25 '15 at 18:17
  • 1
    I don't want an algorithm that works with any set. Just the set of numbers [0, n]. – Benjy Kessler Apr 25 '15 at 18:21
  • Oh, duh. That makes a lot more sense. – Asad Saeeduddin Apr 25 '15 at 18:24
  • What is this problem called? Does anyone know? – Tyler Cloutier May 26 '18 at 00:44
  • Possible duplicate of [Algorithm to select a single, random combination of values?](https://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-random-combination-of-values) – user202729 May 29 '18 at 14:35

3 Answers3

16

With an O(1) hash table, the partial Fisher-Yates method can be made to run in O(k) time and space. The trick is simply to store only the changed elements of the array in the hash table.

Here's a simple example in Java:

public static int[] getRandomSelection (int k, int n, Random rng) {
    if (k > n) throw new IllegalArgumentException(
        "Cannot choose " + k + " elements out of " + n + "."
    );

    HashMap<Integer, Integer> hash = new HashMap<Integer, Integer>(2*k);
    int[] output = new int[k];

    for (int i = 0; i < k; i++) {
        int j = i + rng.nextInt(n - i);
        output[i] = (hash.containsKey(j) ? hash.remove(j) : j);
        if (j > i) hash.put(j, (hash.containsKey(i) ? hash.remove(i) : i));
    }
    return output;
}

This code allocates a HashMap of 2×k buckets to store the modified elements (which should be enough to ensure that the hash table is never rehashed), and just runs a partial Fisher-Yates shuffle on it.

Here's a quick test on Ideone; it picks two elements out of three 30,000 times, and counts the number of times each pair of elements gets chosen. For an unbiased shuffle, each ordered pair should appear approximately 5,000 (&pm;100 or so) times, except for the impossible cases where both elements would be equal.

Community
  • 1
  • 1
Ilmari Karonen
  • 49,047
  • 9
  • 93
  • 153
  • Thanks, can you give a bit more detail. Do you know if there is a standard implementation of this in Java? – Benjy Kessler Apr 25 '15 at 17:51
  • See edit above. I'm not aware of a standard implementation, but it's easy enough to write one. – Ilmari Karonen Apr 25 '15 at 18:44
  • Are you sure each element is selected uniformly at random? Won't this be biased towards smaller numbers? – Asad Saeeduddin Apr 25 '15 at 18:52
  • @Asad: It works exactly like the Fisher-Yates shuffle (except with a different way of storing the data being shuffled), so the same proof of correctness applies (assuming the implementation isn't buggy, that is). Anyway, feel free to run some tests yourself to check for bias. The three-element shuffle test I did [here](http://ideone.com/kSAmbo) is generally pretty good at catching most biased shuffles, but there are other tests you could do (e.g. sampling one or two elements from a large range) if you suspect some bias that it might not detect. – Ilmari Karonen Apr 25 '15 at 18:59
  • I've tried with all kinds of bucket sizes and k, n combinations; definitely uniform. – Asad Saeeduddin Apr 25 '15 at 19:28
  • I just want to commend you on a pretty O(*k*) algorithm. – Sean Allred Apr 25 '15 at 21:39
  • When *k* gets large enough for the hash table to be bigger than an array of length *n*, the Fisher-Yates algorithm should always win (less space, no hashing, no indirections, no collisions, no reselections). – Raymond Hettinger Apr 25 '15 at 22:14
  • Beautiful solution! works for taking k items of a array/list too – The Scrum Meister May 12 '20 at 16:10
2

Your second approach does not take Theta(k log k) time on average, it takes about n/(n-k+1) + n/(n-k+2) + ... + n/n operations, which is less than k(n/(n-k)) since you have k terms which are each smaller than n/(n-k). For k <= n/2, it takes under 2*k operations on average. For k>n/2, you can choose a random subset of size n-k, and take the complement. So, this is already an O(k) average time and space algorithm.

Douglas Zare
  • 3,296
  • 1
  • 14
  • 21
  • Very interesting observation. The question mentions _O(k log k)_ though, not _Θ(k log k)_, and makes no assumptions about _k_ or the complement-optimization for _k > n/2_. Also, it's worth mentioning that you are assuming an _O(1)_ set membership test (this _might_ have been the OP's assumption as well, but it's hard to tell). – kyrill May 31 '20 at 04:38
0

What you could use is the following algorithm (using javascript instead of pseudocode):

var k = 3;
var n = [1,2,3,4,5,6];

// O(k) iterations
for(var i = 0, tmp; i < k; ++i) {

    // Random index O(1)
    var index = Math.floor(Math.random() * (n.length - i));

    // Output O(1)
    console.log(n[index]);

    // Swap and lookup O(1)
    tmp = n[index];
    n[index] = n[n.length - i - 1];
    n[n.length - i - 1] = tmp;
}

In short, you swap the selected value with the last item and in the next iteration sample from the reduced subset. This assumes your original set is wholly unique.

The storage is O(n), if you wish to retrieve the numbers as a set, just refer to the last k entries from n.

Gerard
  • 831
  • 6
  • 15