4

I understand similar questions have been asked like

Choose m elements randomly from a vector containing n elements

Pick N items at random from sequence of unknown length

But the more I look the more I got confused.

Uniformly and randomly choose M elements from N elements

So I need to pick M elements from N ones. And also I need to make the probability of being selected to be uniformly distributed for every element: M/N

My intuition was

  1. Randomly choose an element
  2. Get it out
  3. Repeat the process on the rest of elements

I guess this solution is wrong? The probabilities for the M selected elements are 1/N, 1/(N-1), ..., 1/(N-M), not M/N, am I correct?


A possible correct solution is that

  1. Just do a shuffle on the whole N elements
  2. Pick the top M ones.

But I can't figure out the probability of being selected for each element.

Can anyone demonstrate the probability of this solution is M/N indeed?


Also of course we can use Reservoir sampling, and its probability is M / N.

Community
  • 1
  • 1
Jackson Tale
  • 25,428
  • 34
  • 149
  • 271

3 Answers3

3

I think your confusion there may be that you're not discriminating between choosing a sequence, from choosing a set.

In your first procedure, just because you only have a 1/N chance of choosing a particular element in the first round, doesn't mean you won't choose it in a subsequent round. The element has a 1/N chance of being the first element in the result... but it has a M/N chance of being chosen during some round. So that works. Take M=2, N=4: The chance of an element being picked is 1/4 + (3/4)*(1/3) = 2/4.

As for your second procedure, following the shuffle, each element's position within the array is uniformly distributed, so there's a M/N chance that its position is equal to or less than M (and is hence chosen). So that works too.

Sneftel
  • 40,271
  • 12
  • 71
  • 104
  • My another confusion is the term `uniformly`. Do I expect each element has `1/N` chance to be selected or `M/N`? – Jackson Tale Apr 03 '15 at 13:55
  • 1
    Element `A` has a `1/N` chance of being selected during round `i`, and a `M/N` *overall* chance of being selected during *any* round. – Sneftel Apr 03 '15 at 13:57
  • ah, ok. You mean `A` has `1/N` chance to be 1st, for the 2nd, it has `1/(N-1)` * `(N-1)/N` = `1/N` too, because `(N-1)/N` is the chance it is not selected for the first. So on so forth, `A` has the chance of `M/N` to be any of the selection. correct? – Jackson Tale Apr 03 '15 at 14:00
  • Yep. As the number of rounds grows, the fractions get longer, but the math still works out. – Sneftel Apr 03 '15 at 14:09
3

The probabilities for elements being picked:

First: 1/N;
Second: 1/(N-1) * (1 - 1/N) = 1/N
...

where (1 - 1/N) is the probability of the second item not being picked at the first step, which is the condition for it to be picked at the second step. Here 1/N is the probability of some element being selected at the first step, and (1 - 1/N) - probability of element selected at the second step being available for selection at the second step.

Third: 1/(N-2) * (1 - 1/N - 1/N) = 1/N

where (1 - 1/N - 1/N) is the probability of an element being available for selection at the third step.

And so on.

The point here is that for any element the probability of selection at a step is 1/N.

Igor
  • 15,833
  • 1
  • 27
  • 32
2

Whereas the shuffle method works, there is another way that doesn't require modifying the original array. In fact, you don't even need to know how many total items there are. You can pick M items at random from a stream, and the only data you have to save is an array of those M items.

The basic idea is an extension of the algorithm for picking a single item from a stream of unknown length. Again, you don't know how many items are in the stream. So you always have a selected item, but you might change the selected item at any time.

You start by having no selected item, and a count of 0. When the first item comes in, you increase the count. Then you pick a random number in the range 0 to count-1. If the value is 0, then you make the current item the selected item. The first random number picked has to be 0 because your range is 0 to 0.

As you read each item, you increase the count, pick the random number, and make the current item the selected item if the random number picked is 0. It looks like this:

selected_item = none
count = 0
for each item
    count = count + 1
    if (random(count) == 0)
        selected_item = current_item

This works because the chance of selecting the current item decreases as each item is read. That is, the first item is selected with probability 1/1. When the second item comes in, there is a 1/2 chance that you will select it to replace the first item. When you receive the third item, there is a 1/3 chance that you will replace the currently selected item with the new item.

When you reach the end of the stream, you will given every item an equal chance at having been selected.

You can extend that to multiple items pretty easily. You start by selecting the first M items that come in, and placing them in your selected items array. Then, whenever a new item comes in you pick a random number between 0 and count-1, and if that number is less than M, then you randomly replace one of the selected items with the new item. It looks something like this:

selected_items = array of M items
count = 0
for each item
    if (count < M)
        selected_items[count] = current_item
    else
        if (random(count) < M)
            // pick a random number from 0 to M-1
            // and replace that item with the new item
            ix = random(M)
            selected_items[ix] = current_item

I wrote more detail about this in my blog, Random selection.

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351