4

Got this question from algorithms design manual by Steven Skiena.

It is required to select k (value given) numbers to form a subset S' from a given set S having n numbers, such that selection probability for each number is equal (k/n). n is unknown (i was thinking of taking S as a link-list for this). also, we can have only pass through the set S.

S..K
  • 1,944
  • 2
  • 14
  • 16

3 Answers3

6

When n is unknown you'd rather need an on-line algorithm for so-called Reservoir sampling.

The good explanation & proof sketches are provided here http://propersubset.com/2010/04/choosing-random-elements.html

I mean this algorithm implemented in Python (taken from the link above)

import random
def random_subset( iterator, K ):
    result = []
    N = 0

    for item in iterator:
        N += 1
        if len( result ) < K:
            result.append( item )
        else:
            s = int(random.random() * N)
            if s < K:
                result[ s ] = item

    return result
Alec
  • 1,486
  • 2
  • 14
  • 30
2

Something like this

for elem in S
  if random() < (k - S'.size)/S.size // This is float division
    S'.add(elem)

The first element is chosen with probability k/n, the second one with (n-k)/n * k/(n-1) + k/n * (k-1)/(n-1) which reduces to k/n, etc.

Paweł Obrok
  • 22,568
  • 8
  • 74
  • 70
  • 1
    According to the formula you've provided you should take into account the index i of elem (starting with 1), hence having random() < (k - S'.size)/(S.size - i) // Float division – Alec May 04 '12 at 13:22
  • 1
    This answer doesn't work if S.size is unknown, which is a condition given in the original question – xdavidliu Sep 05 '16 at 22:42
  • That's right. I can't delete, since it's accepted, but it seems the other answer is correct. – Paweł Obrok Sep 05 '16 at 23:14
1

You should choose an algorithm that can truly simulate the real activity "Randomly choose k numbers from n numbers".Your algorithm should has two properties

(1) It must return k numbers at end.

(2) It must truly simulate that properties of target activity : each number is selected with probability k/n.

Oboroks answer is wrong because it hasnt first property.

for i = 0 to n
randomly choose an integer number between [1,n-i+1]
if [randomValue <= (k - S'.size)/(S.size - i + 1)] then
    S'.add(S[i])

With above selecting plan,each number is selected with probability k/n.You can see that by prove below equation :

https://www.facebook.com/photo.php?fbid=677984275648191&l=7cafe5d468