14

Suppose I have a list, called elements, each of which does or does not satisfy some boolean property p. I want to choose one of the elements that satisfies p by random with uniform distribution. I do not know ahead of time how many items satisfy this property p.

Will the following code do this?:

pickRandElement(elements, p)
     randElement = null
     count = 0
     foreach element in elements
          if (p(element))
               count = count + 1
               if (randInt(count) == 0)
                    randElement = element

     return randElement

(randInt(n) returns a random int k with 0 <= k < n.)

Paul Reiners
  • 8,576
  • 33
  • 117
  • 202
  • I would have thought "by random" and "with equal distribution" were mutually exclusive, what am I missing? – Binary Worrier Jun 08 '09 at 18:01
  • @Binary: he simply means that it has to be a fair random number. All elements that satisfy p must have an equal chance of being randomly picked every time. If this is true, then they will be drawn with equal distribution given enough time. – JoeCool Jun 08 '09 at 18:08
  • 1
    Random distributions can have all sorts of shapes that might be weighted towards one group of elements or another. Here Paul's asking about an even (or uniform) distribution where each element has the same probability of being selected. – Nate Kohl Jun 08 '09 at 18:10
  • @Nate, it's not immediately clear that each element has equal probability. My read was that the group of P and the group of Not-P had equal probability. If each element is uniformly likely, my answer is pretty far off.... ;-) – Bob Cross Jun 08 '09 at 18:18
  • No, the chance of being in p for any element is not necessarily 1/2. – Paul Reiners Jun 08 '09 at 18:23
  • 3
    There's no need to use floats. Personally, I'd just select a random integer between 0 and count - 1, then choose the element if the result is 0. This is more efficient and will protect you from rounding errors. – Peter Ruderman Jun 08 '09 at 18:23
  • That's true about not needing to use floats. That just occurred to me. Thanks. – Paul Reiners Jun 08 '09 at 18:33

5 Answers5

14

It works mathematically. Can be proven by induction.

Clearly works for n = 1 element satisfying p.

For n+1 elements, we will choose the element n+1 with probability 1/(n+1), so its probability is OK. But how does that effect the end probability of choosing one of the prior n elements?

Each of the prior n had a chance of being selected with probability 1/n, until we found element n+1. Now, after finding n+1, there is a 1/(n+1) chance that element n+1 is chosen, so there is a n/(n+1) chance that the previously chosen element remains chosen. Which means its final probability of being the chosen after n+1 finds is 1/n * (n/n+1) = 1/n+1 -- which is the probability we want for all n+1 elements for uniform distribution.

If it works for n = 1, and it works for n+1 given n, then it works for all n.

  • Induction has saved my butt too many times in computation proofs! – JoeCool Jun 08 '09 at 18:40
  • There's actually a simpler way to prove this. For n elements, we will choose the element n with probability 1/n. But what about the prior n-1 elements? Well, under induction we know that all these n-1 elements have the same probability. So the probability for each _must_ be 1/n, since 1/n is the only number that's 1 when multiplied with n. qed :) – FeepingCreature Dec 07 '09 at 17:28
6

Yes, I believe so.

The first time you encounter a matching element, you definitely pick it. The next time, you pick the new value with a probability of 1/2, so each of the two elements have an equal chance. The following time, you pick the new value with a probability of 1/3, leaving each of the other elements with a probability of 1/2 * 2/3 = 1/3 as well.

I'm trying to find a Wikipedia article about this strategy, but failing so far...

Note that more generally, you're just picking a random sample out of a sequence of unknown length. Your sequence happens to be generated by taking an initial sequence and filtering it, but the algorithm doesn't require that part at all.

I thought I'd got a LINQ operator in MoreLINQ to do this, but I can't find it in the repository... EDIT: Fortunately, it still exists from this answer:

public static T RandomElement<T>(this IEnumerable<T> source,
                                 Random rng)
{
    T current = default(T);
    int count = 0;
    foreach (T element in source)
    {
        count++;
        if (rng.Next(count) == 0)
        {
            current = element;
        }            
    }
    if (count == 0)
    {
        throw new InvalidOperationException("Sequence was empty");
    }
    return current;
}
Community
  • 1
  • 1
Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • 1
    Jon, as far as I see it, this algorithm will ALWAYS pick the first element that meets p, what I am missing? – tekBlues Jun 08 '09 at 18:10
  • @tekBlues: it keeps going after it's picked the first one. – AakashM Jun 08 '09 at 18:13
  • I'm pretty sure this algorithm works, if the random generator does its job properly. – Erich Kitzmueller Jun 08 '09 at 18:14
  • aakashm, he says he only wants to pick one element – tekBlues Jun 08 '09 at 18:22
  • 4
    Instead of "picks", think of the algorithm as "holding" elements. It holds the first one, then might drop it to hold the second one, etc. At the end, you're left holding some element, which is the one that you return. – Nate Kohl Jun 08 '09 at 18:23
  • @tekBlues: It picks the first one, then potentially replaces its pick with the next matching one, then potentially replaces it again etc. Note that it *doesn't* return the moment it's picked one - it returns when it's gone through the whole sequence. – Jon Skeet Jun 08 '09 at 18:33
3

In The Practice of Programming, pg. 70, (The Markov Chain Algorithm) there is a similar algorithm for that:

[...]
  nmatch = 0;
  for ( /* iterate list */ )
    if (rand() % ++nmatch == 0) /* prob = 1/nmatch */
      w = suf->word;
[...]

"Notice the algorithm for selecting one item at random when we don't know how many items there are. The variable nmatch counts the number of matches as the list is scanned. The expression

rand() % ++nmatch == 0

increments nmatch and is then true with probability 1/nmatch."

Nick Dandoulakis
  • 42,588
  • 16
  • 104
  • 136
1

decowboy has a nice proof that this works on TopCoder

Paul Reiners
  • 8,576
  • 33
  • 117
  • 202
0

For clarity's sake, I would try:

pickRandElement(elements, p)
     OrderedCollection coll = new OrderedCollection
     foreach element in elements
          if (p(element))
               coll.add(element)
     if (coll.size == 0) return null
     else return coll.get(randInt(coll.size))

To me, that makes it MUCH clearer what you're trying to do and is self-documenting. On top of that, it's simpler and more elegant, and it's now obvious that each will be picked with an even distribution.

JoeCool
  • 4,392
  • 11
  • 50
  • 66
  • That's the code we have now (and I admit it is more clear). I'm hoping for something more efficient. Creating a list and adding elements to it, I'm guessing, is somewhat inefficient and wasteful, if the proposed alternative will work. – Paul Reiners Jun 08 '09 at 18:19
  • Once you see what the proposed algorithm is doing, it's extremely elegant IMO. – Jon Skeet Jun 08 '09 at 18:34
  • Yes, you're right, if efficiency is the top priority. I would say that the more obvious solution that I provide is more readable though. – JoeCool Jun 08 '09 at 18:38