
I have a collection of values and an associated probability for each value (all positive, but far from equal). I would like to sample from this distribution many times. One technically correct way to do this is a linear search over the cumulative sums of the probabilities, which looks like this:

def sample(xs, ps, p):
    # ps holds the cumulative sums of the probabilities
    for i in range(len(ps)):
        if p <= ps[i]:
            return xs[i]

in Python notation, where ps is the list of cumulative sums. For each sample you would draw p = random.random(). This linear search takes O(n) per sample, which is very slow if xs and ps are very large. Accordingly, I want to do something along the lines of binary search. My first idea was to build a binary tree, and sample by traversing the tree using a random bit sequence (go left if a bit is zero, right if it is one). I would build the tree by splitting the list of cumulative sums of probabilities in a way that resembles quicksort: the left subtree holds the values <= 1/2, the right subtree holds the values > 1/2, and then I recurse, so that the LL part of the tree holds the values <= 1/4, the LR part the values > 1/4 but <= 1/2, etc.

I actually did implement this on a toy example, where the cumulative sum of the probabilities was [0.26,0.91,0.99,1]. (So 26% of the time you get the first value, 65% of the time you get the second, 8% the third, 1% the last).

I ran into two problems. One I have already fixed: under the splitting scheme above, certain nodes end up with only one child, such as the first step to the right in my example. This was easy enough to fix: I simply leave the tree as it is and apply the splitting rule again to what I already have.

But by doing it that way, all values that are located at a given level of the tree become equally likely. So the first value corresponds to L (1/2), the second value corresponds to RL (1/4), the third value corresponds to RRL (1/8), and the last value corresponds to RRR (1/8). These are very different from the probabilities that I wanted!
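A short simulation of the bit-sequence traversal makes the mismatch concrete (a sketch; indices 0–3 stand for the four values, and the tree shape is the one described above):

```python
import random

random.seed(0)

# Traverse the toy tree with random bits: 0 = go left, 1 = go right.
# Value 0 sits at L, value 1 at RL, value 2 at RRL, value 3 at RRR.
def traverse_by_bits():
    if random.random() < 0.5:   # bit 0 at the root -> L
        return 0
    if random.random() < 0.5:   # bit 0 at the second level -> RL
        return 1
    if random.random() < 0.5:   # bit 0 at the third level -> RRL
        return 2
    return 3                    # RRR

n = 100_000
counts = [0, 0, 0, 0]
for _ in range(n):
    counts[traverse_by_bits()] += 1

freqs = [c / n for c in counts]
# freqs comes out close to [0.5, 0.25, 0.125, 0.125],
# not the desired [0.26, 0.65, 0.08, 0.01]
```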

So my question is: how can I build an efficient data structure and traversal algorithm for a sampling procedure like the one above?

Ian
  • Why not simply store them in a list sorted by their cumulative probability, then use binary search to find the right item? – Lasse V. Karlsen Jul 23 '15 at 16:14
  • @LasseV.Karlsen You're right, that does work. It did come to mind, but for some reason I thought it wouldn't work. You can post that as an answer. For future reference, I also found this: https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/ which is apparently even faster, at constant time to sample. – Ian Jul 23 '15 at 16:30
  • The question has been closed as a duplicate, no more answers can be provided, but I doubt that would be necessary anyway, given the answers I can see on the linked question. – Lasse V. Karlsen Jul 23 '15 at 16:32
  • I didn't see the duplicate mark when I was writing the comment. Cool stuff, though, I just learned a useful trick! – Ian Jul 23 '15 at 16:43
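For reference, the suggestion from the comments can be sketched with the standard library's `bisect` module: store the cumulative sums and binary-search for the drawn p, giving O(log n) per sample (the value list `xs` below is a made-up placeholder):

```python
import bisect
import random

def sample(xs, cum_ps, p=None):
    """Return the value whose cumulative-probability interval contains p."""
    if p is None:
        p = random.random()
    # First index i with cum_ps[i] >= p, matching the linear search above.
    i = bisect.bisect_left(cum_ps, p)
    return xs[i]

xs = ['a', 'b', 'c', 'd']
cum_ps = [0.26, 0.91, 0.99, 1.0]  # the toy example from the question
sample(xs, cum_ps, 0.5)  # → 'b', since 0.26 < 0.5 <= 0.91
```

The alias method linked in the comments goes further: after O(n) preprocessing it samples in O(1) rather than O(log n).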

0 Answers