I have a collection of values and associated probabilities for each value (all positive, but far from being equal). I would like to sample from this distribution many times. One technically correct way to do this is linear search, which would be like this:
def sample(xs,ps,p):
for i in xrange(len(ps)):
if ps[i] <= p: return xs[i]
in Python notation. Then for sampling you would take p=random.random(). This linear search approach is very slow if xs and ps are very large. Accordingly, I want to do something along the lines of binary search. My first idea was to build a binary tree, and sample by traversing the tree using a random bit sequence (go left if a bit is zero, right if it is one). I would build the tree by splitting the list of cumulative sums of probabilities in a way which resembles quicksort: the left part of the tree is the values <= 1/2, the right part of the tree is the values >1/2, and then I recurse, so that the LL part of the tree is the values <= 1/4, the LR part is the values >1/4 but <=1/2, etc.
I actually did implement this on a toy example, where the cumulative sum of the probabilities was [0.26,0.91,0.99,1]. (So 26% of the time you get the first value, 65% of the time you get the second, 8% the third, 1% the last).
I wind up with two problems. One problem I already fixed: there are certain nodes that only have one child based on the sorting mechanism above, such as the first movement to the right in my example. This was easy enough to fix: I just don't change the tree and appropriately update the splitting mechanism, applying it to what I already have.
But by doing it that way, all values that are located at a given level of the tree become equally likely. So the first value corresponds to L (1/2), the second value corresponds to RL (1/4), the third value corresponds to RRL (1/8), and the last value corresponds to RRR (1/8). These are very different from the probabilities that I wanted!
So my question is: how can I build an efficient data structure and traversal algorithm for a sampling procedure like the one above?