
I'm currently working on a problem that requires the random selection of an element from a set. Each of the elements has a weight (selection probability) associated with it.

My problem is that for sets with a small number of elements, say 5-10, the running time of my solution is acceptable; however, as the number of elements increases, say to 1K or 10K, the running time becomes unacceptable.

My current strategy is:

  1. Select a random value X in the range [0,1)
  2. Iterate over the elements, summing their weights, until the running sum exceeds X
  3. Return the element that caused the sum to exceed X

For large sets and a large number of selections this process begins to exhibit quadratic behavior. In short: is there a faster way, perhaps a better algorithm?
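In code, the strategy looks roughly like this (a minimal sketch, assuming the weights are normalized to sum to 1; the function name is illustrative):

#include <random>
#include <vector>

// Linear-scan weighted selection: O(N) per draw.
// Assumes the weights are normalized to sum to 1.
int select_weighted(const std::vector<double>& weights, std::mt19937& rng)
{
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  double x = dist(rng);   // X in [0,1)
  double sum = 0.0;
  for (std::size_t i = 0; i < weights.size(); ++i) {
    sum += weights[i];
    if (sum > x)          // running sum exceeded X: pick this element
      return static_cast<int>(i);
  }
  return static_cast<int>(weights.size()) - 1;  // guard against rounding
}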

  • You should remove the C++ tag, as this is a general algorithm question applicable to any language. – dkamins May 19 '11 at 00:44
  • That's true, but I'd prefer solutions in C++, as the problem I'm coding is in C++. – May 19 '11 at 00:45

3 Answers


You want to use the Walker algorithm (the alias method). With N elements, there's a one-time set-up cost of O(N); after that, each sample costs O(1). See

  • A. J. Walker, An Efficient Method for Generating Discrete Random Variables and General Distributions, ACM TOMS 3, 253-256 (1977).
  • Knuth, TAOCP, Vol 2, Sec 3.4.1.A.

The RandomSelect class of RandomLib implements this algorithm.
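For reference, here is a minimal sketch of the alias-table construction and the O(1) sampling step. The class name AliasTable and its members are my own choices, not RandomLib's API, and a non-empty weight vector is assumed:

#include <random>
#include <vector>

// Walker's alias method: O(N) set-up, O(1) per sample.
struct AliasTable {
  std::vector<double> prob;  // acceptance probability for each bucket
  std::vector<int> alias;    // fallback element for each bucket

  explicit AliasTable(const std::vector<double>& w)
    : prob(w.size()), alias(w.size()) {
    const int n = static_cast<int>(w.size());
    double total = 0;
    for (double x : w) total += x;
    // Scale the weights so an average bucket holds probability exactly 1.
    std::vector<double> scaled(n);
    for (int i = 0; i < n; ++i) scaled[i] = w[i] * n / total;
    std::vector<int> small, large;
    for (int i = 0; i < n; ++i)
      (scaled[i] < 1.0 ? small : large).push_back(i);
    // Pair each under-full bucket with an over-full one.
    while (!small.empty() && !large.empty()) {
      int s = small.back(); small.pop_back();
      int l = large.back(); large.pop_back();
      prob[s] = scaled[s];
      alias[s] = l;
      scaled[l] -= 1.0 - scaled[s];  // the large bucket donates the deficit
      (scaled[l] < 1.0 ? small : large).push_back(l);
    }
    for (int i : small) prob[i] = 1.0;  // leftovers are (numerically) full
    for (int i : large) prob[i] = 1.0;
  }

  template <class RNG> int sample(RNG& rng) const {
    std::uniform_int_distribution<int> pick(0, static_cast<int>(prob.size()) - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    int i = pick(rng);                          // choose a bucket uniformly
    return coin(rng) < prob[i] ? i : alias[i];  // keep it, or take its alias
  }
};

Usage would be along the lines of std::mt19937 rng; AliasTable t({0.2, 0.3, 0.1, 0.4}); int i = t.sample(rng);. The table must be rebuilt whenever the weights change, so this pays off when many samples are drawn against fixed weights.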

cffk

Assuming that the element weights are fixed, you can work with precomputed prefix sums. This amounts to working with the cumulative distribution function directly, rather than the density function.

The lookup can then be implemented as a binary search, and hence takes O(log N) time in the number of elements.

A binary search obviously requires random access to the container of weights.
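A minimal sketch of that approach with a std::vector of prefix sums and std::upper_bound (the weights here are made up):

#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main()
{
  // Hypothetical weights; prefix[i] holds the sum of weights[0..i].
  std::vector<double> weights = {0.20, 0.10, 0.10, 0.40, 0.20};
  std::vector<double> prefix(weights.size());
  std::partial_sum(weights.begin(), weights.end(), prefix.begin());

  std::mt19937 rng(12345);
  std::uniform_real_distribution<double> dist(0.0, prefix.back());

  for (int i = 0; i != 10; ++i) {
    double x = dist(rng);  // x in [0, total weight)
    // Index of the first prefix sum strictly greater than x: O(log N).
    std::size_t idx =
        std::upper_bound(prefix.begin(), prefix.end(), x) - prefix.begin();
    std::cout << x << "\t" << idx << "\n";
  }
  return 0;
}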

Alternatively, use a std::map<> and the upper_bound() method.

#include <iostream>
#include <map>
#include <cstdlib>

int main ()
{
  // Keys are cumulative probabilities; values are the elements.
  std::map<double, char> cumulative;

  cumulative[.20]='a';
  cumulative[.30]='b';
  cumulative[.40]='c';
  cumulative[.80]='d';
  cumulative[1.00]='e';

  const int numTests = 10;
  for (int i = 0; i != numTests; ++i)
  {
    // rand()/(RAND_MAX + 1.0) lies in [0,1), so upper_bound never returns end().
    double linear = std::rand() / (RAND_MAX + 1.0);
    std::cout << linear << "\t" << cumulative.upper_bound(linear)->second << std::endl;
  }

  return 0;
}
Keith
  • You mean use STL upper_bound/lower_bound? Could you provide a quick example? – May 19 '11 at 00:50
  • @Curzon: To apply Keith's suggestion to your code, rather than assigning a weight to each of your elements, assign the weight plus the sum of the preceding weights. Then select a random value X in [0,1), and use map::lower_bound to obtain an iterator to the element whose key is _not less than_ X. (Alternatively, use upper_bound if the key should be strictly greater than X.) – decltype May 19 '11 at 06:54

If you have a quick enough way to sample an element uniformly, you can use rejection sampling; all you need to know is the maximum weight M. It works as follows: repeatedly pick an element uniformly at random together with a fresh number X uniform in [0,1), and accept the element as soon as its weight is at least M*X.
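A minimal sketch, with made-up names (weights holds the raw weights, maxWeight their maximum):

#include <random>
#include <vector>

// Rejection sampling: accept a uniform candidate with probability weight/maxWeight.
int sample_by_rejection(const std::vector<double>& weights, double maxWeight,
                        std::mt19937& rng)
{
  std::uniform_int_distribution<std::size_t> pick(0, weights.size() - 1);
  std::uniform_real_distribution<double> coin(0.0, maxWeight);
  for (;;) {
    std::size_t i = pick(rng);    // uniform candidate element
    if (coin(rng) < weights[i])   // i.e. weight >= M*X for fresh X in [0,1)
      return static_cast<int>(i);
  }
}

The expected number of trials per draw is maxWeight divided by the average weight, so this stays cheap unless the weights are very skewed.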

Or, for an approximate solution: pick 100 elements uniformly at random, then choose one of them with probability proportional to weight within that sample.

petrelharp