81

I would like to randomly select one element from an array, but each element has a known probability of selection.

All the chances together (within the array) sum to 1.

What algorithm would you suggest as the fastest and most suitable for huge calculations?

Example:

id => chance
array[
    0 => 0.8
    1 => 0.2
]

For this pseudocode, over many calls the algorithm should statistically return id 0 about four times for every one time it returns id 1.

Nicholas Carey
  • 71,308
  • 16
  • 93
  • 135
Mikulas Dite
  • 7,790
  • 9
  • 59
  • 99

15 Answers

79

Compute the discrete cumulative distribution function (CDF) of your list -- or in simple terms the array of cumulative sums of the weights. Then generate a random number in the range between 0 and the sum of all weights (might be 1 in your case), do a binary search to find this random number in your discrete CDF array and get the value corresponding to this entry -- this is your weighted random number.
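
A minimal Python sketch of this approach (the function and variable names are illustrative; the weights are assumed to be non-negative):

import random
from bisect import bisect_right

def weighted_choice(items, weights):
    # Build the array of cumulative sums (the discrete CDF).
    cumulative = []
    total = 0.0
    for w in weights:
        total += w
        cumulative.append(total)
    # Draw a random number in [0, total) and binary-search for its interval.
    r = random.random() * total
    return items[min(bisect_right(cumulative, r), len(items) - 1)]

# weighted_choice([0, 1], [0.8, 0.2]) returns 0 roughly 80% of the time.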

Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
  • The original array has about 500 elements, and the chances may differ very slightly, making the new weighted array really huge (I guess it would quickly reach over 10e5 elements). Would that affect the performance? Again, I have to use this function really often. – Mikulas Dite Dec 16 '10 at 17:28
  • 6
    @Mikulas Dite: This binary search would take `log2(500) = 9` steps per lookup. – thejh Dec 16 '10 at 17:33
  • First I thought that you are talking about creating the new array as @thejh suggested. However, I get it now. Thanks for the best solution! : ) – Mikulas Dite Dec 16 '10 at 17:40
  • 2
    Generating a random number between 0 and the sum of weights, who can guarantee that the generated random number will be in the cdf array? Let's assume we have [0.1 0.2 0.4 0.3] as the array of weights. The cdf array will be [0.1 0.3 0.7 1.0]. The rand value has to be generated between 0 and 1.0. It could be, for example, 0.62, but that value is not in the cdf array. – Mazzy Mar 16 '15 at 01:06
  • 3
    @Mazzy: You are looking for the interval containing the random number you generated -- in this case the interval from 0.3 to 0.7. Of course you can't expect the exact value to appear, but a binary search for finding the interval will work anyway. – Sven Marnach Mar 23 '15 at 16:35
  • 1
    @SvenMarnach Maybe something is not clear to me. When I apply binary search to the cdf array [0.1 0.3 0.7 1.0], what I expect is to find the rand value in the array. In the example above the rand value is 0.62. The binary search algorithm applied to the cdf array will look for the 0.62 value in the array, and if it doesn't find this value, it will report "not found". What I mean is that binary search has to find the right value, otherwise no value will be returned – Mazzy Mar 23 '15 at 20:01
  • 2
    @Mazzy: Binary search can be easily used to find the interval the value you are looking for lies in, and that's all you need. Most binary search implementations in standard libraries of programming languages don't require the exact value to be found, e.g. [`lower_bound()` in C++](http://en.cppreference.com/w/cpp/algorithm/lower_bound) or [`bisect_left()` in Python](https://docs.python.org/2/library/bisect.html#bisect.bisect_left). – Sven Marnach Mar 30 '15 at 10:18
  • @SvenMarnach - using this implementation, if I need to choose n number of unique objects from the array, after I select an object and remove it, do I need to re-calculate the total weight and regenerate the array of cumulative weights each time? – Mike Apr 23 '15 at 23:04
  • @Mike: That would work. You can also simply select further objects from the list, and retry if you happen to hit one that you've already got. This approach is better if the total weights of objects that are selected during this process is likely to be small, because you rarely need to retry in that case. – Sven Marnach Apr 26 '15 at 16:01
  • @SvenMarnach Thanks for the info! I ended up implementing tournament selection rather than roulette wheel for my GA for a number of reasons, but this was indeed very helpful! – Mike Apr 26 '15 at 19:58
  • 1
    I couldn't quite understand the approach of Sven Marnach. @SvenMarnach - say the example of A = [1,2,3], Pa = [0.2, 0.6, 0.2], meaning I want to see 2 three times more often than 1 or 3 in the long run. In this case, cdf(Pa) = [0.2, 0.8, 1.0]. Now, generating a rand number between 0 and 1, let's say I get 0.9. How do I go ahead with the binary search? Do I choose 2 or 3? Please clarify on this. Your solution looks the best so far. Thanks – Karan Kapoor Sep 30 '15 at 07:45
  • 1
    @KaranKapoor 0.9 is in the last interval, so you choose 3. Most binary search algorithms will return the index the random number would have to be inserted at to keep the list ordered. You can use that index directly as an index to `A`. – Sven Marnach Sep 30 '15 at 10:27
  • You can get to *O(1)* generation time instead of *O(log(n))* by using the [Alias Method](https://en.wikipedia.org/wiki/Alias_method). I added this as an answer. – Simon Baumgardt-Wellander Aug 29 '16 at 05:37
  • @SimonBaumgardt-Wellander There already is an [answer by jonderry](http://stackoverflow.com/a/4463960/279627) describing a simple method that has O(1) expected complexity (which deserves more upvotes than it got). That said, in practice O(log(n)) is good enough most of the time, and the implementation is easier. – Sven Marnach Aug 29 '16 at 10:30
16

The algorithm is straightforward:

rand_num = rand(0,1)
for each element in array
     if(rand_num < element.probability)
          select and break
     rand_num = rand_num - element.probability
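
A direct Python rendering of this pseudocode (names are my own; the probabilities are assumed to sum to 1):

import random

def pick(elements, probabilities):
    rand_num = random.random()            # uniform in [0, 1)
    for element, p in zip(elements, probabilities):
        if rand_num < p:                  # the draw falls in this element's slice
            return element
        rand_num -= p                     # shift the draw past this slice
    return elements[-1]                   # guard against floating-point round-off
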
  • This would not work, because I have the chances, not the area. | Even though someone downvoted this answer, it gave me a viable idea. The limits are quite simply computed and should not affect performance. – Mikulas Dite Dec 16 '10 at 17:31
  • @Mikulas assuming you have discrete chances and a random number equally distributed between 0 and 1, it will give a probability equal to their weight. For your case there is an 80% chance the random number will be less than .8, in which case the first element will be selected, and a 20% chance it's greater than .8, in which case the second element will be selected. –  Dec 16 '10 at 17:44
  • This would require ordering the array, starting with the lowest chances to get selected, wouldn't it? That is a computation I can't afford. (Notice that I do not preserve the list of previously selected elements) – Mikulas Dite Dec 16 '10 at 17:51
  • 1
    No it will work without sorting, and works faster than binary search if you want to remove the element once it is selected. –  Dec 16 '10 at 17:53
  • It is basically the same as my answer without precomputing the CDF, so you have to use linear search instead of binary search. But of course it works. – Sven Marnach Dec 16 '10 at 17:54
  • To add: let's say the elements in your example were the other way round; then the probability of the random number being less than .2 is 20% and greater than .2 is 80%, which shows it is order independent. –  Dec 16 '10 at 17:55
  • 6
    Sorry for the question, what if I had two element with the same weight? In this case I would get only the first one of the two elements in the array or I am wrong? – arpho May 29 '14 at 21:04
  • 1
    @arpho I tested your hypothesis [in JavaScript](https://jsfiddle.net/mwbrympx/1/). It looks like you're wrong. – 4castle Jul 18 '17 at 17:38
  • To make this algorithm work you must ensure that 1) the sum of probabilities adds up to 1 (or whatever you pass as the second parameter to `rand`) 2) you do the very important subtraction step on the last line. – Vladimir Panteleev Aug 15 '22 at 20:06
12

I have found this article to be the most useful for understanding this problem fully. This stackoverflow question may also be what you're looking for.


I believe the optimal solution is to use the Alias Method (wikipedia). It requires O(n) time to initialize, O(1) time to make a selection, and O(n) memory.

Here is the algorithm for generating the result of rolling a weighted n-sided die (from here it is trivial to select an element from a length-n array), as taken from this article. The author assumes you have functions for rolling a fair die (floor(random() * n)) and flipping a biased coin (random() < p).

Algorithm: Vose's Alias Method

Initialization:

  1. Create arrays Alias and Prob, each of size n.
  2. Create two worklists, Small and Large.
  3. Multiply each probability by n.
  4. For each scaled probability p_i:
    1. If p_i < 1, add i to Small.
    2. Otherwise (p_i ≥ 1), add i to Large.
  5. While Small and Large are not empty: (Large might be emptied first)
    1. Remove the first element from Small; call it l.
    2. Remove the first element from Large; call it g.
    3. Set Prob[l] = p_l.
    4. Set Alias[l] = g.
    5. Set p_g := (p_g + p_l) − 1. (This is a more numerically stable option.)
    6. If p_g < 1, add g to Small.
    7. Otherwise (p_g ≥ 1), add g to Large.
  6. While Large is not empty:
    1. Remove the first element from Large; call it g.
    2. Set Prob[g] = 1.
  7. While Small is not empty: This is only possible due to numerical instability.
    1. Remove the first element from Small; call it l.
    2. Set Prob[l] = 1.

Generation:

  1. Generate a fair die roll from an n-sided die; call the side i.
  2. Flip a biased coin that comes up heads with probability Prob[i].
  3. If the coin comes up "heads," return i.
  4. Otherwise, return Alias[i].
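
A compact Python sketch of the initialization and generation steps above (this is an illustration of the description, not the article's reference code):

import random

def build_alias_table(probabilities):
    n = len(probabilities)
    prob, alias = [0.0] * n, [0] * n
    scaled = [p * n for p in probabilities]                 # step 3
    small = [i for i, p in enumerate(scaled) if p < 1.0]    # step 4
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:                                  # step 5
        l, g = small.pop(), large.pop()
        prob[l], alias[l] = scaled[l], g
        scaled[g] = (scaled[g] + scaled[l]) - 1.0           # numerically stable update
        (small if scaled[g] < 1.0 else large).append(g)
    for g in large:                                         # step 6
        prob[g] = 1.0
    for l in small:                                         # step 7 (round-off only)
        prob[l] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    i = random.randrange(len(prob))                         # fair n-sided die roll
    return i if random.random() < prob[i] else alias[i]     # biased coin flip
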
Community
  • 1
  • 1
8

Here is an implementation in Ruby:

def weighted_rand(weights = {})
  raise 'Probabilities must sum up to 1' unless weights.values.inject(&:+) == 1.0
  raise 'Probabilities must not be negative' unless weights.values.all? { |p| p >= 0 }
  # Do more sanity checks depending on the amount of trust in the software component using this method,
  # e.g. don't allow duplicates, don't allow non-numeric values, etc.
  
  # Ignore elements with probability 0
  weights = weights.reject { |k, v| v == 0.0 }   # e.g. => {"a"=>0.4, "b"=>0.4, "c"=>0.2}

  # Accumulate probabilities and map them to a value
  u = 0.0
  ranges = weights.map { |v, p| [u += p, v] }   # e.g. => [[0.4, "a"], [0.8, "b"], [1.0, "c"]]

  # Generate a (pseudo-)random floating point number between 0.0(included) and 1.0(excluded)
  u = rand   # e.g. => 0.4651073966724186
  
  # Find the first value that has an accumulated probability greater than the random number u
  ranges.find { |p, v| p > u }.last   # e.g. => "b"
end

How to use:

weights = {'a' => 0.4, 'b' => 0.4, 'c' => 0.2, 'd' => 0.0}

weighted_rand weights

What to expect roughly:

sample = 1000.times.map { weighted_rand weights }
sample.count('a') # 396
sample.count('b') # 406
sample.count('c') # 198
sample.count('d') # 0
wteuber
  • 1,208
  • 9
  • 15
  • Just used this and realized I recognized the name! Thanks @wolfgang-teuber! – Abe Petrillo Aug 13 '16 at 14:15
  • 2
    One caveat with this method is that if you have a weighting of 1.0 and the rest as 0.0, this method will not work as expected. We had the weightings as ENV variables and when we switched one of the weightings to 1.0 (i.e. making it always true) it had the opposite effect. Just an FYI for others out there who use this method! – Abe Petrillo Nov 03 '16 at 18:51
  • @AbePetrillo I updated the `weighted_rand` method to fix the issue you described. – wteuber Aug 23 '18 at 23:00
6

An example in Ruby:

#each element is associated with its probability
a = {1 => 0.25, 2 => 0.5, 3 => 0.2, 4 => 0.05}

#at some point, convert to cumulative probability
acc = 0
a.each { |e,w| a[e] = acc+=w }

#to select an element, pick a random number between 0 and 1 and find the first
#cumulative probability that's greater than the random number
r = rand
selected = a.find{ |e,w| w>r }

p selected[0]
krusty.ar
  • 4,015
  • 1
  • 25
  • 28
  • 6
    In this algorithm, the last element will never be selected as its probability is 1.0, and rand will always be between 0 and 1. – Matt Darby Jun 13 '13 at 13:07
6

This can be done in O(1) expected time per sample as follows.

Compute the CDF F(i) for each element i to be the sum of probabilities less than or equal to i.

Define the range r(i) of an element i to be the interval [F(i - 1), F(i)].

For each interval [(i - 1)/n, i/n], create a bucket consisting of the list of the elements whose range overlaps the interval. This takes O(n) time in total for the full array as long as you are reasonably careful.

When you randomly sample the array, you simply compute which bucket the random number is in, and compare with each element of the list until you find the interval that contains it.

The cost of a sample is O(the expected length of a randomly chosen list) <= 2.
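
A rough Python sketch of this bucketing idea, assuming strictly positive weights (the bookkeeping can be arranged in several ways; this is just one):

import random

def build_buckets(weights):
    n = len(weights)
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w
        cdf.append(acc / total)
    buckets = [[] for _ in range(n)]
    lo = 0.0
    for i, hi in enumerate(cdf):
        # element i covers [lo, hi); register it with every bucket [k/n, (k+1)/n) it overlaps
        for k in range(int(lo * n), min(int(hi * n), n - 1) + 1):
            buckets[k].append((hi, i))
        lo = hi
    return buckets

def sample(buckets):
    u = random.random()
    for hi, i in buckets[int(u * len(buckets))]:
        if u < hi:
            return i
    return len(buckets) - 1   # floating-point edge case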

jonderry
  • 23,013
  • 32
  • 104
  • 171
  • This algorithm has a worst-case complexity of O(n) if the weights are of vastly different magnitudes. It might happen that all intervals belong to the same bucket. Without additional restrictions on the weights, this is definitely not O(1) and not even O(log n). – Sven Marnach Dec 16 '10 at 18:19
  • The worst case occurs only rarely. If all n intervals overlapped one bucket, then almost all queries would require a comparison to just one interval. In practice, this will be significantly faster than binary search. If you insist on optimizing for the worst case, you could do binary search inside each bucket, making each query cost O(lg(the length of the biggest bucket)) in the worst case, and O(the expectation of lg(the length of a randomly chosen list)) in expectation, which is still just O(1). – jonderry Dec 16 '10 at 18:26
  • Thanks, it looks really good. I will have to run some trials in order to determine whether it is really a faster method than the CDF approach in my solution. – Mikulas Dite Dec 16 '10 at 18:42
  • On average it is O(1), that's true. +1 – Sven Marnach Dec 16 '10 at 19:01
  • 1
    @Mikulas Dite, It's worth stressing that this is a CDF-array solution as well, and the difference with pure binary search is kind of like the difference between doing binary search and hashing to search for an element in an array. Another way of looking at it is that you compute the CDF array, and rather than do binary search on it, you hash the random number to the array index corresponding to the start of the bucket. Then you can use whatever search strategy you want (e.g., either brute-force linear search, or binary search) to narrow down further to the correct sampled element. – jonderry Dec 16 '10 at 21:25
  • 1
    Note that you have better guarantees here than in your usual "worst case" evaluation, because your accesses are *known* to be random, by construction... – comingstorm Dec 17 '10 at 00:19
5

This is PHP code I used in production:

/**
 * @return \App\Models\CdnServer
 */
protected function selectWeightedServer(Collection $servers)
{
    if ($servers->count() == 1) {
        return $servers->first();
    }

    $totalWeight = 0;

    foreach ($servers as $server) {
        $totalWeight += $server->getWeight();
    }

    // Select a random server using weighted choice
    $randWeight = mt_rand(1, $totalWeight);
    $accWeight = 0;

    foreach ($servers as $server) {
        $accWeight += $server->getWeight();

        if ($accWeight >= $randWeight) {
            return $server;
        }
    }
}
Agent Coop
  • 392
  • 4
  • 12
3

Ruby solution using the pickup gem:

require 'pickup'

chances = {0=>80, 1=>20}
picker = Pickup.new(chances)

Example:

5.times.collect {
  picker.pick(5)
}

gave output:

[[0, 0, 0, 0, 0], 
 [0, 0, 0, 0, 0], 
 [0, 0, 0, 1, 1], 
 [0, 0, 0, 0, 0], 
 [0, 0, 0, 0, 1]]
devstopfix
  • 6,698
  • 4
  • 34
  • 32
3

The most efficient algorithm, it seems to me, is to produce, for each element of the array, a random number drawn from an exponential distribution with parameter given by the weight for that element. As you go through the array, retain the element with the lowest such ‘ordering number’. In this case, the probability that a particular element has the lowest ordering number of the array is proportional to the array element's weight.

Details and code below.

This algorithm is O(n), involves no sorting or extra storage, and the selection can be done in the course of a single pass through the array. The weights must be greater than zero, but don't have to sum to any particular value.

Additional feature: if you store the ordering number with each array element, you have the option to sort the array by increasing ordering number, to get a random ordering of the array in which elements with higher weights have a higher probability of coming early (I've found this useful when deciding which DNS SRV record to pick, to decide which machine to query).

Other algorithms: Repeated random sampling with replacement requires a new pass through the array each time; for random selection without replacement, the array can be sorted in order of increasing ordering number, and k elements can be read out in that order.

See the Wikipedia page about the exponential distribution (in particular the remarks about the distribution of the minima of an ensemble of such variates) for the proof that the above is true, and also for the pointer towards the technique of generating such variates: if T has a uniform random distribution in [0,1), then Z=-log(1-T)/w (where w is the parameter of the distribution; here the weight of the associated element) has an exponential distribution.

That is:

  1. For each element i in the array, calculate z_i = -log(1-T)/w_i, where T is drawn from a uniform distribution in [0,1), and w_i is the weight of the i'th element.
  2. When going through the array, retain a reference to the element which has the lowest z_i so far.

The element i will be selected with probability w_i/(w_1+w_2+...+w_n).

See below for an illustration of this in Python, which takes a single pass through the array of weights, for each of 10000 trials.

import math, random

random.seed()

weights = [10, 20, 50, 20]
nw = len(weights)
results = [0 for i in range(nw)]

n = 10000
while n > 0: # do n trials
    smallest_i = 0
    smallest_z = -math.log(1-random.random())/weights[0]
    for i in range(1, nw):
        z = -math.log(1-random.random())/weights[i]
        if z < smallest_z:
            smallest_i = i
            smallest_z = z

    # we have selected element 'smallest_i'
    results[smallest_i] += 1 # accumulate our choices

    n -= 1

for i in range(nw):
    print("{} -> {}".format(weights[i], results[i]))

Edit (for history): after posting this, I felt sure I couldn't be the first to have thought of it, and another search with this solution in mind shows that this is indeed the case.

  • In an answer to a similar question, Joe K suggested this algorithm (and also noted that someone else must have thought of it before).
  • Another answer to that question, meanwhile, pointed to Efraimidis and Spirakis (preprint), which describes a similar method.
  • I'm pretty sure, looking at it, that the Efraimidis and Spirakis method is in fact the same exponential-distribution algorithm in disguise, and this is corroborated by a passing remark in the Wikipedia page about Reservoir sampling that ‘[e]quivalently, a more numerically stable formulation of this algorithm’ is the exponential-distribution algorithm above. The reference there is to a sequence of lecture notes by Richard Arratia; the relevant property of the exponential distribution is mentioned in Sect. 1.3 (which mentions that something similar to this is a ‘familiar fact’ in some circles), but not its relationship to the Efraimidis and Spirakis algorithm.
Norman Gray
  • 11,978
  • 2
  • 33
  • 56
2

If the array is small, I would give the array a length of, in this case, five and assign the values as appropriate:

array[
    0 => 0
    1 => 0
    2 => 0
    3 => 0
    4 => 1
]
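
In Python terms (using the 0.8/0.2 example from the question), this amounts to something like:

import random

expanded = [0, 0, 0, 0, 1]      # id 0 appears four times, id 1 once
print(random.choice(expanded))  # 0 about 80% of the time
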
thejh
  • 44,854
  • 16
  • 96
  • 107
2

"Wheel of Fortune" O(n), use for small arrays only:

function pickRandomWeighted(array, weights) {
    var sum = 0;
    for (var i = 0; i < weights.length; i++) sum += weights[i];
    // walk the wheel: subtract each weight until the pick lands inside a slice
    var pick = Math.random() * sum;
    for (var i = 0; i < weights.length; i++) {
        if (pick < weights[i]) return array[i];
        pick -= weights[i];
    }
}
Sarsaparilla
  • 6,300
  • 1
  • 32
  • 21
1

The trick could be to sample an auxiliary array whose element repetitions reflect the probabilities.

Given the elements associated with their probabilities, as percentages:

h = {1 => 0.5, 2 => 0.3, 3 => 0.05, 4 => 0.05 }

auxiliary_array = h.inject([]){|memo,(k,v)| memo += Array.new((100*v).to_i,k) }   

ruby-1.9.3-p194 > auxiliary_array
 => [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]

auxiliary_array.sample

If you want to be as generic as possible, you need to calculate the multiplier based on the max number of fractional digits, and use it in place of 100:

m = 10**h.values.collect{|e| e.to_s.split(".").last.size }.max
masciugo
  • 1,113
  • 11
  • 19
0

I would imagine that numbers greater than or equal to 0.8 but less than 1.0 select the third element.

In other terms:

x is a random number between 0 and 1

if 0.0 <= x < 0.2 : Item 1

if 0.2 <= x < 0.8 : Item 2

if 0.8 <= x < 1.0 : Item 3
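
In code, those interval checks look roughly like this (a Python sketch of the hypothetical three-item example):

import random

x = random.random()   # uniform in [0.0, 1.0)
if x < 0.2:
    item = 1          # 0.0 <= x < 0.2
elif x < 0.8:
    item = 2          # 0.2 <= x < 0.8
else:
    item = 3          # 0.8 <= x < 1.0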

Ryan Rich
  • 11
  • 2
0

I am going to improve on masciugo's answer (https://stackoverflow.com/users/626341/masciugo).

Basically you make one big array where the number of times an element shows up is proportional to the weight.

It has some drawbacks.

  1. The weights might not be integers. Imagine element 1 has an irrational probability, say 1/π, and element 2 has probability 1 − 1/π. How do you divide that? Or imagine there are hundreds of such elements.
  2. The array created can be very big. Imagine if the least common multiple of the denominators is 1 million; then we would need an array of 1 million elements to pick from.

To counter that, this is what you do.

Create such an array, but insert each element only with some probability; the probability that an element is inserted is proportional to its weight.

Then select a random element as usual.

So if there are 3 elements with various weights, you simply pick an element from an array of 1 to 3 elements.

Problems may arise if the constructed array is empty, that is, if it just happens that no elements show up in the array because their dice rolled the other way.

In that case, I propose that the probability an element is inserted is p(inserted) = w_i / w_max.

That way, one element, namely the one that has the highest probability, will always be inserted. The other elements will be inserted with probability proportional to their relative weights.

Say we have 2 objects.

Element 1 shows up 20% of the time. Element 2 shows up 40% of the time and has the highest probability.

In the array, element 2 will show up all the time. Element 1 will show up half the time.

So element 2 will be selected twice as often as element 1. More generally, all other elements will be selected in proportion to their weights. Also, the probabilities sum to 1 because the array will always contain at least one element.

Community
  • 1
  • 1
user4951
  • 32,206
  • 53
  • 172
  • 282
  • My math is off. Looks like elements with higher number will have higher actual probability with this technique. I would suggest the most voted answer now. – user4951 Jul 31 '16 at 13:39
0

I wrote an implementation in C#:

https://github.com/cdanek/KaimiraWeightedList

O(1) gets (fast!), O(n) recalculates, O(n) memory use.

Chris
  • 444
  • 2
  • 12