53

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well-known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.

Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
  • see this question http://stackoverflow.com/q/10164303/112100, even though it's C#, not Python, it's very little code so anyone should understand it – Omu Apr 15 '12 at 17:39
  • 2
    For anyone else who had to look it up, "reservoir algorithm" is on Wikipedia under "[reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling)". The first paper cited is Jeffrey Scott Vitter's "Random Sampling with a Reservoir", from _ACM Transactions on Mathematical Software_, Vol. 11, No. 1, 01 Mar 1985, pp. 37--57. – Kevin J. Chase Mar 30 '16 at 03:51
  • For weighted-without-replacement, where weight means that the probability of being chosen is proportional to the weight, see my answer here: [stackoverflow.com/a/27049669/262304](http://stackoverflow.com/a/27049669/262304) Note that some inputs don't have a solution, e.g. pick 2 from {'a': 3, 'b': 1, 'c': 1} should yield 'a' 3x as often as either b or c, but that is impossible. – ech Jan 18 '17 at 20:29
  • I'm not even sure that weighted selection without replacement is well defined. For "pick 2 from {'a': 3, 'b': 1, 'c': 1}", one possible interpretation is that `a` should appear 3x as often as `b` or `c`, which is impossible, yes. Another possible interpretation is that the results `ab`/`ba`/`ac`/`ca` should have relative weight 3*1, and `bc`/`cb` should have relative weight 1*1. Another possible interpretation is that we choose the first value normally, and the second value according to the relative weights of whatever wasn't chosen the first time - such that `ab` would be more likely than `ba`. – Karl Knechtel Jul 05 '22 at 20:49
  • Out of curiosity, I did some research and found that `numpy.choice` supports this - taking the latter interpretation. – Karl Knechtel Jul 05 '22 at 20:57

9 Answers

34

One of the fastest ways to draw many samples with replacement from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, avoiding a binary search. It turns out that, done correctly, we only need to store two items from the original list per bin, and so can represent the split with a single percentage.

Let's take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1)

To create the alias lookup:

  1. Normalize the weights such that they sum to 1.0: (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2). This is the probability of choosing each item.

  2. Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.

  3. Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)

  4. If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.

Repeat steps 3 and 4 until all of the weight from the original list has been assigned to partitions.

For example, if we run another iteration of 3 and 4, we see

(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned

At runtime:

  1. Get a U(0,1) random number, say binary 0.001100000

  2. Bit-shift it left by lg2(|p|) bits, finding the partition index. Thus, we shift it by 3, yielding 001.1, i.e. position 1, and thus partition 2.

  3. If the partition is split, use the fractional portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.

Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
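
For illustration, here is a minimal Python sketch of the runtime lookup described above, assuming the table has already been built as a list of (primary, alias, split) entries whose length is a power of two. The table layout, the 32 bits used for the split test, and the example table values are assumptions of this sketch, not something given in the answer.

import random

def alias_lookup(table):
    # table: list of (primary, alias, split); len(table) must be a power of two
    k = len(table).bit_length() - 1        # lg2(|p|)
    bits = random.getrandbits(k + 32)      # one draw with enough bits for both steps
    idx = bits >> 32                       # top k bits select the partition
    frac = (bits & 0xFFFFFFFF) / 2 ** 32   # remaining bits decide the split
    primary, alias, split = table[idx]
    return primary if frac < split else alias

# One possible completed table for the five-equal-weights example above
# (ties broken alphabetically when picking the heaviest remaining item):
table = [('a', 'a', 1.0), ('a', 'b', 0.6), ('b', 'b', 1.0), ('b', 'c', 0.2),
         ('c', 'd', 0.8), ('d', 'd', 1.0), ('d', 'e', 0.4), ('e', 'e', 1.0)]
print(alias_lookup(table))  # each of 'a'..'e' comes back with probability 0.2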

Regexident
  • 29,441
  • 10
  • 93
  • 100
John with waffle
  • 4,113
  • 21
  • 32
  • 1
    The bitwise trick is neat, but keep in mind that the random number used has to be large enough to select a partition and to select a value within that partition. I'm not sure how to calculate the required number of bits needed to calculate the 2nd part, but one should make sure they have enough bits... (for example, on a 32-bit machine with 2^32 partitions, you're going to need more bits than a single random number!) I just use two random numbers for each sampling. – Ryan Marcus Jun 25 '12 at 23:45
  • This is true, you need to know how many random bits you are promised by your generator for a given sample for this to work correctly. If you don't know, take two, because on modern generators the phase (or uniform dependence between samples) is very large. – John with waffle Nov 05 '12 at 17:57
  • 1
    Here is a Ruby implementation of the Walker Alias method as well: https://github.com/cantino/walker_method – Andrew Mar 05 '13 at 07:21
  • You don't need the next greatest power of two restriction. N bins for N weights works fine. In step 3, you don't need an item with the least remaining weight, only one with less than the average. This actually speeds up the algorithm a lot, because you don't need to sort the weights, only partition them into light/heavy. – Rob Neuhaus Mar 19 '13 at 18:40
  • The power of two is for bit shifting. You don't have to use bit shifting, and if you don't you are not limited to powers of two. Also, the lightest remaining weight is taken at lookup build-time, not sample time, so it doesn't make much difference. If even that is a concern, use a min-heap. I understand there are some subtle correctness cases if you don't select the minimum, but I don't recall them. In other words, do otherwise at your own risk. – John with waffle Mar 20 '13 at 16:51
14

A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In Python, you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:

import heapq
import math
import random

def WeightedSelectionWithoutReplacement(weights, m):
    # Key for item i is log(U)/w_i with U ~ Uniform(0,1); the m largest keys
    # form a weighted random sample without replacement (Efraimidis & Spirakis).
    elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
    return [x[1] for x in heapq.nlargest(m, elt)]
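
A quick usage sketch (the weights below are just illustrative):

weights = [1.0, 2.0, 3.0, 4.0]
picked = WeightedSelectionWithoutReplacement(weights, 2)
print(picked)  # e.g. [3, 1] -- two distinct indices; heavier indices show up more often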

This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on that answer). Efraimidis and Spirakis proved in the linked paper that their approach is equivalent to weighted random sampling without replacement.

josliber
  • 43,891
  • 12
  • 98
  • 133
  • 2
    Optimized (2.5k gas) Solidity version of log2(0..1) can be found here: https://gist.github.com/k06a/af6c58fe6634e48e53929451877eb5b5 – k06a Aug 12 '19 at 17:55
5

Here's what I came up with for weighted selection without replacement:

import random

def WeightedSelectionWithoutReplacement(l, n):
  """Selects without replacement n random elements from a list of (weight, item) tuples."""
  l = sorted((random.random() * x[0], x[1]) for x in l)
  return l[-n:]

This is O(m log m), where m is the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.

Here's what I came up with for weighted selection with replacement:

import bisect

def WeightedSelectionWithReplacement(l, n):
  """Selects with replacement n random elements from a list of (weight, item) tuples."""
  cuml = []      # cumulative weights, in input order
  items = []
  total_weight = 0.0
  for weight, item in l:
    total_weight += weight
    cuml.append(total_weight)
    items.append(item)
  # bisect on the cumulative weights, then return the matching item
  return [items[bisect.bisect(cuml, random.random() * total_weight)]
          for _ in range(n)]

This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
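
A quick usage sketch (the items and weights below are just illustrative):

items = [(1.0, 'A'), (2.0, 'B'), (3.0, 'C')]
print(WeightedSelectionWithReplacement(items, 5))     # e.g. ['C', 'B', 'C', 'A', 'C']
print(WeightedSelectionWithoutReplacement(items, 2))  # e.g. [(0.7, 'B'), (2.1, 'C')]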

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
  • 6
    That first function is brilliant, but alas it doesn't weight the items correctly. Consider `WeightedSelectionWithoutReplacement([(1, 'A'), (2, 'B')], 1)`. It will choose A with probability 1/4, not 1/3. Hard to fix. – Jason Orendorff Jan 28 '10 at 14:27
  • Btw, faster but more complex algorithms are in my answer here: http://stackoverflow.com/questions/2140787/select-random-k-elements-from-a-list-whose-elements-have-weights/2149533#2149533 – Jason Orendorff Jan 28 '10 at 14:43
  • Nice find @JasonOrendorff. In fact the difference is quite bad. For weights (1, 2, 3, 4), you'd expect "1" to be chosen 1/10 of the time, but it'll be chosen 1/94 of the time. I really wanted that to work! – Lawrence Kesteloot Mar 09 '12 at 05:11
  • 1
    @JasonOrendorff: How did you calculate 1/4? If you have a formula for that, can we invert it and replace the original weights with weights that will give correct results? – Lawrence Kesteloot Mar 09 '12 at 06:44
  • @LawrenceKesteloot – for the 1/4, here's how I look at it: (random()*1) ranges from 0–1. If it's 1, the chance that it is larger than (random()*2) is 1/2. If it's 0, the chance is 0. The average chance is: 1/4. – ech Jan 18 '17 at 20:12
4

It is possible to do weighted random selection with replacement in O(1) time per sample, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.

The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we walk through the bins, and for any underpopulated bin that would receive excess hits, we assign the excess to an overpopulated bin. For each bin, we store the percentage of hits that belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that a bucket has already been processed.

Here is a minimal Python implementation, based on the C implementation here:

import random

def prep(weights):
    data_sz = len(weights)
    factor = data_sz/float(sum(weights))
    # data[i] = [normalized weight (later the split threshold), alias index]
    data = [[w*factor, i] for i,w in enumerate(weights)]
    big=0
    while big<data_sz and data[big][0]<=1.0: big+=1
    for small,bucket in enumerate(data):
        if bucket[1] != small: continue   # already assigned an alias; skip
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big==data_sz: break
            bucket[1] = big               # excess hits in this bucket go to item big
            bucket = data[big]
            bucket[0] -= excess           # the big item gave that mass away
            excess = 1.0 - bucket[0]
            if (excess >= 0):
                big+=1
                while big<data_sz and data[big][0]<=1: big+=1
    return data

def sample(data):
    r=random.random()*len(data)
    idx = int(r)      # integer part picks the bucket
    # fractional part decides between the bucket's own item and its alias
    return data[idx][1] if r-idx > data[idx][0] else idx

Example usage:

TRIALS=1000
weights = [20,1.5,9.8,10,15,10,15.5,10,8,.2]
samples = [0]*len(weights)
data = prep(weights)

for _ in range(int(sum(weights)*TRIALS)):
    samples[sample(data)]+=1

result = [float(s)/TRIALS for s in samples]
err = [a-b for a,b in zip(result,weights)]
print(result)
print([round(e,5) for e in err])
print(sum([e*e for e in err]))
AShelly
  • 34,686
  • 15
  • 91
  • 152
4

I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.

If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.

John D. Cook
  • 29,517
  • 10
  • 67
  • 94
  • 6
    I just took a look at section 3.4.2, and it covers only unbiased selection with and without replacement - there's no mention made of weighted selection. – Nick Johnson Dec 09 '08 at 16:25
  • §3.4.1 discusses Walker's alias method, which is for weighted selection with replacement. – rob mayoff Jul 28 '19 at 05:32
4

The following is a description of random weighted selection of an element of a set (or multiset, if repeats are allowed), both with and without replacement, in O(n) space and O(log n) time per selection.

It consists of implementing a binary search tree, sorted by the elements to be selected, where each node of the tree contains:

  1. the element itself (element)
  2. the un-normalized weight of the element (elementweight), and
  3. the sum of all the un-normalized weights of the left-child node and all of its children (leftbranchweight).
  4. the sum of all the un-normalized weights of the right-child node and all of its children (rightbranchweight).

Then we randomly select an element from the BST by descending down the tree. A rough description of the algorithm follows. The algorithm is given a node of the tree. Then the values of leftbranchweight, rightbranchweight, and elementweight of the node are summed, and the weights are divided by this sum, resulting in the values leftbranchprobability, rightbranchprobability, and elementprobability, respectively. Then a random number between 0 and 1 (randomnumber) is obtained.

  • if the number is less than elementprobability,
    • remove the element from the BST as normal, updating leftbranchweight and rightbranchweight of all the necessary nodes, and return the element.
  • else if the number is less than (elementprobability + leftbranchprobability)
    • recurse on leftchild (run the algorithm using leftchild as node)
  • else
    • recurse on rightchild

When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).

DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation of a BST is not attempted here; rather, it is hoped that this answer will help those who really need fast weighted selection without replacement (like I do).
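
As a rough illustration only, here is a minimal Python sketch of the descent step for the with-replacement case. The Node class name and field layout are assumptions of this sketch that mirror the fields listed above; removal and weight updates for the without-replacement case are omitted.

import random
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    element: Any
    elementweight: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    leftbranchweight: float = 0.0    # total un-normalized weight of the left subtree
    rightbranchweight: float = 0.0   # total un-normalized weight of the right subtree

def weighted_select(node):
    """One weighted descent from the given node; returns the chosen element."""
    total = node.elementweight + node.leftbranchweight + node.rightbranchweight
    r = random.random() * total
    if r < node.elementweight:
        return node.element
    elif r < node.elementweight + node.leftbranchweight:
        return weighted_select(node.left)
    else:
        return weighted_select(node.right)

# Example: a three-node tree with weights a:1, b:2, c:3
root = Node('b', 2.0,
            left=Node('a', 1.0), right=Node('c', 3.0),
            leftbranchweight=1.0, rightbranchweight=3.0)
print(weighted_select(root))  # 'c' about half the time, 'b' a third, 'a' a sixth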

djhaskin987
  • 9,741
  • 4
  • 50
  • 86
  • For the weighted-without-replacement algorithm, this produces the wrong result. That is, elements will not be chosen with a probability proportional to their weights. See http://stackoverflow.com/a/27049669/262304 for a solution. – ech Jan 18 '17 at 19:57
  • 1
    I've implemented that algorithm (or a slight refinement) here btw https://github.com/makoConstruct/jostletree.rs (it's rust though) – mako Sep 21 '21 at 09:34
1

This is an old question for which numpy now offers an easy solution, so I thought I would mention it. numpy.random.choice (available since numpy 1.7) allows the sampling to be done with or without replacement and with given weights.
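
A minimal sketch (the items and weights below are just illustrative):

import numpy as np

items = ['a', 'b', 'c']
probs = [0.2, 0.3, 0.5]  # the p= argument must sum to 1
print(np.random.choice(items, size=10, replace=True, p=probs))   # with replacement
print(np.random.choice(items, size=2, replace=False, p=probs))   # without replacement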

0

Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with a probability distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using the numpy.random module it is as easy as this:

    import numpy.random as rnd

    sampling_size = 3
    domain = ['white','blue','black','yellow','green']
    probs = [.1, .2, .4, .1, .2]
    sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
    # in short: rnd.choice(domain, sampling_size, False, probs)
    print(sample)
    # Possible output: ['white' 'black' 'blue']

Setting the replace flag to True gives you sampling with replacement.

More info here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice

Maroxo
  • 11
  • 1
0

We faced the problem of randomly selecting K validators out of N candidates once per epoch, proportionally to their stakes. But this gives us the following problem:

Imagine probabilities of each candidate:

0.1
0.1
0.8

The probabilities of each candidate after 1,000,000 selections of 2 out of 3 without replacement became:

0.254315
0.256755
0.488930

Note that the original probabilities are not achievable for a 2-of-3 selection without replacement.

But we want the initial probabilities to be the profit distribution probabilities; otherwise small candidate pools become more profitable. So we realized that random selection with replacement would help us. We randomly select >K of N and also store the weight of each validator for reward distribution:

// Assumed inputs (not shown in the snippet): n candidates, m validators to
// select, likehoods[i] proportional to candidate i's stake, and
// likehoodsSum equal to the sum of all likehoods.
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;

for (int j = 0; validators.size() < m; j++) {
    int value = rand() % likehoodsSum;
    for (int i = 0; i < n; i++) {
        if (value < likehoods[i]) {
            if (weights[i] == 0) {
                validators.push_back(i);   // first hit: candidate becomes a validator
            }
            weights[i]++;                  // further hits increase its reward weight
            totalWeights++;
            break;
        }

        value -= likehoods[i];
    }
}

This gives almost the original distribution of rewards over millions of samples:

0.101230
0.099113
0.799657
k06a
  • 17,755
  • 10
  • 70
  • 110