
I'm trying to implement weighted random numbers. I'm currently just banging my head against the wall and cannot figure this out.

In my project (Hold'em hand-ranges, subjective all-in equity analysis), I'm using Boost's random functions. So, let's say I want to pick a random number between 1 and 3 (so either 1, 2 or 3). Boost's Mersenne Twister generator works like a charm for this. However, I want the pick to be weighted for example like this:

1 (weight: 90)
2 (weight: 56)
3 (weight:  4)

Does Boost have some sort of functionality for this?

moswald
nhaa123

9 Answers


There is a straightforward algorithm for picking an item at random, where items have individual weights:

1) calculate the sum of all the weights

2) pick a random number that is 0 or greater and is less than the sum of the weights

3) go through the items one at a time, subtracting their weight from your random number, until you get the item where the random number is less than that item's weight

Pseudo-code illustrating this:

int sum_of_weight = 0;
for(int i=0; i<num_choices; i++) {
   sum_of_weight += choice_weight[i];
}
// random(n) here means: a uniform random integer in [0, n)
int rnd = random(sum_of_weight);
for(int i=0; i<num_choices; i++) {
  if(rnd < choice_weight[i])
    return i;
  rnd -= choice_weight[i];
}
assert(!"should never get here");

This should be straightforward to adapt to your boost containers and such.
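
For illustration only (not part of the original answer), here is a minimal C++11 sketch of that scan, assuming non-negative integer weights with a positive sum and using std::mt19937, the same Mersenne Twister engine Boost provides:

#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Returns an index in [0, weights.size()) with probability proportional to its weight.
// (Weights are assumed non-negative with a positive sum.)
int pick_weighted(const std::vector<int>& weights, std::mt19937& gen)
{
    int sum_of_weight = std::accumulate(weights.begin(), weights.end(), 0);
    std::uniform_int_distribution<int> pick(0, sum_of_weight - 1);   // uniform in [0, sum)
    int rnd = pick(gen);
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (rnd < weights[i])
            return static_cast<int>(i);
        rnd -= weights[i];
    }
    assert(!"should never get here");
    return -1;
}

// Usage with the weights from the question (index 0, 1 or 2 maps to values 1, 2, 3):
//   std::mt19937 gen(std::random_device{}());
//   int index = pick_weighted({90, 56, 4}, gen);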


If your weights are rarely changed but you often pick one at random, and as long as your container is storing pointers to the objects or is more than a few dozen items long (basically, you have to profile to know if this helps or hinders), then there is an optimisation:

By storing the cumulative weight sum in each item you can use a binary search to pick the item corresponding to the pick weight.
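
A rough sketch of that optimisation (again mine, not the answer's): build the cumulative sums once with std::partial_sum, then each pick is an std::upper_bound over them.

#include <algorithm>
#include <iterator>
#include <numeric>
#include <random>
#include <vector>

// Build once, or whenever the weights change: cumulative[i] = w[0] + ... + w[i].
std::vector<int> build_cumulative(const std::vector<int>& weights)
{
    std::vector<int> cumulative;
    std::partial_sum(weights.begin(), weights.end(), std::back_inserter(cumulative));
    return cumulative;
}

// Each pick is O(log N): the first index whose cumulative weight exceeds a uniform
// number in [0, total). Zero-weight items are never chosen because upper_bound uses
// a strict comparison. (Assumes at least one positive weight.)
int pick_by_binary_search(const std::vector<int>& cumulative, std::mt19937& gen)
{
    std::uniform_int_distribution<int> pick(0, cumulative.back() - 1);
    int rnd = pick(gen);
    return static_cast<int>(std::upper_bound(cumulative.begin(), cumulative.end(), rnd)
                            - cumulative.begin());
}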


If you do not know the number of items in the list, then there's a very neat algorithm called reservoir sampling that can be adapted to be weighted.
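
For the single-pick weighted case, the adaptation can be sketched as follows (my illustration, not from the answer): keep a running total of the weights seen so far, and let each new item replace the current pick with probability weight / running total.

#include <random>

// After the whole stream has been seen, item i ends up as the pick with
// probability weight_i / total_weight, without knowing the length in advance.
struct WeightedReservoir
{
    std::mt19937 gen{std::random_device{}()};
    double total = 0.0;
    int picked = -1;

    void offer(int item, double weight)
    {
        if (weight <= 0.0)
            return;                       // zero-weight items are never picked
        total += weight;
        std::bernoulli_distribution replace(weight / total);
        if (replace(gen))
            picked = item;
    }
};

// Usage:
//   WeightedReservoir r;
//   r.offer(1, 90); r.offer(2, 56); r.offer(3, 4);
//   // r.picked now holds 1, 2 or 3 with the requested weights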

exussum
Will
  • As an optimization you could use cumulative weights and use a binary search. But for only three different values this is probably overkill. – sellibitze Nov 19 '09 at 10:02
  • I assume when you say "in order" you are purposely omitting a pre-sort step on the choice_weight array, yes? – SilentDirge Oct 31 '11 at 19:17
  • @Aureis, there is no need to sort the array. I have tried to clarify my language. – Will Nov 01 '11 at 06:19
  • this is an awesome answer, i used the algorithm in a game to define the appearance frequencies of different types of characters – Emmett Butler Aug 04 '12 at 19:04
  • Several years late to the party, but in the above pseudo-code shouldn't "if(rnd < choice_weight[i])" be "if(rnd < CUMULATIVE_choice_weight[i])" ? – Wouter Aug 06 '13 at 18:37
  • or to be more correct: 'if(rnd <= CUMULATIVE_choice_weight[i])' – Wouter Aug 06 '13 at 18:56
  • @Wouter in the loop there's `rnd -= choice_weight[i]` so we never have to store cumulative weights. Regards `<=`, should an entry that has a weight of 0 and happens to be first in the list ever be picked? – Will Aug 06 '13 at 19:15
  • What if we have two or more elements with the same weight in the list? wouldn't the algorithm always pick the first element it finds (with the same weight)? – kobik Oct 07 '13 at 09:09
  • @kobik no, it would pick a random one of them. – Will Oct 07 '13 at 10:51
  • is this a russian roulette ? – v.oddou Mar 19 '14 at 04:46
  • @v.oddou russian roulette is played with a revolver and is quite different :) This method is however called "roulette" selection in many contexts e.g. genetic algos. – Will Mar 19 '14 at 06:35
  • @Will: Yes, but there is an algorithm of the same name. http://sirkan.iit.bme.hu/~szirmay/c29.pdf and http://en.wikipedia.org/wiki/Photon_mapping `A Monte Carlo method called Russian roulette is used to choose one of these actions` it comes up in buckets when googling for it. "russian roulette algorithm". You could argue that all of these people has the name wrong though. – v.oddou Mar 19 '14 at 07:48
  • @v.oddou I do think they have their names mixed up. When I did genetic programming, we called this "roulette wheel selection". – Will Mar 19 '14 at 11:16
  • @Will: Then my guess is that somewhere in the past decade, some researcher working on monte carlo rendering coined the term to refer to `path discarding` which could be thought of as `path killing`... – v.oddou Mar 20 '14 at 00:32
  • Note for future readers: the part *subtracting their weight from your random number* is easy to overlook, but crucial for the algorithm (I fell into the same trap as @kobik in their comment). – Frank Schmitt Mar 08 '16 at 11:40
  • @Will Good remark, you need < to exclude elements with a proba of 0. My random generator returns a value in a range, inclusive, so I was about to use <= but that would be incorrect. Instead, I need to handle the special case where `rnd == sum_of_weight` by replacing the assert with `return num_choices - 1;` In addition, I can iterate as long as `i < num_choices - 1` to avoid redundancy for the final case (I prefer exclusive ranges!). @Wouter An algorithm with cumulative weight is indeed possible (e.g. using std::partial_sum), if you can afford memory for an extra array of size `num_choices`. – hsandt Dec 19 '17 at 14:40
  • Just found this other post for a cumulative version using std: https://stackoverflow.com/questions/4116388/what-are-practical-uses-for-stls-partial-sum – hsandt Dec 21 '17 at 16:51
  • @hsandt I had encountered a case where rnd == choice_weight[i] so the code was returning the wrong index. I changed to <= but you mention that it is incorrect? I don't quite get your solution. My element was not at n-1 index position. Here's a reference https://stackoverflow.com/questions/32356801/weighted-sampling-in-fortran – Herman Toothrot Sep 14 '20 at 07:28
  • Well, I've re-read my comment and now I'm not satisfied as my solution is also unbalanced. To balance your probabilities you need to work with [a,b) (or [a,b[ in French notation) interval for index i, where b = a + weight[i]. For integers it's actually {a, ..., b-1}, empty if b <= a. This way if you have a probability of 0, you get a range like [5,5) = empty set so it can never be picked. And if your rnd falls in the highest range [sum_of_weight-last weight, sum_of_weight) then you return the last index, n-1. – hsandt Sep 15 '20 at 20:32
  • But here's the issue: with an inclusive range rnd = random(n) between 0 and n, it's possible that rnd == sum, and you will assert. To avoid increasing the probabilities of the first weights, I kept the < but added a final case when rnd == sum_of_weight to return n - 1. But now I realize this adds an extra weight to the last index n - 1, which is not better than replacing < with <= and adding an extra weight to index 0. Think about the extreme case where index 0 or n-1 have probability 0 and still get picked thx to 0 <= 0 or n-1 entering the final case, and you'll see. – hsandt Sep 15 '20 at 20:32
  • So my conclusion is: use an exclusive random in [0, sum_of_weight). If you only have an inclusive random function, but you're working with integers (and honestly I only know inclusive random functions for integers, since float random tends to return between 0 and 1 excluded), just take random_inclusive(0, sum_of_weight-1). I'm surprised I didn't think of that at the time, maybe I did choose this solution in the end, but I don't have the code to check anymore. I probably also stored cumulative weights in an array and used some std::upper_bound for easy comparison. – hsandt Sep 15 '20 at 20:36
  • this algorithm does not consider negative weight. – hjchin Sep 22 '20 at 07:52
  • @hjchin It doesn't really make sense for there to be negative weights, since the probability should always be positive (unless they were *all* negative weights, in which case, you could just negate the list at the beginning). – Varun Vejalla Jan 29 '21 at 02:06
  • you're right, in common scenario, the weight would be positive. But practically, some scenarios, in my case, it could be increased or decreased. the value might lesser than 0 when user decrease it. And, my workaround is to adjust the all values when there is a -ve value. – hjchin Jan 30 '21 at 10:20
  • Can someone explain why this works? – Yug Singh May 01 '23 at 08:46

Updated answer to an old question. You can easily do this in C++11 with just the std::lib:

#include <iostream>
#include <random>
#include <iterator>
#include <ctime>
#include <type_traits>
#include <cassert>

int main()
{
    // Set up distribution
    double interval[] = {1,   2,   3,   4};
    double weights[] =  {  .90, .56, .04};
    std::piecewise_constant_distribution<> dist(std::begin(interval),
                                                std::end(interval),
                                                std::begin(weights));
    // Choose generator
    std::mt19937 gen(std::time(0));  // seed as wanted
    // Demonstrate with N randomly generated numbers
    const unsigned N = 1000000;
    // Collect number of times each random number is generated
    double avg[std::extent<decltype(weights)>::value] = {0};
    for (unsigned i = 0; i < N; ++i)
    {
        // Generate random number using gen, distributed according to dist
        unsigned r = static_cast<unsigned>(dist(gen));
        // Sanity check
        assert(interval[0] <= r && r <= *(std::end(interval)-2));
        // Save r for statistical test of distribution
        avg[r - 1]++;
    }
    // Compute averages for distribution
    for (double* i = std::begin(avg); i < std::end(avg); ++i)
        *i /= N;
    // Display distribution
    for (unsigned i = 1; i <= std::extent<decltype(avg)>::value; ++i)
        std::cout << "avg[" << i << "] = " << avg[i-1] << '\n';
}

Output on my system:

avg[1] = 0.600115
avg[2] = 0.373341
avg[3] = 0.026544

Note that most of the code above is devoted to just displaying and analyzing the output. The actual generation is just a few lines of code. The output demonstrates that the requested "probabilities" have been obtained. To compare, you have to divide the requested weights by 1.5, since that is what they add up to.

nhaa123
Howard Hinnant
  • Just a reminder note on compilation of this example: requires C++11, i.e. use the -std=c++0x compiler flag, available from gcc 4.6 onwards. – Pete855217 May 20 '12 at 09:59
  • Care to just pick out the necessary parts that solve the problem? – Jonny May 27 '15 at 04:42
  • This is the best answer, but I think [`std::discrete_distribution`](http://en.cppreference.com/w/cpp/numeric/random/discrete_distribution) instead of `std::piecewise_constant_distribution` would have been even better. – Dan Mar 02 '18 at 19:28
  • @Dan, Yes, that would be another excellent way to do it. If you code it up and answer with it, I'll vote for it. I think the code could be pretty similar to what I have above. You would just need to add one to the generated output. And the input to the distribution would be simpler. A compare/contrast set of answers in this area might be valuable to the readers. – Howard Hinnant Mar 02 '18 at 23:34

If your weights change more slowly than they are drawn, C++11 discrete_distribution is going to be the easiest:

#include <ctime>
#include <random>
#include <vector>
std::vector<double> weights{90,56,4};
std::discrete_distribution<int> dist(std::begin(weights), std::end(weights));
std::mt19937 gen;
gen.seed(time(0));//if you want different results from different runs
int N = 100000;
std::vector<int> samples(N);
for(auto & i: samples)
    i = dist(gen);
//do something with your samples...

Note, however, that the C++11 discrete_distribution computes all of the cumulative sums on initialization. Usually, you want that because it speeds up the sampling time for a one-time O(N) cost. But for a rapidly changing distribution it will incur a heavy calculation (and memory) cost. For instance, if the weights represent how many of each item remain and you remove one every time you draw it, you will probably want a custom algorithm.

Will's answer https://stackoverflow.com/a/1761646/837451 avoids this overhead, but it will be slower to draw from than the C++11 version because it can't use binary search.
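
For instance, a rough sketch (mine, not from either answer) of the "weights are item counts and each draw removes one" case, where the linear scan needs no rebuilding between draws:

#include <numeric>
#include <random>
#include <vector>

// Draw one item index, then decrement its count. Because nothing is
// precomputed, changing a weight between draws costs nothing extra.
int draw_and_remove(std::vector<int>& counts, std::mt19937& gen)
{
    int total = std::accumulate(counts.begin(), counts.end(), 0);
    if (total == 0)
        return -1;                        // nothing left to draw
    std::uniform_int_distribution<int> pick(0, total - 1);
    int rnd = pick(gen);
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (rnd < counts[i]) {
            --counts[i];                  // "remove" one item of this kind
            return static_cast<int>(i);
        }
        rnd -= counts[i];
    }
    return -1;                            // unreachable when total > 0
}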

To see that discrete_distribution does this precomputation, you can inspect the relevant lines (/usr/include/c++/5/bits/random.tcc on my Ubuntu 16.04 + GCC 5.3 install):

  template<typename _IntType>
    void
    discrete_distribution<_IntType>::param_type::
    _M_initialize()
    {
      if (_M_prob.size() < 2)
        {
          _M_prob.clear();
          return;
        }

      const double __sum = std::accumulate(_M_prob.begin(),
                                           _M_prob.end(), 0.0);
      // Now normalize the probabilites.
      __detail::__normalize(_M_prob.begin(), _M_prob.end(), _M_prob.begin(),
                            __sum);
      // Accumulate partial sums.
      _M_cp.reserve(_M_prob.size());
      std::partial_sum(_M_prob.begin(), _M_prob.end(),
                       std::back_inserter(_M_cp));
      // Make sure the last cumulative probability is one.
      _M_cp[_M_cp.size() - 1] = 1.0;
    }
mmdanziger

What I do when I need weighted random numbers is to use a random number for the weight.

For example: I need to generate random numbers from 1 to 3 with the following weights:

  • 10% of a random number could be 1
  • 30% of a random number could be 2
  • 60% of a random number could be 3

Then I use:

weight = rand() % 10;

switch( weight ) {

    case 0:
        randomNumber = 1;
        break;
    case 1:
    case 2:
    case 3:
        randomNumber = 2;
        break;
    case 4:
    case 5:
    case 6:
    case 7:
    case 8:
    case 9:
        randomNumber = 3;
        break;
}

With this, the result randomly has a 10% probability of being 1, 30% of being 2 and 60% of being 3.

You can adapt it to your needs.
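
For instance, one way to adapt it (a hypothetical variation, not part of the answer) is to replace the switch with a lookup table in which each value appears as many times as its weight out of 10:

#include <cstdlib>

// Same 10% / 30% / 60% split as the switch above.
static const int table[10] = {1, 2, 2, 2, 3, 3, 3, 3, 3, 3};
int randomNumber = table[std::rand() % 10];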

Hope I could help you, Good Luck!

Chirry

Build a bag (or std::vector) of all the items that can be picked.
Make sure that the number of each item is proportional to its weight.

Example:

  • 1 60%
  • 2 35%
  • 3 5%

So have a bag with 100 items with 60 1's, 35 2's and 5 3's.
Now randomly sort the bag (std::random_shuffle)

Pick elements from the bag sequentially until it is empty.
Once empty re-randomize the bag and start again.
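
A minimal sketch of that bag construction (my code, not Martin's); see the comments below about how draining the shuffled bag differs from drawing with replacement:

#include <algorithm>
#include <random>
#include <vector>

int main()
{
    // Build the bag: each value appears as many times as its weight out of 100.
    std::vector<int> bag;
    bag.insert(bag.end(), 60, 1);   // sixty 1's
    bag.insert(bag.end(), 35, 2);   // thirty-five 2's
    bag.insert(bag.end(),  5, 3);   // five 3's

    std::mt19937 gen(std::random_device{}());
    std::shuffle(bag.begin(), bag.end(), gen);   // randomize the bag

    // Pick elements from the front until the bag is empty, then reshuffle and repeat.
    for (int value : bag) {
        (void)value;   // ... use value ...
    }
}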

Martin York
  • if you have a bag of red and blue marbles and you select a red marble from it and _don't_ replace it is the probability of selecting another red marble still the same? In the same way, your statement "Pick elements from the bag sequentially until it is empty" produces a totally different distribution than intended. – ldog Sep 23 '10 at 18:14
  • @ldog: I understand your argument but we are not looking for true randomness we are looking for a particular distribution. This technique guarantees the correct distribution. – Martin York Sep 23 '10 at 19:32
  • my point exactly is that you do not correctly produce distribution, by my previous argument. consider the simple counter example, say you put you have an array of 3 as `1,2,2` producing 1 1/3 of the time and 2 2/3. Randomize the array, pick the first, lets say a 2, now the next element you pick follows the distribution of 1 1/2 the time and 2 1/2 the time. Savvy? – ldog Sep 23 '10 at 22:55

Choose a random number on [0,1), which should be the default operator() for a boost RNG. Choose the item whose cumulative probability (normalized sum of weights so far) is >= that number:

template <class It,class P>
It choose_p(It begin,It end,P const& p)
{
    if (begin==end) return end;
    double sum=0.;
    for (It i=begin;i!=end;++i)
        sum+=p(*i);
    double choice=sum*random01();
    for (It i=begin;;) {
        choice -= p(*i);
        It r=i;
        ++i;
        if (choice<0 || i==end) return r;
    }
    return begin; //unreachable
}

Where random01() returns a double >=0 and <1. Note that the above doesn't require the probabilities to sum to 1; it normalizes them for you.

p is just a function assigning a probability to an item in the collection [begin,end). You can omit it (or use an identity) if you just have a sequence of probabilities.
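
A possible usage (mine, not from the answer): the random01() below is one assumption for the helper the code relies on (it would need to be declared before choose_p for the template to find it), and with a plain sequence of weights p can just be the identity.

#include <random>
#include <vector>

// One possible random01(): uniform double in [0, 1).
double random01()
{
    static std::mt19937 gen(std::random_device{}());
    return std::uniform_real_distribution<double>(0.0, 1.0)(gen);
}

// Picking from the question's weights (they need not sum to 1):
//   std::vector<double> w{90, 56, 4};
//   auto it = choose_p(w.begin(), w.end(), [](double x) { return x; });
//   int index = static_cast<int>(it - w.begin());   // 0, 1 or 2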

Jonathan Graehl

This is my understanding of a "weighted random", I've been using this recently. (Code is in Python but can be implemented in other langs)

Let's say you want to pick a random person and they don't have equal chances of being selected. You can give each person a "weight" or "chance" value:

choices = [("Ade", 60), ("Tope", 50), ("Maryamu", 30)]

You use their weights to calculate a score for each, then find the choice with the highest score:

import math
import random

highest = [None, 0]
for p in choices:
    score = math.floor(random.random() * p[1])
    if score > highest[1]:
        highest[0] = p
        highest[1] = score

print(highest)

For Ade the highest score they can get is 60, Tope 50 and so on, meaning that Ade has a higher chance of generating the largest score than the rest.

You can use any range of weights; the greater the difference, the more skewed the distribution. E.g. if Ade had a weight of 1000 they would almost always be chosen.

Test

votes = [{"name": "Ade", "votes": 0}, {"name": "Tope", "votes": 0}, {"name": "Maryamu", "votes": 0}]
for v in range(100):

        highest = [None, 0]
        for p in choices:
            score = math.floor(random.random() * p[1])

            if score > highest[1]:
                highest[0] = p
                highest[1] = score

        candidate = choices.index(highest[0]) # get index of winning person
        votes[candidate]["votes"] += 1 # increase vote count
print(votes)
# votes printed at the end. your results might be different
[{"name": "Ade", "votes": 45}, {"name": "Tope", "votes": 30}, {"name": "Maryamu", "votes": 25}]

Issues

It looks like the more voters there are, the more predictable the results. Welp

Hope this gives someone an idea...

LeanKhan

I have just implemented the solution given by Will:

#include <iostream>
#include <map>
#include <cstdlib>    // rand
#include <stdexcept>  // logic_error

using namespace std;


template < class T >
class WeightedRandomSample
{
public:
    void SetWeigthMap( map< T , unsigned int >& WeightMap )
    {
        m_pMap = &WeightMap;
    }
    
    T GetRandomSample()
    {
        unsigned int sum_of_weight = GetSumOfWeights();
        if(sum_of_weight == 0)
            throw std::logic_error("WeightedRandomSample: no weights set");
        unsigned int rnd = (rand() % sum_of_weight);
        map<T , unsigned int>& w_map = *m_pMap;
        typename map<T , unsigned int>::iterator it;
        for(it = w_map.begin() ; it != w_map.end() ; ++it )
        {
            unsigned int w = it->second;
            if(rnd < w)
                return (it->first);
            rnd -= w;
        }
        // unreachable if the weights are consistent
        throw std::logic_error("should never get here");
    }
    
    unsigned int GetSumOfWeights()
    {
        if(m_pMap == NULL)
            return 0;
        unsigned int sum = 0;
        map<T , unsigned int>& w_map = *m_pMap;
        typename map<T , unsigned int>::iterator it;
        
        for(it = w_map.begin() ; it != w_map.end() ; ++it )
        {
            sum += it->second;
        }
        return sum;
    }

    
protected:
    map< T , unsigned int>* m_pMap = NULL;
    
};

typedef pair<int , int> PAIR_INT_INT;
typedef map<PAIR_INT_INT ,unsigned int> mul_table_weighted_map;

int main()
{
    
    mul_table_weighted_map m;
    m[PAIR_INT_INT(2,3)] = 10;
    m[PAIR_INT_INT(4,5)] = 20;
    m[PAIR_INT_INT(2,5)] = 10;
    
    WeightedRandomSample<PAIR_INT_INT> WRS;
    WRS.SetWeigthMap(m);
    unsigned int sum_of_weight = WRS.GetSumOfWeights();
    cout <<"Sum of weights : " << sum_of_weight << endl;
    
    unsigned int number_of_test = 10000;
    cout << "testing " << number_of_test << " ..." << endl;
    map<PAIR_INT_INT , unsigned int> check_map;
    for(int i = 0 ; i < number_of_test ; i++)
    {
        PAIR_INT_INT res = WRS.GetRandomSample();
        check_map[res]++;
        //cout << i+1 << ": random = " << res.first << " * " << res.second << endl;
    }
    cout << "results: " << endl;
    
    for(auto t : check_map)
    {
        PAIR_INT_INT p = t.first;
        unsigned int expected = (number_of_test * m[p]) / sum_of_weight;
        cout << " pair " << p.first << " * " << p.second 
            << ", counted = " << t.second
            << ", expected = " << expected
            << endl;
    }

    return 0;
}
mohtashami740

For example, a random index into a vector of weights, where weights[i] is the weight of index i, can be generated this way:

#include <algorithm>  // random_shuffle (deprecated in C++14, removed in C++17)
#include <iostream>
#include <vector>
using namespace std;

int getWeightedRandomNumber(vector<int> weights){
  vector<int> vec;
  for(int i=0; i<weights.size(); i++){
    for(int j=0; j<weights[i]; j++){
      vec.push_back(i);
    }
  }
  random_shuffle(vec.begin(), vec.end());
  return vec.front();
}

int main() 
{
  vector<int> v{2,4,5,100,1,2,4,4};
  for(int i=0; i<100; i++){
    cout<<getWeightedRandomNumber(v)<<endl;
  }
  
}

Since we are constructing another vector with (number of elements) ≈ (current number of elements) * (mean weight), this approach might not work well when dealing with large data.

Light Yagami