
I want to efficiently generate a random sample of unique (non-repeated) integers in a (closed) range [0, rnd_max], where every number in the range can be chosen, and each is associated with a sampling weight (the larger the weight, the more likely the number should be to get chosen, with probability exactly weight[i] / sum(weight[not_taken]) of being chosen next if it's not already taken in the sample).

I see C++ has std::discrete_distribution, which can generate weighted random integers, but if I use it to generate integers and discard repeated ones, then when the sample to take is large relative to the length of the possible range, there will be many failed draws that land on already-taken numbers, making the procedure highly inefficient. It's not clear to me whether Floyd's algorithm has some extension to the case with sampling weights (https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin) - I personally cannot think of one.
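
For reference, the rejection-based version I'm describing would look roughly like this (just a sketch to illustrate the problem, with made-up function names - not a proposed solution):

#include <vector>
#include <random>
#include <unordered_set>

/* Rejection approach: draw weighted indices and discard repeats. Conditioning on
   the index not being taken gives exactly p[i] = w[i] / sum(w[not taken]), but when
   ntake is close to the range size most draws hit already-taken indices. */
std::vector<int> sample_with_rejection(const std::vector<double> &weights,
                                       size_t ntake, std::mt19937 &rng)
{
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    std::unordered_set<int> taken;
    std::vector<int> out;
    out.reserve(ntake);
    while (out.size() < ntake) {
        int candidate = dist(rng);
        if (taken.insert(candidate).second) /* true only if not already chosen */
            out.push_back(candidate);
    }
    return out;
}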

It's also possible to, e.g., keep using std::discrete_distribution while dropping each chosen weight to zero, or to perform a partial weighted shuffle as in this answer: C++. Weighted std::shuffle - but in that answer, std::discrete_distribution is re-generated at each iteration, so the running time becomes quadratic (it needs to cycle through all the weights passed to it every time).
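
The zero-out-the-weight variant would be something along these lines (again only an illustrative sketch; the per-iteration re-construction of the distribution is what makes it quadratic):

#include <vector>
#include <random>

/* Zero-out-and-rebuild approach: correct, but constructing std::discrete_distribution
   scans all n weights in every round, so the total cost is O(n * k). */
std::vector<int> sample_rebuild(std::vector<double> weights, size_t ntake,
                                std::mt19937 &rng)
{
    std::vector<int> out;
    out.reserve(ntake);
    for (size_t round = 0; round < ntake; round++) {
        std::discrete_distribution<int> dist(weights.begin(), weights.end());
        int chosen = dist(rng);
        out.push_back(chosen);
        weights[chosen] = 0; /* cannot be chosen again */
    }
    return out;
}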

I'm wondering what an efficient way of doing weighted random sampling of unique integers in C++ would be, one that works well for varying sample sizes (e.g. from 1% to 90% of the numbers in the available range).

#include <vector>
#include <random>
#include <algorithm>

int main()
{
    size_t rnd_max = 1e5;
    size_t ntake = 1e3;

    unsigned int seed = 12345;
    std::mt19937 rng(seed);
    std::gamma_distribution<double> rgamma(1.0, 1.0);
    std::vector<double> weights(rnd_max);
    for (double &w : weights) w = rgamma(rng);

    std::vector<int> chosen_sample(ntake);
    // sampler goes here...

    return 0;
}
  • I'm not so familiar with C++ distributions, so I don't know one. I can tell you how to implement it yourself in `O(n log^2 n)` total time (`log^2 n` time for each sampling) using `uniform_distribution`. Does it interest you? – dyukha Aug 21 '19 at 21:52
  • If they're "not repeated" then they're not random! – Adrian Mole Aug 21 '19 at 22:02
  • @dyukha : yes, please, that'd be great too. @Adrian: yes they are: imagine the following procedure: start with an empty set, then add elements sequentially with `p[i] = {w[i] / sum(w[not taken]) if not taken, 0 otherwise}` - the result is random non-repeated numbers. – anymous.asker Aug 21 '19 at 22:08

2 Answers


There is a nice way to solve this problem using augmented binary search trees. It gives an O(k log n)-time algorithm for sampling k elements at random.

The idea goes like this. Let's imagine that you stash all your elements in an array, in sorted order, with each element tagged with its weight. You could then solve this problem (inefficiently) as follows:

  1. Generate a random number between 0 and the total weight of all elements.
  2. Iterate over the array until you find an element such that the random number is in the "range" spanned by that element. Here, the "range" represents the window of weights from the start of that element to the start of the next element.
  3. Remove that element and repeat.
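
In code, that naive version might look something like this (just a rough sketch, assuming each value is paired with its weight; the function name is made up):

#include <vector>
#include <random>
#include <utility>

/* Naive O(k * n) version: each pick scans the array once, then erases the chosen
   entry. 'items' pairs each value with its weight. */
std::vector<int> sample_naive(std::vector<std::pair<int, double>> items,
                              int k, std::mt19937 &rng)
{
    std::vector<int> result;
    result.reserve(k);
    double total = 0;
    for (const auto &it : items) total += it.second;

    for (int round = 0; round < k; round++) {
        double r = std::uniform_real_distribution<double>(0, total)(rng);
        double cumulative = 0;
        for (size_t i = 0; i < items.size(); i++) {
            cumulative += items[i].second;
            if (r < cumulative || i + 1 == items.size()) { /* r falls in this element's range */
                result.push_back(items[i].first);
                total -= items[i].second;
                items.erase(items.begin() + i); /* O(n) removal */
                break;
            }
        }
    }
    return result;
}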

If you implement this as mentioned above, each pass of picking a random element will take time O(n): you have to iterate over all the elements of the array, then remove a single element somewhere once you've picked it. That's not great; the overall runtime is O(kn).

We can slightly improve upon this idea in the following way. When storing all the elements in the array, have each element store both its actual weight and the combined weight of all elements that come before it. Now, to find which element you're going to sample, you don't need to use a linear search. You can instead use a binary search over the array to locate your element in time O(log n). However, the overall runtime of this approach is still O(n) per iteration, since that's the cost of removing the element you picked, so we're still in O(kn) territory.
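
As a rough sketch, the binary-search half of that variant could be written like this (the O(n) removal/update step is deliberately left out, since it's the part that keeps the total at O(kn)):

#include <vector>
#include <random>
#include <algorithm>
#include <numeric>

/* prefix[i] = w[0] + ... + w[i]; locating the sampled element is then a binary search. */
std::vector<double> build_prefix(const std::vector<double> &weights)
{
    std::vector<double> prefix(weights.size());
    std::partial_sum(weights.begin(), weights.end(), prefix.begin());
    return prefix;
}

int pick_index(const std::vector<double> &prefix, std::mt19937 &rng)
{
    double r = std::uniform_real_distribution<double>(0, prefix.back())(rng);
    /* first position whose cumulative weight exceeds r */
    return static_cast<int>(std::upper_bound(prefix.begin(), prefix.end(), r) - prefix.begin());
}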

However, if you store the elements not in a sorted array where each element stores the weight of all elements before it, but in a balanced binary search tree where each element stores the weight of all elements in its left subtree, you can simulate the above algorithm (the binary search gets replaced with a walk over the tree). Moreover, this has the advantage that removing an element from the tree can be done in time O(log n), since it's a balanced BST.

(If you're curious how you'd do the walk to find the element that you want, do a quick search for "order statistics tree." The idea here is essentially a generalization of this idea.)

Following the advice from @dyukha, you can get O(log n) time per operation by building a perfectly-balanced tree from the items in time O(n) (the items don't actually have to be sorted for this technique to work - do you see why?), then using the standard tree deletion algorithm each time you need to remove something. This gives an overall solution runtime of O(k log n).

  • Oh, nice one! I had a kind-of similar general idea, but I didn't think about balanced trees. I wanted to use binary search + a Fenwick tree, which is `O(log^2 n)`. – dyukha Aug 21 '19 at 23:18
  • @anymous.asker, balanced trees can be a pain, but you can avoid that: you can work with an unbalanced BST and add values to the tree in random order (so shuffle first, and then add). The resulting tree will be balanced with high probability. Another option is to just construct a perfectly balanced tree from the beginning. – dyukha Aug 21 '19 at 23:39
  • @dyukha Oh, the idea to just use a perfectly-balanced tree from the start because you're only deleting things and therefore can't increase the height is a really good one! I'll edit the answer to include that. :-) – templatetypedef Aug 21 '19 at 23:48
  • @anymous.asker In the case when you don't need to update the vector of weights, it is better to store the "tree" in a flattened version - as a vector. You don't remove elements but temporarily set their weight to zero (and update the sums of weights of all their parents each time you pick a sampled integer; at the end you ought to restore the initial values). – ALX23z Aug 22 '19 at 04:02
  • Consider providing pseudocode on how this idea can be implemented. Also, note that C++ includes `std::map`, which is the closest in spirit to a red-black tree in standard C++. – Peter O. Aug 22 '19 at 12:50

Putting the answers into code:

#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#define pow2(n) ( 1 << (n) ) /* https://stackoverflow.com/questions/101439/the-most-efficient-way-to-implement-an-integer-based-power-function-powint-int */



int main()
{
    /* random and very biased set of weights */
    std::vector<double> weights{1, 1, 10000, 1, 30000, 1, 1, 500000};
    int rnd_max = weights.size();
    int ntake = 3;

    /* initialize random sampler */
    unsigned int seed = 12345;
    std::mt19937 rng(seed);

    /* determine the smallest power of two that is >= the number of weights */
    int tree_levels = ceil(log2((double) rnd_max));

    /* initialize vector with place-holders for perfectly-balanced tree */
    std::vector<double> tree_weights(pow2(tree_levels + 1));

    /* copy the weights into the tree leaves */
    int offset = pow2(tree_levels) - 1;
    for (int ix = 0; ix < rnd_max; ix++) {
        tree_weights[ix + offset] = weights[ix];
    }
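    /* propagate sums upwards: each parent accumulates the weights of its two children */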
    for (int ix = pow2(tree_levels+1) - 1; ix > 0; ix--) {
        tree_weights[(ix - 1) / 2] += tree_weights[ix];
    }

    /* sample 'ntake' elements without replacement by descending the tree */
    double rnd_subrange, w_left;
    double curr_subrange;
    int curr_ix;
    std::vector<int> sampled(ntake);
    for (int el = 0; el < ntake; el++) {

        /* go down the tree by drawing a random number and
           checking if it falls in the left or right sub-ranges */
        curr_ix = 0;
        curr_subrange = tree_weights[0];
        for (int lev = 0; lev < tree_levels; lev++) {
            rnd_subrange = std::uniform_real_distribution<double>(0, curr_subrange)(rng);
            w_left = tree_weights[2 * curr_ix + 1];
            curr_ix = 2 * curr_ix + 1 + (rnd_subrange >= w_left);
            curr_subrange = tree_weights[curr_ix];
        }

        /* finally, add element from this iteration */
        sampled[el] = curr_ix - offset;

        /* now remove the weight of the chosen element */
        tree_weights[curr_ix] = 0;
        for (int lev = 0; lev < tree_levels; lev++) {
            curr_ix = (curr_ix - 1) / 2;
            tree_weights[curr_ix] =   tree_weights[2 * curr_ix + 1]
                                    + tree_weights[2 * curr_ix + 2];
        }
    }

    std::cout << "sampled integers: [ ";
    for (int a : sampled) std::cout << a << " ";
    std::cout << "]" << std::endl;
    return 0;
}

Output as expected from the biased weights:

sampled integers: [ 7 4 2 ]

(Note that the time complexity is O(n) for building the tree of node-weight sums, plus O(k * log2(n)) for sampling the k elements - better than the naive O(n * k).)

EDIT: updated answer to work also with potentially non-unique weights.

EDIT2: small changes for a more numerically-robust procedure.
