
I have a map of items with some probability distribution:

Map<SingleObjectiveItem, Double> itemsDistribution;

Given a certain m I have to generate a Set of m elements sampled from the above distribution.

So far I have been using the naive way of doing it:

while (mySet.size() < m)
   mySet.add(getNextSample(itemsDistribution));

The getNextSample(...) method fetches an object from the distribution as per its probability. Now, as m increases, the performance suffers severely. For m = 500 and itemsDistribution.size() = 1000 elements, there is too much thrashing and the function remains in the while loop for too long. Generate 1000 such sets and you have an application that crawls.

Is there a more efficient way to generate a unique set of random numbers with a "predefined" distribution? Most collection shuffling techniques and the like are uniformly random. What would be a good way to address this?

UPDATE: The loop will call getNextSample(...) "at least" 1 + 2 + 3 + ... + m = m(m+1)/2 times. That is, in the first iteration we'll definitely get a sample for the set. In the 2nd iteration it may be called at least twice, and so on. If getNextSample is sequential in nature, i.e., goes through the entire cumulative distribution to find the sample, then the run time complexity of the loop is at least n*m(m+1)/2, where n is the number of elements in the distribution. If m = cn, 0 < c <= 1, then the loop is at least Ω(n^3). And that too is the lower bound!

If we replace the sequential search by binary search, the complexity would be at least Ω(n^2 log n). More efficient, but perhaps not by a large margin.

Also, removing from the distribution is not possible since I call the above loop k times, to generate k such sets. These sets are part of a randomized 'schedule' of items. Hence a 'set' of items.

PhD
  • Can a single element be picked more than once? If not, what is the exact formal meaning of the values in the map? It can't be just the probability to pick an element, since when we have already picked some elements and can not touch them again, the values lose certain properties of probabilities. Most obviously, they no longer sum up to 1. Furthermore, the order of picking items may interfere with the overall probability of picking a set. For example, from {1,2,3}, picking 1 then 2 could have different probability than picking 2 then 1 - probably you want consistency in this regard. – Gassa Mar 15 '14 at 10:53

8 Answers


Start out by generating a number of random points in two dimensions.


Then apply your distribution.


Now keep all points that fall under the distribution and pick their x coordinates; those are your random numbers with the requested distribution.

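A minimal Java sketch of this rejection idea, applied to the question's weighted map (class and member names are illustrative, and non-negative weights with at least one positive weight are assumed):

import java.util.*;

class RejectionSampler<T> {
    private final List<T> items;
    private final Map<T, Double> weights;
    private final double maxWeight;                // height of the bounding box
    private final Random rnd = new Random();

    RejectionSampler(Map<T, Double> weights) {
        this.weights = weights;
        this.items = new ArrayList<>(weights.keySet());
        this.maxWeight = Collections.max(weights.values());
    }

    T sample() {
        while (true) {
            T candidate = items.get(rnd.nextInt(items.size()));  // random x
            double y = rnd.nextDouble() * maxWeight;             // random y within the box
            if (y < weights.get(candidate)) return candidate;    // under the "curve": accept
        }
    }
}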

Ebbe M. Pedersen
  • Not sure this will be fast: Given a uniform distribution over the input elements, the chance of accepting a point is 1/n, i.e. on average, we will have to sample m*n points to get a set of size m. That's quite a lot. – meriton Mar 15 '14 at 13:04
  • Well - with a normal distribution and covering 3 standard deviations, approx 1/3 of the points would be under the graph. – Ebbe M. Pedersen Mar 15 '14 at 15:36
  • But other distributions are far worse than that. I just wanted to point out that rejection sampling requires a reasonably small bounding box around the distribution, which does not exist for all distributions. That is, your answer will only work well for some distributions. – meriton Mar 15 '14 at 17:36

If you are not too concerned with randomness properties, then I would do it like this:

  1. create buffer for pseudo-random numbers

    double buff[MAX]; // [edit1] double pseudo random numbers

    • MAX is the buffer size and should be big enough ... 1024*128 for example
    • the element type can be anything (float, int, DWORD, ...)
  2. fill buffer with numbers

    you have a range of numbers x = < x0,x1 > and a probability function probability(x) defined by your probability distribution, so do this:

    for (i=0,x=x0;x<=x1;x+=stepx)
     for (j=0,n=probability(x)*MAX,q=0.1*stepx/n;(j<n)&&(i<MAX);j++,i++) // [edit1] unique pseudo-random numbers
      buff[i]=x+(double(j)*q);                                           // [edit1] small per-copy drift keeps the numbers distinct
    

    The stepx is your accuracy for items (for integral types it is 1). Now the buff[] array has the same distribution as you need, but it is not pseudo-random. The i < MAX check in the inner loop guards against array overruns; at the end, the real size of buff[] is i (it can be less than MAX due to rounding).

  3. shuffle buff[]

    do just a few loops of swapping buff[i] and buff[j], where i is the loop variable and j is pseudo-random in <0, MAX)

  4. write your pseudo-random function

    it just returns the next number from the buffer: the first call returns buff[0], the second buff[1], and so on. For standard generators, when you hit the end of buff[] you would shuffle buff[] again and start from buff[0]. But as you need unique numbers, you must not reach the end of the buffer, so set MAX big enough for your task, otherwise uniqueness will not be assured.

[Notes]

MAX should be big enough to store the whole distribution you want. If it is not big enough then items with low probability can be missing completely.

[edit1] - tweaked the answer a little to match the question's needs (pointed out by meriton, thanks)

PS. The complexity of initialization is O(N) and of getting a number is O(1).
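A rough Java rendering of the buffer approach above (the probability function, range, and step are passed in purely for the sake of the example; names are illustrative):

import java.util.Random;
import java.util.function.DoubleUnaryOperator;

class PrecomputedBuffer {
    static final int MAX = 1024 * 128;     // buffer size, big enough for the task
    final double[] buff = new double[MAX];
    int size = 0;                          // real number of filled entries
    int next = 0;                          // read position for getNumber()

    PrecomputedBuffer(double x0, double x1, double stepx, DoubleUnaryOperator probability) {
        // step 2: value x appears roughly probability(x)*MAX times, each copy shifted
        // by a tiny unique drift so all entries stay distinct
        for (double x = x0; x <= x1 && size < MAX; x += stepx) {
            int n = (int) (probability.applyAsDouble(x) * MAX);
            if (n <= 0) continue;
            double q = 0.1 * stepx / n;
            for (int j = 0; j < n && size < MAX; j++, size++) {
                buff[size] = x + j * q;
            }
        }
        // step 3: Fisher-Yates shuffle of the filled part
        Random rnd = new Random();
        for (int i = size - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            double tmp = buff[i]; buff[i] = buff[j]; buff[j] = tmp;
        }
    }

    // step 4: each call returns the next (unique) pseudo-random number;
    // the caller must not ask for more than 'size' numbers
    double getNumber() {
        return buff[next++];
    }
}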

Spektre
  • This appears incorrect: You don't ensure that the returned numbers are unique. Also, the weights may not be integers (the question uses double to represent them). – meriton Mar 15 '14 at 13:14
  • @meriton Oh, I overlooked the unique part before ... but the numbers can be double, as I wrote there. The buff fill needs just a little tweak to get unique numbers only (add some small accuracy drift < stepx/probability(x) to it while filling), but of course you cannot use the buffer twice then, so you are limited to using the rnd only MAX times – Spektre Mar 15 '14 at 19:48

The problem is unlikely to be the loop you show:

Let n be the size of the distribution, and I be the number of invocations of getNextSample. We have I = sum_i(C_i), where C_i is the number of invocations of getNextSample while the set has size i. To find E[C_i], observe that C_i is the inter-arrival time of a Poisson process with λ = 1 - i / n, and therefore exponentially distributed with parameter λ. Therefore, E[C_i] = 1 / λ = 1 / (1 - i / n) <= 1 / (1 - m / n), and hence E[I] < m / (1 - m / n).

That is, sampling a set of size m = n/2 will take, on average, less than 2m = n invocations of getNextSample. If that is "slow" and "crawls", it is likely because getNextSample is slow. This is actually unsurprising, given the unsuitable way the distribution is passed to the method (because the method will, of necessity, have to iterate over the entire distribution to find a random element).

The following should be faster (if m < 0.8 n):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Distribution<T> {
    private double[] cumulativeWeight;
    private T[] item;
    private double totalWeight;

    @SuppressWarnings("unchecked")
    Distribution(Map<T, Double> probabilityMap) {
        int i = 0;

        cumulativeWeight = new double[probabilityMap.size()];
        item = (T[]) new Object[probabilityMap.size()];

        for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
            item[i] = entry.getKey();
            totalWeight += entry.getValue();
            cumulativeWeight[i] = totalWeight;
            i++;
        }
    }

    T randomItem() {
        double weight = Math.random() * totalWeight;
        int index = Arrays.binarySearch(cumulativeWeight, weight);
        if (index < 0) {
            index = -index - 1; // binarySearch returns (-(insertion point) - 1) on a miss
        }
        return item[index];
    }

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) {
            set.add(randomItem());
        }
        return set;
    }
}



import java.util.HashMap;
import java.util.Set;

public class Test {

    public static void main(String[] args) {
        int max = 1_000_000;
        HashMap<Integer, Double> probabilities = new HashMap<>();
        for (int i = 0; i < max; i++) {
            probabilities.put(i, (double) i);
        }

        Distribution<Integer> d = new Distribution<>(probabilities);
        Set<Integer> set = d.randomSubset(max / 2);
        //System.out.println(set);
    }
}

The expected runtime is O(m / (1 - m / n) * log n). On my computer, a subset of size 500_000 of a set of 1_000_000 is computed in about 3 seconds.

As we can see, the expected runtime approaches infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more complex approach should work better:

Set<T> randomSubset(int size) {
    Set<T> set = new HashSet<>();
    while (set.size() < size) {
        T randomItem = randomItem();
        remove(randomItem); // removes the item from the distribution
        set.add(randomItem);
    }
    return set;
}

To efficiently implement remove requires a different representation for the distribution, for instance a binary tree where each node stores the total weight of the subtree whose root it is.
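One way to realize that idea is a Fenwick (binary indexed) tree over the weights: sampling walks down the implicit tree in O(log n), and "removing" an item just zeroes its weight. A sketch, with illustrative names and assuming non-negative weights:

import java.util.*;

class RemovableDistribution<T> {
    private final T[] items;
    private final double[] tree;       // Fenwick tree over the weights, 1-based
    private double totalWeight;
    private final Random rnd = new Random();

    @SuppressWarnings("unchecked")
    RemovableDistribution(Map<T, Double> weights) {
        items = (T[]) new Object[weights.size()];
        tree = new double[weights.size() + 1];
        int i = 0;
        for (Map.Entry<T, Double> e : weights.entrySet()) {
            items[i] = e.getKey();
            add(i + 1, e.getValue());
            totalWeight += e.getValue();
            i++;
        }
    }

    private void add(int pos, double delta) {          // standard Fenwick point update
        for (; pos < tree.length; pos += pos & -pos) tree[pos] += delta;
    }

    private double weightOf(int pos) {                 // prefix(pos) - prefix(pos - 1)
        double w = tree[pos];
        for (int p = pos - 1, q = pos - (pos & -pos); p > q; p -= p & -p) w -= tree[p];
        return w;
    }

    // draws an item with probability proportional to its remaining weight, then removes it
    T sampleAndRemove() {
        double target = rnd.nextDouble() * totalWeight;
        int pos = 0;
        // find the largest index whose prefix sum is <= target
        for (int step = Integer.highestOneBit(tree.length - 1); step > 0; step >>= 1) {
            int next = pos + step;
            if (next < tree.length && tree[next] <= target) {
                target -= tree[next];
                pos = next;
            }
        }
        pos++;                                         // 1-based index of the chosen item
        double w = weightOf(pos);
        add(pos, -w);                                  // "remove" the item by zeroing its weight
        totalWeight -= w;
        return items[pos - 1];
    }

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) set.add(sampleAndRemove());
        return set;
    }
}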

But that is rather complicated, so I wouldn't go that route if m is known to be significantly smaller than n.

meriton
  • How's it `1/(1-c)`? If `getNextSample(...)` runs in `O(n)` (sequential, unfortunately) then the loop would be expected to run `1 + 2 + 3 + ... + m = m(m+1)/2`. If `m = cn` that's easily `O(n^2)`. A binary search would definitely make it more efficient. But I don't think it'll be a wide margin. Since the expected value of the loop would still be `m(m+1)/2`. Am I missing something here? – PhD Mar 15 '14 at 16:09
  • I added a proof sketch for that runtime analysis. Also, n denotes the size of the collection we sample from. Therefore, E[execution time of your algorithm] = O(E[I] * n) = O(m / (1 - c) * n). That said, why you think that replacing a linear search by a binary search will not yield a significant improvement in execution time is beyond me. – meriton Mar 15 '14 at 17:32
  • Oh it will. I'm questioning if the speed up would be really significant. I'm working on changing it to a bSearch to see how well it works. A quick question: Why do you assume an exponential distribution? Conceptually speaking... – PhD Mar 15 '14 at 17:48
  • Well, if n = 1000, n / log(n) ~= 100 ... I'll edit about why the distribution is exponential. – meriton Mar 15 '14 at 17:58
  • You were right. It's not "only" the loop that was the problem but also the way the data was captured. I took your suggestion and created a similar class to do this. It did speed up substantially. Only changing the loop to do a binary search didn't help much as I had guessed. – PhD Mar 16 '14 at 03:48

You should implement your own random number generator (using a Monte Carlo method or any good uniform generator like the Mersenne Twister) and base it on the inversion method.

For example, the exponential law: generate a uniform random number u in [0,1]; then your exponentially distributed random variable is ln(1-u)/(-lambda), lambda being the exponential law's parameter and ln the natural logarithm.
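A minimal Java sketch of this inversion step for the exponential example (java.util.Random is used as the uniform source; lambda is an arbitrary illustrative value):

import java.util.Random;

public class InversionSampling {
    public static void main(String[] args) {
        Random rnd = new Random();
        double lambda = 2.0;                     // illustrative rate parameter
        double u = rnd.nextDouble();             // uniform in [0, 1)
        double x = Math.log(1 - u) / (-lambda);  // inverse CDF of the exponential law
        System.out.println(x);
    }
}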

Hope it'll help ;).

Hichamov

I think you have two problems:

  1. Your itemDistribution doesn't know you need a set, so when the set you are building gets large you will pick a lot of elements that are already in the set. If you start with a full set and remove elements, you will run into the same problem for very small sets.

    Is there a reason why you don't remove the element from the itemDistribution after you picked it? Then you wouldn't pick the same element twice?

  2. The choice of data structure for itemDistribution looks suspicious to me. You want the getNextSample operation to be fast. Doesn't the map from values to probability force you to iterate through large parts of the map for each getNextSample? I'm no good at statistics, but couldn't you represent the itemDistribution the other way around, like a map from probability, or maybe from the sum of all smaller probabilities plus the element's own probability, to an element of the set?
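A hedged sketch of that second idea in Java: a map from cumulative probability to element, here a TreeMap so getNextSample becomes an O(log n) lookup (names are illustrative; strictly positive weights are assumed):

import java.util.*;

class CumulativeDistribution<T> {
    private final NavigableMap<Double, T> cumulative = new TreeMap<>();
    private final double totalWeight;
    private final Random rnd = new Random();

    CumulativeDistribution(Map<T, Double> weights) {
        double running = 0;
        for (Map.Entry<T, Double> e : weights.entrySet()) {
            if (e.getValue() <= 0) continue;        // assume strictly positive weights
            running += e.getValue();
            cumulative.put(running, e.getKey());    // key = sum of all weights so far
        }
        totalWeight = running;
    }

    T getNextSample() {
        // first entry whose cumulative weight exceeds the random point
        return cumulative.higherEntry(rnd.nextDouble() * totalWeight).getValue();
    }
}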

monocell

Your performance depends on how your getNextSample function works. If you have to iterate over all probabilities when you pick the next item, it might be slow.

A good way to pick several unique random items from a list is to first shuffle the list and then pop items off the list. You can shuffle the list once with the given distribution. From then on, picking your m items is just popping them off the list.

Here's an implementation of a probabilistic shuffle:

List<Item> prob_shuffle(Map<Item, int> dist)
{
    int n = dist.length;
    List<Item> a = dist.keys();
    int psum = 0;
    int i, j;

    for (i in dist) psum += dist[i];

    for (i = 0; i < n; i++) {
        int ip = rand(psum);    // 0 <= ip < psum
        int jp = 0;

        for (j = i; j < n; j++) {
            jp += dist[a[j]];
            if (ip < jp) break;
        }

        psum -= dist[a[j]];

        Item tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
    return a;
}

This is not Java, but pseudocode after an implementation in C, so please take it with a grain of salt. The idea is to append items to the shuffled area by continuously picking items from the unshuffled area.

Here, I used integer probabilities. (The probabilities don't have to add up to a special value; it's just "bigger is better".) You can use floating-point numbers, but because of inaccuracies you might end up going beyond the array when picking an item. You should use item n - 1 then. If you add that safety net, you could even have items with zero probability that always get picked last.

There might be a method to speed up the picking loop, but I don't really see how. The swapping renders any precalculations useless.
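A rough Java rendering of the pseudocode above (integer weights as in the answer; strictly positive weights are assumed so rand never gets a zero bound):

import java.util.*;

class ProbShuffle {
    static <T> List<T> probShuffle(Map<T, Integer> dist) {
        List<T> a = new ArrayList<>(dist.keySet());
        int n = a.size();
        int psum = 0;
        for (int w : dist.values()) psum += w;

        Random rnd = new Random();
        for (int i = 0; i < n; i++) {
            int ip = rnd.nextInt(psum);           // 0 <= ip < psum
            int jp = 0;
            int j;
            for (j = i; j < n; j++) {             // walk the unshuffled tail until ip is covered
                jp += dist.get(a.get(j));
                if (ip < jp) break;
            }
            if (j == n) j = n - 1;                // safety net from the answer, for inexact weights
            psum -= dist.get(a.get(j));           // the picked item leaves the unshuffled pool
            Collections.swap(a, i, j);
        }
        return a;
    }
}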

M Oehm

Accumulate your probabilities in a table

               Probability
Item       Actual  Accumulated
Item1       0.10      0.10
Item2       0.30      0.40
Item3       0.15      0.55
Item4       0.20      0.75
Item5       0.25      1.00

Generate a random number between 0.0 and 1.0 and do a binary search for the first item with an accumulated sum that is greater than your generated number. This item is then chosen with the desired probability.
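A tiny Java sketch of that lookup, using the accumulated values from the table above (items and weights are the illustrative ones from the table):

import java.util.Arrays;
import java.util.Random;

class AccumulatedTable {
    static final String[] ITEMS = {"Item1", "Item2", "Item3", "Item4", "Item5"};
    static final double[] ACCUMULATED = {0.10, 0.40, 0.55, 0.75, 1.00};

    static String pick(Random rnd) {
        double r = rnd.nextDouble();            // 0.0 <= r < 1.0
        int i = Arrays.binarySearch(ACCUMULATED, r);
        if (i < 0) i = -i - 1;                  // insertion point = first accumulated value > r
        return ITEMS[i];
    }
}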

Ebbe M. Pedersen

Ebbe's method is called rejection sampling.

I sometimes use a simple method based on an inverse cumulative distribution function, which is a function that maps a number X between 0 and 1 onto the value axis of the distribution. Then you just generate a uniformly distributed random number between 0 and 1 and apply the function to it. That function is also called the "quantile function".

For example, suppose you want to generate a normally distributed random number. Its cumulative distribution function is called Phi. The inverse of that is called the probit function. There are many ways to generate normal variates, and this is just one example.

You can easily construct an approximate cumulative distribution function for any univariate distribution you like, in the form of a table. Then you can just invert it by table-lookup and interpolation.
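As a hedged illustration of that table approach, here is a small Java sketch that tabulates an approximate CDF at a few points and inverts it by a simple search plus linear interpolation (the tabulated values are purely illustrative):

import java.util.Random;

class TabulatedInverseCdf {
    static final double[] X   = {0.0, 1.0, 2.0, 3.0, 4.0};  // values of the variable
    static final double[] CDF = {0.0, 0.4, 0.7, 0.9, 1.0};  // approximate CDF at those values

    static double sample(Random rnd) {
        double u = rnd.nextDouble();
        int i = 1;
        while (i < CDF.length - 1 && CDF[i] < u) i++;         // first index with CDF[i] >= u
        double t = (u - CDF[i - 1]) / (CDF[i] - CDF[i - 1]);  // interpolate within the segment
        return X[i - 1] + t * (X[i] - X[i - 1]);
    }
}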

Mike Dunlavey