How to generate a list of ascending random integers

Question

I have an external collection of n elements, from which I want to select some number (k) at random, outputting the indices of the selected elements to a serialized data file. I want the indices to be output in strict ascending order, with no duplicates. Both n and k may be quite large, and it is generally not feasible to store entire arrays of that size in memory.

The first algorithm I came up with was to pick a random number r[0] from 1 to n-k... and then pick successive random numbers r[i] from r[i-1]+1 to n-k+i, only needing to store two entries of 'r' at any one time. However, a fairly simple analysis reveals that the probability of selecting small numbers is inconsistent with what it would be if the entire set were sampled uniformly. For example, if n were a billion and k were half a billion, the probability of selecting the first entry with the approach I've just described is tiny (1 in half a billion), whereas in actuality, since half of the entries are being selected, the first should be selected 50% of the time. Even if I use external sorting to sort k random numbers, I would have to discard any duplicates and try again. As k approaches n, the number of retries would continue to grow, with no guarantee of termination.

I would like to find an O(k) or O(k log k) algorithm to do this, if it is at all possible. The implementation language I will be using is C++11, but descriptions in pseudocode may still be helpful.

Generate the random integers as usual (using `std::mt19937` and a `std::uniform_int_distribution` for example) and store the results in a `std::set` such that there are no duplicates and the resulting container is sorted inherently. — sjrowlinson, Jun 07 '16 at 22:44
Is it always necessary to select exactly k elements? Or is it acceptable for the average count of many runs to tend toward k? If the latter, then simply add RND(0, 2n/k) to each previous entry until you reach the end of the list. — Ryan Bemrose, Jun 07 '16 at 22:57
Always ascending. No storing. No duplication. That's a hard thing to do. I'm going to have to think on whether or not this is even possible. — user4581301, Jun 07 '16 at 23:41
See [J.S. Vitter, An Efficient Algorithm for Random Sampling, 1987](http://www.ittc.ku.edu/~jsv/Papers/catalog/5SAMPLING_HISTOGRAMS.html#Vit87:RandomSampling). It's not hard to find implementations through a web search. — rici, Jun 08 '16 at 04:18

Vlad Shcherbina · Answer 1 · 2016-06-08T21:18:39.150

4

If in practice k has the same order of magnitude as n, perhaps a very straightforward O(n) algorithm will suffice:

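assert(k <= n);
// `engine` is assumed to be a seeded generator such as std::mt19937
std::uniform_real_distribution<> rnd;  // uniform doubles in [0, 1)
for (int i = 0; i < n; i++) {
    // select index i with probability k / (n - i), where k counts the picks still needed
    if (rnd(engine) * (n - i) < k) {
        std::cout << i << '\n';
        k--;
    }
}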

It produces all ascending sequences with equal probability.
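To make the loop above easier to try out, here is a minimal self-contained sketch of the same idea. The function name `sample_ascending`, the 64-bit index type, and the choice to collect results in a vector (purely for demonstration; a real run would stream them out) are assumptions of this sketch, not part of the original answer.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: returns k distinct indices from [0, n) in ascending order.
// Runs in O(n) time; the vector is only here so the result can be inspected.
std::vector<std::int64_t> sample_ascending(std::int64_t n, std::int64_t k,
                                           std::mt19937_64& engine) {
    assert(0 <= k && k <= n);
    std::vector<std::int64_t> out;
    std::uniform_real_distribution<double> rnd(0.0, 1.0);
    for (std::int64_t i = 0; i < n && k > 0; ++i) {
        // Select index i with probability k / (n - i),
        // where k is the number of picks still needed.
        if (rnd(engine) * static_cast<double>(n - i) < static_cast<double>(k)) {
            out.push_back(i);
            --k;
        }
    }
    return out;
}
```

Calling `sample_ascending(20, 5, engine)` with a seeded `std::mt19937_64` should yield five increasing indices between 0 and 19. Replacing `out.push_back(i)` with a write to the output file keeps the extra memory at O(1).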

edited Jun 08 '16 at 21:18

answered Jun 07 '16 at 23:53

How do you guarantee that this selects exactly `k` items? – Jordan Melo Jun 08 '16 at 20:38
Thank you, I noticed a mistake while answering (should be `rnd * (n - i) < k` instead of `<=`). It can't output more than k items, because if remaining k = 0, the branch is never triggered. It can't output less than k items, because when n - i = k (last chance to generate enough items in time), the branch is guaranteed to be triggered. – Vlad Shcherbina Jun 08 '16 at 21:18

sjrowlinson · Answer 2 · 2016-06-08T21:24:40.507

2

As mentioned in my comment, use a std::set<int> to store the randomly generated integers such that the resulting container is inherently sorted and contains no duplicates. Example code snippet:

#include <random>
#include <set>

int main(void) {
    std::set<int> random_set;
    std::random_device rd;
    std::mt19937 mt_eng(rd());
    // min and max of random set range
    const int m = 0; // min
    const int n = 100; // max
    std::uniform_int_distribution<> dist(m,n);

    // number to generate
    const int k = 50;
    for (int i = 0; i < k; ++i) {
        // only non-previously occurring values will be inserted
        if (!random_set.insert(dist(mt_eng)).second)
            --i;
    }
}

edited Jun 08 '16 at 21:24

answered Jun 07 '16 at 22:56

This doesn't seem to guarantee that random_set will contain 50 elements... What's the difference to the second algorithm OP is describing? – Stefan Haustein Jun 07 '16 at 23:25
@StefanHaustein Fixed the first issue. – sjrowlinson Jun 07 '16 at 23:35
That's a good `k log k` solution. You could keep the variable naming consistent with question. I believe your `max` is `n` and `n` is `k`. – luk32 Jun 07 '16 at 23:57
Theoretically could this run forever if it kept generating the same random number in the loop? – sashang Jun 08 '16 at 04:29
@sashang Technically yes, and this would be more of a problem for small ranges of generation - however, even in such a case the probability of continuously generating the same number over and over again would approach zero over longer times. In other words: is it possible? Yes. Is it probable? No. – sjrowlinson Jun 08 '16 at 13:52

samgak · Accepted Answer · 2016-06-09T23:03:32.137

You can solve this recursively in O(k log k) if you partition in the middle of your range, and randomly sample from the hypergeometric probability distribution to choose how many values lie above and below the middle point (i.e. the values of k for each subsequence), then recurse for each:

int sample_hypergeometric(int n, int K, int N) // samples hypergeometric distribution and
// returns number of "successes" where there are n draws without replacement from
// a population of N with K possible successes.
// Something similar to scipy.stats.hypergeom.rvs in Python.
// In this case, "success" means the selected value lying below the midpoint. 
{
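     static std::default_random_engine generator(std::random_device{}()); // seeded once; a default-constructed local engine would replay the same sequence on every call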
     std::uniform_real_distribution<double> distribution(0.0,1.0);

     int successes = 0;
     for(int trial = 0; trial < n; trial++)
     {
         if((int)(distribution(generator) * N) < K)
         {
             successes++;
             K--;
         }
         N--;
     }
     return successes;
}

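// Outputs k distinct values from start+1 .. start+n in ascending order.
// Requires <iostream> and <random>.
void select_k_from_n(int start, int k, int n)
{
    if(k == 0)
        return;
    if(k == 1)
    {
        // pick a single value uniformly from start+1 .. start+n
        static std::default_random_engine generator(std::random_device{}());
        std::cout << start + std::uniform_int_distribution<int>(1, n)(generator) << '\n';
        return;
    }

    // find the number of results below the mid-point:
    int k1 = sample_hypergeometric(k, n >> 1, n);
    select_k_from_n(start, k1, n >> 1);
    select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
}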

Sampling from the binomial distribution could also be used to approximate the hypergeometric distribution with p = (n >> 1) / n, rejecting samples where k1 > (n >> 1).
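For reference, the distribution being sampled can be written out explicitly. This is a sketch in which m = floor(n/2) is shorthand introduced here for `n >> 1`, and j is a candidate value of k1:

```latex
P(k_1 = j) = \frac{\binom{m}{j}\,\binom{n-m}{k-j}}{\binom{n}{k}},
\qquad \max\bigl(0,\; k-(n-m)\bigr) \le j \le \min(k,\; m),
\qquad m = \lfloor n/2 \rfloor
```

The numerator counts the k-subsets that have exactly j elements in the lower half, which is why drawing k1 this way and then recursing on each half independently should give every k-subset the same probability.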

I am sorry, but I do not have any idea how to generate random numbers in a hypergeometric probability distribution. Would you be able to elaborate on this post by defining sample_hypergeometric in terms of either a uniform distribution, or else in terms of one of the other already-existing random number distributions in C++11 (http://en.cppreference.com/w/cpp/numeric/random)? Thanks so much. — markt1964, Jun 09 '16 at 21:06
@markt1964 I've added some code for the random number generation (untested) — samgak, Jun 09 '16 at 23:05
Thanks. Is it possible to define the sample_hypergeometric using closed form functions only, or does it require that for-loop? — markt1964, Jun 11 '16 at 06:04
I don't think a closed form solution is possible, but there are faster implementations, check out the numpy source: https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c (search for `rk_hypergeometric`) — samgak, Jun 11 '16 at 07:49

Jim Mischel · Answer 4 · 2016-06-08T01:12:10.947

1

Assuming that you can't store k random numbers in memory, you'll have to generate the numbers in strict ascending order. One way to do it would be to generate a number between 0 and n/k. Call that number x. The next number you have to generate is between x+1 and x + r/(k-1), where r is the number of values remaining above x. Continue in that fashion until you've selected k numbers.

Basically, you're dividing the remaining range by the number of values left to generate, and then generating a number in the first section of that range.

An example. You want to generate 3 numbers between 0 and 99, inclusive. So you first generate a number between 0 and 33. Say you pick 10.

So now you need a number between 11 and 99. The remaining range consists of 89 values, and you have two values left to pick. So, 89/2 = 44. You need a number between 11 and 54. Say you pick 36.

Your remaining range is from 37 to 99, and you have one number left to choose. So pick a number at random between 37 and 99.

This won't give you a uniform distribution, as once you choose a number it's impossible to get a number less than that in a subsequent choice. But it might be good enough for your purposes.

This pseudocode shows the basic idea.

pick_k_from_n(n, k)
{
    num_left = k
    last_k = 0;
    while num_left > 0
    {
        // divide the remaining range into num_left partitions
        range_size = (n - last_k) / num_left
        // pick a number in the first partition
        r = random(range_size) + last_k + 1
        output(r)
        last_k = r
        num_left = num_left - 1
    }
}

Note that this takes O(k) time and requires O(1) extra space.
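The pseudocode translates fairly directly into C++. The following is a rough sketch under a few assumptions: `random(range_size)` is taken to return a value in 0 .. range_size-1, outputs follow the pseudocode and land in 1 .. n (rather than 0 .. 99 as in the worked example), and the results are collected in a vector only so they can be inspected. The skew discussed above is inherited unchanged.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Sketch of the partition-based picker: k strictly ascending values in 1 .. n.
// Not uniform over all k-subsets; each pick is confined to the first
// partition of the remaining range.
std::vector<std::int64_t> pick_k_from_n(std::int64_t n, std::int64_t k,
                                        std::mt19937_64& engine) {
    assert(0 <= k && k <= n);
    std::vector<std::int64_t> out;
    std::int64_t last = 0;  // last value emitted (0 means none yet)
    for (std::int64_t num_left = k; num_left > 0; --num_left) {
        // Divide the remaining range into num_left partitions.
        // This stays >= 1 as long as k <= n held initially.
        std::int64_t range_size = (n - last) / num_left;
        // Pick a value in the first partition: last+1 .. last+range_size.
        std::uniform_int_distribution<std::int64_t> dist(1, range_size);
        last += dist(engine);
        out.push_back(last);
    }
    return out;
}
```

With k equal to n every partition has size 1, so the function returns 1, 2, ..., n, which is a quick sanity check that the range never runs out early.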

edited Jun 08 '16 at 01:12

answered Jun 07 '16 at 23:11

What do you do when x[i] == n before i=k? – user4581301 Jun 07 '16 at 23:13
Wouldn't this make a selection impossible where no index is below 33 (for your example) -- instead of just less likely? – Stefan Haustein Jun 07 '16 at 23:14
OP wants strict ascending order. This will provide it, at the noted cost of a skewed distribution, but it will fail if you pick the last number before the end of selection. – user4581301 Jun 07 '16 at 23:43
@user4581301 But you can't pick the last item before the end of selection. The next number you pick will always be in the range `[start_of_remaining..(remaining_range/number_left_to_pick)]`. – Jim Mischel Jun 08 '16 at 01:02
@StefanHaustein The OP said select indices from 1 to n, where `r[k] > r[k-1]`. He didn't say anything about any missing indexes. – Jim Mischel Jun 08 '16 at 01:06
My apologies. Somehow I didn't pick that up from the last time even though it was similar to what I was thinking--subdivide into discrete segments and RNG inside that segment. Your description is good, my reading wasn't. Maybe the difference between what I had in mind caused me to reject it. – user4581301 Jun 08 '16 at 01:24

Stefan Haustein · Answer 5 · 2016-06-07T23:45:28.457

Could you adjust each ascending index selection in a way that compensates for the probability distortion you are describing?

IANAS, but my guess would be that if you pick a random number r between 0 and 1 (that you'll scale to the full remaining index range after the adjustment), you might be able to adjust it by calculating r^(x) (keeping the range in 0..1, but increasing the probability of smaller numbers), with x selected by solving the equation for the probability of the first entry?

score 0 · Answer 6 · answered Jun 07 '16 at 23:40

You can do it in O(k) time with Floyd's algorithm (not Floyd-Warshall, that's a shortest path thing). The only data structure you need is a 1-bit table that will tell you whether or not a number has already been selected. Searching a hash table can be O(1), so this will not be a burden, and can be kept in memory even for very large n (if n is truly huge, you'll have to use a b-tree or bloom filter or something).

To select k items from among n:

for j = n-k+1 to n:
  select random x from 1 to j
  if x is already in hash:
    insert j into hash
  else
    insert x into hash

That's it. At the end, your hash table will contain a uniformly selected sample of k items from among n. Read them out in order (you may have to pick a type of hash table that allows that).
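As a rough illustration, the loop above might look like this in C++. One substitution to be aware of: this sketch uses `std::set` instead of a hash table so that iterating the result is already ascending, which makes the cost O(k log k) time rather than O(k). It still needs O(k) memory, so it only applies when k values fit in RAM. The name `floyd_sample` is made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <set>

// Sketch of Floyd's sampling: k distinct values from 1 .. n.
// std::set keeps them sorted, so iterating the result is ascending.
std::set<std::int64_t> floyd_sample(std::int64_t n, std::int64_t k,
                                    std::mt19937_64& engine) {
    assert(0 <= k && k <= n);
    std::set<std::int64_t> chosen;
    for (std::int64_t j = n - k + 1; j <= n; ++j) {
        std::uniform_int_distribution<std::int64_t> dist(1, j);
        std::int64_t x = dist(engine);
        // If x is already present, insert j instead. j cannot be present yet,
        // because every earlier draw was limited to values below j.
        if (!chosen.insert(x).second) {
            chosen.insert(j);
        }
    }
    return chosen;
}
```

Exactly one element is added per iteration, so the set always ends up with k elements and no retry loop is needed.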

Nice idea, although a Bloom filter won't work due to false positives. — Jim Mischel, Jun 08 '16 at 01:33
Yes, if the uniqueness constraint isn't strict, it might be useful. — Lee Daniel Crocker, Jun 08 '16 at 02:07

score 0 · Answer 7 · edited May 23 '17 at 11:52

Here's an O(k log k + √n)-time algorithm that uses O(√n) words of space. This can be generalized to an O(k + n^(1/c))-time, O(n^(1/c))-space algorithm for any integer constant c.

For intuition, imagine a simple algorithm that uses (e.g.) Floyd's sampling algorithm to generate k of n elements and then radix sorts them in base √n. Instead of remembering what the actual samples are, we'll do a first pass where we run a variant of Floyd's where we remember only the number of samples in each bucket. The second pass is, for each bucket in order, to randomly resample the appropriate number of elements from the bucket range. There's a short proof involving conditional probability that this gives a uniform distribution.

# untested Python code for illustration
# b is the number of buckets (e.g., b ~ sqrt(n))
import random
def first_pass(n, k, b):
    counts = [0] * b  # list of b zeros
    for j in range(n - k, n):
        t = random.randrange(j + 1)
        if t // b >= counts[t % b]:  # intuitively, "t is not in the set"
            counts[t % b] += 1
        else:
            counts[j % b] += 1
    return counts
