29

The question gives all necessary data: what is an efficient algorithm to generate a sequence of K non-repeating integers within a given interval [0,N-1]. The trivial algorithm (generating random numbers and, before adding them to the sequence, looking them up to see if they were already there) is very expensive if K is large and near enough to N.

The algorithm provided in Efficiently selecting a set of random elements from a linked list seems more complicated than necessary, and requires some implementation. I've just found another algorithm that seems to do the job fine, as long as you know all the relevant parameters, in a single pass.

Community
  • 1
  • 1
tucuxi
  • 17,561
  • 2
  • 43
  • 74
  • Wait, if you already found another algorithm, what's the question? – Dark Shikari Oct 01 '08 at 17:22
  • 1
    such a neat algorithm! had to share it with someone - and it seems to be recommended behavior according to the http://stackoverflow.com/faq: "It's also perfectly fine to ask and answer your own programming question, but pretend you're on Jeopardy – tucuxi Oct 01 '08 at 17:52
  • 1
    Answer to this looks the best to me. http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-random-combination-of-values – Fakrudeen Mar 22 '10 at 10:16
  • @tucuxi I got a carte blanche to narrow the scope at http://meta.stackoverflow.com/questions/334325/a-few-intersecting-questions-about-picking-k-elements-of-n . Admittedly, I should have mentioned this in edit summary. – ivan_pozdeev Nov 05 '16 at 21:38

13 Answers13

13

In The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition, Knuth describes the following selection sampling algorithm:

Algorithm S (Selection sampling technique). To select n records at random from a set of N, where 0 < n ≤ N.

S1. [Initialize.] Set t ← 0, m ← 0. (During this algorithm, m represents the number of records selected so far, and t is the total number of input records that we have dealt with.)

S2. [Generate U.] Generate a random number U, uniformly distributed between zero and one.

S3. [Test.] If (N – t)U ≥ n – m, go to step S5.

S4. [Select.] Select the next record for the sample, and increase m and t by 1. If m < n, go to step S2; otherwise the sample is complete and the algorithm terminates.

S5. [Skip.] Skip the next record (do not include it in the sample), increase t by 1, and go back to step S2.

An implementation may be easier to follow than the description. Here is a Common Lisp implementation that select n random members from a list:

(defun sample-list (n list &optional (length (length list)) result)
  (cond ((= length 0) result)
        ((< (* length (random 1.0)) n)
         (sample-list (1- n) (cdr list) (1- length)
                      (cons (car list) result)))
        (t (sample-list n (cdr list) (1- length) result))))

And here is an implementation that does not use recursion, and which works with all kinds of sequences:

(defun sample (n sequence)
  (let ((length (length sequence))
        (result (subseq sequence 0 n)))
    (loop
       with m = 0
       for i from 0 and u = (random 1.0)
       do (when (< (* (- length i) u) 
                   (- n m))
            (setf (elt result m) (elt sequence i))
            (incf m))
       until (= m n))
    result))
Community
  • 1
  • 1
Vebjorn Ljosa
  • 17,438
  • 13
  • 70
  • 88
  • 1
    Thanks for the authoritative answer. I have the same requirement, and this is the algo I'm planning to implement. Thanks again. – Sundar R Nov 04 '09 at 20:24
12

The random module from Python library makes it extremely easy and effective:

from random import sample
print sample(xrange(N), K)

sample function returns a list of K unique elements chosen from the given sequence.
xrange is a "list emulator", i.e. it behaves like a list of consecutive numbers without creating it in memory, which makes it super-fast for tasks like this one.

Dzinx
  • 55,586
  • 10
  • 60
  • 78
  • 6
    The python implementation is quite nice (see http://svn.python.org/view/python/trunk/Lib/random.py?view=markup, search for "sample"). They distinguish two cases, one for large K (K near N) and one for small K. For large K, they selectively copy elements over. For small K, they draw elements randomly, avoiding repetitions using a set. – tucuxi Mar 30 '10 at 02:45
  • 3
    This is inefficient in memory for large sequences. – Jonathan Hartley Aug 12 '15 at 11:30
  • https://hg.python.org/cpython/file/tip/Lib/random.py is the new source link. – ivan_pozdeev Sep 07 '16 at 17:30
  • Why not just [`random.shuffle`](https://docs.python.org/3/library/random.html#random.shuffle)? – Tobias Kienzler Dec 01 '16 at 09:19
  • Answer lacks an explanation - see Jonathans Hartley's comment. – Imago Sep 23 '17 at 19:10
5

It is actually possible to do this in space proportional to the number of elements selected, rather than the size of the set you're selecting from, regardless of what proportion of the total set you're selecting. You do this by generating a random permutation, then selecting from it like this:

Pick a block cipher, such as TEA or XTEA. Use XOR folding to reduce the block size to the smallest power of two larger than the set you're selecting from. Use the random seed as the key to the cipher. To generate an element n in the permutation, encrypt n with the cipher. If the output number is not in your set, encrypt that. Repeat until the number is inside the set. On average you will have to do less than two encryptions per generated number. This has the added benefit that if your seed is cryptographically secure, so is your entire permutation.

I wrote about this in much more detail here.

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
  • Nice article. But, doesn't "XOR folding" destroy uniqueness? Sure, x != y implies encipher(x) != encipher(y) for decoding to work, but using e.g. (encipher(x) >> 4) ^ (encipher(x) & MASK) instead could "collapse" different x values to the same code -- so your "permutation" might contain repeats. – j_random_hacker Jan 21 '09 at 12:58
  • I don't have the theoretical basis to hand, but no, it doesn't destroy the 1-to-1 mapping properties of the block cipher. Xor folding is taken from the TEA cipher - perhaps check references on that for more detail. – Nick Johnson Jan 23 '09 at 19:27
  • @j_random_hacker: Of course, you're right. But it's nevertheless possible to come up with a pseudo random permutation using a custom Feistel cipher using the some cryptographic hash function as function F. – sellibitze Jun 22 '10 at 14:07
  • see here: http://stackoverflow.com/questions/196017/unique-random-numbers-in-o1/3094476#3094476 – sellibitze Jun 22 '10 at 15:13
  • For anyone reading this today, while this method sounds like it could be better, the `sample` method from `random` used with `range` is (in my experiments) actually faster than TEA even if you only use a single cycle. Also, I did occasionally get duplicates when using only `v0` as the output. For that experiment, I created a TEA based number generator and initialized and computed 10.000 sets of 2048 numbers and had 6 cases where it generated a duplicate. Maybe multiple cycles would help but even for one cycle it is already slower than `random.sample` which also guarantees unique numbers. – Stefan Fabian Feb 13 '21 at 11:30
3

The following code (in C, unknown origin) seems to solve the problem extremely well:

 /* generate N sorted, non-duplicate integers in [0, max] */
 int *generate(int n, int max) {
    int i, m, a;    
    int *g = (int *)calloc(n, sizeof(int));
    if (!g) return 0;

    m = 0;
    for (i = 0; i < max; i++) {
        a = random_in_between(0, max - i);
        if (a < n - m) {
            g[m] = i;
            m++;
        }
    }
    return g;
 }

Does anyone know where I can find more gems like this one?

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
tucuxi
  • 17,561
  • 2
  • 43
  • 74
  • 3
    Programming Pearls by Jon Bentley (the pun on "gems" was intentional). :) – Bill the Lizard Oct 01 '08 at 17:30
  • What does "random_in_between" stands for? – Luis Filipe Oct 01 '08 at 17:30
  • 2
    This algorithm is terribly inefficient for small sample chosen from a large set. Picking 5 integers from a million takes one million calls to rand() instead of 5. – Rafał Dowgird Oct 01 '08 at 17:35
  • Thanks for the booktitle - I couldn't think of any other way to find it. Luis, random_in_between is for 'number between lo and hi, not including hi'. Praptak, perfectly true. Should have specified 'memory efficiency' versus 'time efficiency'. At least it's guaranteed to finish in bounded time... – tucuxi Oct 01 '08 at 17:56
  • This is the Knuth's algorithm also described in [another answer](http://stackoverflow.com/a/187952/648265). – ivan_pozdeev Sep 05 '16 at 17:00
  • @RafałDowgird this could be improved by adding an early exit: `for (i=0; i – ivan_pozdeev Sep 07 '16 at 17:06
  • Note this answer returns a sorted list, which was not a requirement of the initial question. – cmcginty Sep 13 '16 at 00:44
2

Generate an array 0...N-1 filled a[i] = i.

Then shuffle the first K items.

Shuffling:

  • Start J = N-1
  • Pick a random number 0...J (say, R)
  • swap a[R] with a[J]
    • since R can be equal to J, the element may be swapped with itself
  • subtract 1 from J and repeat.

Finally, take K last elements.

This essentially picks a random element from the list, moves it out, then picks a random element from the remaining list, and so on.

Works in O(K) and O(N) time, requires O(N) storage.

The shuffling part is called Fisher-Yates shuffle or Knuth's shuffle, described in the 2nd volume of The Art of Computer Programming.

ivan_pozdeev
  • 33,874
  • 19
  • 107
  • 152
James Curran
  • 101,701
  • 37
  • 181
  • 258
1

Speed up the trivial algorithm by storing the K numbers in a hashing store. Knowing K before you start takes away all the inefficiency of inserting into a hash map, and you still get the benefit of fast look-up.

Bill the Lizard
  • 398,270
  • 210
  • 566
  • 880
  • 1
    Yeah, that the way I did it when I needed 10M nonrepeating random numbers for a lottery – axk Oct 01 '08 at 17:27
  • Not too memory-efficient - need a K-sized auxiliary structure. In time, you need K insertions and N removals. The algorithm I found needs only (at most) K random draws. – tucuxi Oct 01 '08 at 17:39
  • You don't need an auxiliary structure at all. Just make the map your only structure. You'll always need K insertions to store K items. Why do you need N removals? – Bill the Lizard Oct 01 '08 at 18:05
  • Inserting into and checking the K-sized data structure isn't where problem with the trivial algo is, it's that as K -> N, your RNG will have a very high probability of generating a number you've already seen before when filling the end of the sequence. You need a hash map, but thats auxilliary. – Greg Rogers Oct 07 '08 at 18:27
1

My solution is C++ oriented, but I'm sure it could be translated to other languages since it's pretty simple.

  • First, generate a linked list with K elements, going from 0 to K
  • Then as long as the list isn't empty, generate a random number between 0 and the size of the vector
  • Take that element, push it into another vector, and remove it from the original list

This solution only involves two loop iterations, and no hash table lookups or anything of the sort. So in actual code:

// Assume K is the highest number in the list
std::vector<int> sorted_list;
std::vector<int> random_list;

for(int i = 0; i < K; ++i) {
    sorted_list.push_back(i);
}

// Loop to K - 1 elements, as this will cause problems when trying to erase
// the first element
while(!sorted_list.size() > 1) {
    int rand_index = rand() % sorted_list.size();
    random_list.push_back(sorted_list.at(rand_index));
    sorted_list.erase(sorted_list.begin() + rand_index);
}                 

// Finally push back the last remaining element to the random list
// The if() statement here is just a sanity check, in case K == 0
if(!sorted_list.empty()) {
    random_list.push_back(sorted_list.at(0));
}
Nik Reiman
  • 39,067
  • 29
  • 104
  • 160
1

Step 1: Generate your list of integers.
Step 2: Perform Knuth Shuffle.

Note that you don't need to shuffle the entire list, since the Knuth Shuffle algorithm allows you to apply only n shuffles, where n is the number of elements to return. Generating the list will still take time proportional to the size of the list, but you can reuse your existing list for any future shuffling needs (assuming the size stays the same) with no need to preshuffle the partially shuffled list before restarting the shuffling algorithm.

The basic algorithm for Knuth Shuffle is that you start with a list of integers. Then, you swap the first integer with any number in the list and return the current (new) first integer. Then, you swap the second integer with any number in the list (except the first) and return the current (new) second integer. Then...etc...

This is an absurdly simple algorithm, but be careful that you include the current item in the list when performing the swap or you will break the algorithm.

Brian
  • 25,523
  • 18
  • 82
  • 173
0

This Ruby code showcases the Reservoir Sampling, Algorithm R method. In each cycle, I select n=5 unique random integers from [0,N=10) range:

t=0
m=0
N=10
n=5
s=0
distrib=Array.new(N,0)
for i in 1..500000 do
 t=0
 m=0
 s=0
 while m<n do

  u=rand()
  if (N-t)*u>=n-m then
   t=t+1
  else 
   distrib[s]+=1
   m=m+1
   t=t+1
  end #if
  s=s+1
 end #while
 if (i % 100000)==0 then puts i.to_s + ". cycle..." end
end #for
puts "--------------"
puts distrib

output:

100000. cycle...
200000. cycle...
300000. cycle...
400000. cycle...
500000. cycle...
--------------
250272
249924
249628
249894
250193
250202
249647
249606
250600
250034

all integer between 0-9 were chosen with nearly the same probability.

It's essentially Knuth's algorithm applied to arbitrary sequences (indeed, that answer has a LISP version of this). The algorithm is O(N) in time and can be O(1) in memory if the sequence is streamed into it as shown in @MichaelCramer's answer.

Community
  • 1
  • 1
Konstantin
  • 2,983
  • 3
  • 33
  • 55
  • You should be measuring the probability of each complete permutation instead of individual numbers to actually show the method's quality - otherwise, you only show the randomness of number set selection, not of their order. – ivan_pozdeev Nov 25 '16 at 16:59
0

The Reservoir Sampling version is pretty simple:

my $N = 20;
my $k;
my @r;

while(<>) {
  if(++$k <= $N) {
    push @r, $_;
  } elsif(rand(1) <= ($N/$k)) {
    $r[rand(@r)] = $_;
  }
}

print @r;

That's $N randomly selected rows from STDIN. Replace the <>/$_ stuff with something else if you're not using rows from a file, but it's a pretty straightforward algorithm.

Michael Cramer
  • 5,080
  • 1
  • 20
  • 16
0

If the list is sorted, for example, if you want to extract K elements out of N, but you do not care about their relative order, an efficient algorithm is proposed in the paper An Efficient Algorithm for Sequential Random Sampling (Jeffrey Scott Vitter, ACM Transactions on Mathematical Software, Vol. 13, No. 1, March 1987, Pages 56-67.).

edited to add the code in c++ using boost. I've just typed it and there might be many errors. The random numbers come from the boost library, with a stupid seed, so don't do anything serious with this.

/* Sampling according to [Vitter87].
 * 
 * Bibliography
 * [Vitter 87]
 *   Jeffrey Scott Vitter, 
 *   An Efficient Algorithm for Sequential Random Sampling
 *   ACM Transactions on MAthematical Software, 13 (1), 58 (1987).
 */

#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <string>
#include <iostream>

#include <iomanip>

#include <boost/random/linear_congruential.hpp>
#include <boost/random/variate_generator.hpp>
#include <boost/random/uniform_real.hpp>

using namespace std;

// This is a typedef for a random number generator.
// Try boost::mt19937 or boost::ecuyer1988 instead of boost::minstd_rand
typedef boost::minstd_rand base_generator_type;

    // Define a random number generator and initialize it with a reproducible
    // seed.
    // (The seed is unsigned, otherwise the wrong overload may be selected
    // when using mt19937 as the base_generator_type.)
    base_generator_type generator(0xBB84u);
    //TODO : change the seed above !
    // Defines the suitable uniform ditribution.
    boost::uniform_real<> uni_dist(0,1);
    boost::variate_generator<base_generator_type&, boost::uniform_real<> > uni(generator, uni_dist);



void SequentialSamplesMethodA(int K, int N) 
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method A.
    {
    int top=N-K, S, curr=0, currsample=-1;
    double Nreal=N, quot=1., V;

    while (K>=2)
        {
        V=uni();
        S=0;
        quot=top/Nreal;
        while (quot > V)
            {
            S++; top--; Nreal--;
            quot *= top/Nreal;
            }
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        Nreal--; K--;curr++;
        }
    // special case K=1 to avoid overflow
    S=floor(round(Nreal)*uni());
    currsample+=1+S;
    cout << curr << " : " << currsample << "\n";
    }

void SequentialSamplesMethodD(int K, int N)
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method D. 
    {
    const int negalphainv=-13; //between -20 and -7 according to [Vitter87]
    //optimized for an implementation in 1987 !!!
    int curr=0, currsample=0;
    int threshold=-negalphainv*K;
    double Kreal=K, Kinv=1./Kreal, Nreal=N;
    double Vprime=exp(log(uni())*Kinv);
    int qu1=N+1-K; double qu1real=qu1;
    double Kmin1inv, X, U, negSreal, y1, y2, top, bottom;
    int S, limit;
    while ((K>1)&&(threshold<N))
        {
        Kmin1inv=1./(Kreal-1.);
        while(1)
            {//Step D2: generate X and U
            while(1)
                {
                X=Nreal*(1-Vprime);
                S=floor(X);
                if (S<qu1) {break;}
                Vprime=exp(log(uni())*Kinv);
                }
            U=uni();
            negSreal=-S;
            //step D3: Accept ?
            y1=exp(log(U*Nreal/qu1real)*Kmin1inv);
            Vprime=y1*(1. - X/Nreal)*(qu1real/(negSreal+qu1real));
            if (Vprime <=1.) {break;} //Accept ! Test [Vitter87](2.8) is true
            //step D4 Accept ?
            y2=0; top=Nreal-1.;
            if (K-1 > S)
                {bottom=Nreal-Kreal; limit=N-S;}
            else {bottom=Nreal+negSreal-1.; limit=qu1;}
            for(int t=N-1;t>=limit;t--)
                {y2*=top/bottom;top--; bottom--;}
            if (Nreal/(Nreal-X)>=y1*exp(log(y2)*Kmin1inv))
                {//Accept !
                Vprime=exp(log(uni())*Kmin1inv);
                break;
                }
            Vprime=exp(log(uni())*Kmin1inv);
            }
        // Step D5: Select the (S+1)th record
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        curr++;
        N-=S+1; Nreal+=negSreal-1.;
        K-=1; Kreal-=1; Kinv=Kmin1inv;
        qu1-=S; qu1real+=negSreal;
        threshold+=negalphainv;
        }
    if (K>1) {SequentialSamplesMethodA(K, N);}
    else {
        S=floor(N*Vprime);
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        }
    }


int main(void)
    {
    int Ntest=10000000, Ktest=Ntest/100;
    SequentialSamplesMethodD(Ktest,Ntest);
    return 0;
    }

$ time ./sampling|tail

gives the following ouptut on my laptop

99990 : 9998882
99991 : 9998885
99992 : 9999021
99993 : 9999058
99994 : 9999339
99995 : 9999359
99996 : 9999411
99997 : 9999427
99998 : 9999584
99999 : 9999745

real    0m0.075s
user    0m0.060s
sys 0m0.000s
-1

Here's a way to do it in O(N) without extra storage. I'm pretty sure this is not a purely random distribution, but it's probably close enough for many uses.

/* generate N sorted, non-duplicate integers in [0, max[  in O(N))*/
 int *generate(int n, int max) {
    float step,a,v=0;
    int i;    
    int *g = (int *)calloc(n, sizeof(int));
    if ( ! g) return 0;

    for (i=0; i<n; i++) {
        step = (max-v)/(float)(n-i);
        v+ = floating_pt_random_in_between(0.0, step*2.0);
        if ((int)v == g[i-1]){
          v=(int)v+1;             //avoid collisions
        }
        g[i]=v;
    }
    while (g[i]>max) {
      g[i]=max;                   //fix up overflow
      max=g[i--]-1;
    }
    return g;
 }
AShelly
  • 34,686
  • 15
  • 91
  • 152
-3

This is Perl Code. Grep is a filter, and as always I didn't test this code.

@list = grep ($_ % I) == 0, (0..N);
  • I = interval
  • N = Upper Bound

Only get numbers that match your interval via the modulus operator.

@list = grep ($_ % 3) == 0, (0..30);

will return 0, 3, 6, ... 30

This is pseudo Perl code. You may need to tweak it to get it to compile.

J.J.
  • 4,856
  • 1
  • 24
  • 29