Efficiently generating unique pairs of integers

Question

In MATLAB, I would like to generate n pairs of random integers in the range [1, m], where each pair is unique. For uniqueness, I consider the order of the numbers in the pair to be irrelevant such that [3, 10] is equal to [10, 3]. Also, each pair should consist of two distinct integers; i.e. [3, 4] is ok but [3, 3] would be rejected. EDIT: Each possible pair should be chosen with equal likelihood.

(Obviously a constraint on the parameters is that n <= m(m-1)/2.)

I have been able to successfully do this when m is small, like so:

m = 500; n = 10;                   % setting parameters

A = ((1:m)'*ones(1, m));           % each column has the numbers 1 -> m
idxs1 = squareform(tril(A', -1))'; 
idxs2 = squareform(tril(A, -1))';   
all_pairs = [idxs1, idxs2];        % this contains all possible pairs

idx_to_use = randperm( size(all_pairs, 1), n );  % choosing random n pairs
pairs = all_pairs(idx_to_use, :)       

pairs =

   254   414
   247   334
   111   146
   207   297
    45   390
   229   411
     9    16
    75   395
    12   338
    25   442

However, the matrix A is of size m x m, meaning when m becomes large (e.g. upwards of 10,000), MATLAB runs out of memory.

I considered generating a load of random numbers randi(m, [n, 2]), and repeatedly rejecting the rows which repeated, but I was concerned about getting stuck in a loop when n was close to m(m-1)/2.

Is there an easier, cleaner way of generating unique pairs of distinct integers?

Hmmm, have you tried generating maybe a A with maybe `unique(round(rand(n+20,2)*m),'rows')`, testing if the length is at least `n` (and if not, then repeating the process), and then selecting the first `n` rows? This might be slower but it's worth a shot — JustinBlaber, Apr 03 '13 at 16:56
@jucestain yes, I did consider this - I mention this approach in the second-to-last paragraph. My concern is that if n is very large the algorithm will have to repeatedly loop until it has effectively found every pair by chance. — Bill Cheatham, Apr 03 '13 at 17:11

score 12 · Accepted Answer · 2013-04-04T15:29:40.557

Easy, peasy, when viewed in the proper way.

You wish to generate n pairs of integers, [p,q], such that p and q lie in the interval [1,m], and p

How many possible pairs are there? The total number of pairs is just m*(m-1)/2. (I.e., the sum of the numbers from 1 to m-1.)

So we could generate n random integers in the range [1,m*(m-1)/2]. Randperm does this nicely. (Older matlab releases do not allow the second argument to randperm.)

k = randperm(m/2*(m-1),n);

(Note that I've written this expression with m in a funny way, dividing by 2 in perhaps a strange place. This avoids precision problems for some values of m near the upper limits.)

Now, if we associate each possible pair [p,q] with one of the integers in k, we can work backwards, from the integers generated in k, to a pair [p,q]. Thus the first few pairs in that list are:

{[1,2], [1,3], [2,3], [1,4], [2,4], [3,4], ..., [m-1,m]}

We can think of them as the elements in a strictly upper triangular array of size m by m, thus those elements above the main diagonal.

q = floor(sqrt(8*(k-1) + 1)/2 + 1/2);
p = k - q.*(q-1)/2;

See that these formulas recover p and q from the unrolled elements in k. We can convince ourselves that this does indeed work, but perhaps a simple way here is just this test:

k = 1:21;
q = floor(sqrt(8*(k-1) + 1)/2 + 3/2);
p = k - (q-1).*(q-2)/2;
[k;p;q]'

ans =
     1     1     2
     2     1     3
     3     2     3
     4     1     4
     5     2     4
     6     3     4
     7     1     5
     8     2     5
     9     3     5
    10     4     5
    11     1     6
    12     2     6
    13     3     6
    14     4     6
    15     5     6
    16     1     7
    17     2     7
    18     3     7
    19     4     7
    20     5     7
    21     6     7

Another way of testing it is to show that all pairs get generated for a small case.

m = 5;
n = 10;
k = randperm(m/2*(m-1),n);
q = floor(sqrt(8*(k-1) + 1)/2 + 3/2);
p = k - (q-1).*(q-2)/2;

sortrows([p;q]',[2 1])
ans =
     1     2
     1     3
     2     3
     1     4
     2     4
     3     4
     1     5
     2     5
     3     5
     4     5

Yup, it looks like everything works perfectly. Now try it for some large numbers for m and n to test the time used.

tic
m = 1e6;
n = 100000;
k = randperm(m/2*(m-1),n);
q = floor(sqrt(8*(k-1) + 1)/2 + 3/2);
p = k - (q-1).*(q-2)/2;
toc

Elapsed time is 0.014689 seconds.

This scheme will work for m as large as roughly 1e8, before it fails due to precision errors in double precision. The exact limit should be m no larger than 134217728 before m/2*(m-1) exceeds 2^53. A nice feature is that no rejection for repeat pairs need be done.

Very clever, well done. Deterministically mapping random integers to pairs of numbers is a neat trick. Thanks. — Bill Cheatham, Apr 03 '13 at 23:27
Rereading it, I just noticed that [3,3] would be rejected in the original problem statement. So I'll revise things. The change is minor. — , Apr 04 '13 at 14:46

score 1 · Answer 2 · answered Apr 03 '13 at 16:54

1

This is more of a general approach rather then a matlab solution.

How about you do the following first you fill a vector like the following.

x[n] = rand()
x[n + 1] = x[n] + rand() %% where rand can be equal to 0.

Then you do the following again

x[n][y] = x[n][y] + rand() + 1

And if

x[n] == x[n+1]

You would make sure that the same pair is not already selected.

After you are done you can run a permutation algorithm on the matrix if you want them to be randomly spaced.

This approach will give you all the possibility or 2 integer pairs, and it runs in O(n) where n is the height of the matrix.

answered Apr 03 '13 at 16:54

Max Doumit

1,065
12
33

Thanks, this sounds promising but I don't quite understand the notation. Is `x` a vector or a two-dimensional matrix? What are `n` and `y`? Thanks. – Bill Cheatham Apr 03 '13 at 17:08
So i come from a C++ background. x[n] is a n x 1 matrix, and x[n][y] is a n x y matric, a vector is a n x 1 matrix. – Max Doumit Apr 03 '13 at 17:12
Ok thanks. I think the step of making sure each pair is not already selected could cause this to be slow when `n` is close to `m(m-1)/2` as I have stated above. – Bill Cheatham Apr 03 '13 at 17:19
Actually it runs in O(n) so you just check the entry that is above you once for each. So for each entry you just just check x[n+1] you do not need to check the full list. – Max Doumit Apr 03 '13 at 17:22
Ok yes I see now thanks. I think this will work but I would ultimately like each possible pair to be selected with equal probability; I suppose this doesn't do this? – Bill Cheatham Apr 03 '13 at 17:24
Theoretically every pair is possible with this approach since [ 3 , 4 ] is the same as [ 4 , 3 ]. This is why i am increasing the first the column. And since rand() can be equal to zero the following pairs arr possible [ 3 , 4 ] [ 3 , 5 ]. The reason why in the second column i am adding the value of the first is to avoid redundancy. So [3 , 4] can only happen on where the first column is equal to 3. If i want to generate the pair [ 3 , 2 ], that pair will be generated as [ 2 , 3 ] where the column is equal to 2. – Max Doumit Apr 03 '13 at 17:31

Roney Michael · Answer 3 · 2013-04-04T03:40:20.337

1

The following code does what you need:

n = 10000;
m = 500;
my_list = unique(sort(round(rand(n,2)*m),2),'rows');
my_list = my_list(find((my_list(:,1)==my_list(:,2))==0),:);
%temp = my_list;    %In case you want to check what you initially generated.
while(size(my_list,1)~=n)
    %my_list = unique([my_list;sort(round(rand(1,2)*m),2)],'rows');
    %Changed as per @jucestain's suggestion.
    my_list = unique([my_list;sort(round(rand((n-size(my_list,1)),2)*m),2)],'rows');
    my_list = my_list(find((my_list(:,1)==my_list(:,2))==0),:);
end

edited Apr 04 '13 at 03:40

answered Apr 03 '13 at 17:15

Roney Michael

3,964
5
30
45

Thanks, this works well when `n << m(m-1)/2`, but when I set `n = m(m-1)/2 - 1` this seems to run forever, whereas my original approach takes just a few seconds. Maybe I need two different approaches depending on the values of `n` and `m`... – Bill Cheatham Apr 03 '13 at 17:18
@BillCheatham: Maybe so. The smaller the value of `n` and the larger the value of `m`, the better will this work. – Roney Michael Apr 03 '13 at 17:22
1

If I understand correctly it's taking forever because you're simply using `rand(n,2)` instead of like `rand(n+padding,2)` as I suggested in my comments. The padding will decrease the likelihood of having to recalculate `my_list` again. Afterwards just select the first n rows of my_list... This is a simple way to do it but there's probably a more mathematically sound and direct way to implement this. – JustinBlaber Apr 03 '13 at 19:01

Efficiently generating unique pairs of integers

3 Answers3

Linked