
I have the following problem:

Generate M uniformly random integers in the range [0, N], where N >> M*K, such that no pair has a difference of less than K.

At the moment the best method I can think of is to maintain a sorted list: for each newly generated integer, find its lower bound in the list, test it against the neighbouring elements on either side, and insert it if it is far enough from both. This has complexity O(M log M).

Would there happen to be a more efficient algorithm?

An example of the problem:

Generate 1000 uniformly random integers between zero and 100million where the difference between any two integers is no less than 1000
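The sorted-list method described above can be sketched in Python with `bisect` (a rough sketch; the function name is mine):

```python
import bisect
import random

def sorted_list_sample(N, M, K):
    """Rejection-style sampling with a sorted list, as described above."""
    picked = []
    while len(picked) < M:
        x = random.randint(0, N)
        i = bisect.bisect_left(picked, x)
        # reject x if it is closer than K to either neighbour
        if i > 0 and x - picked[i - 1] < K:
            continue
        if i < len(picked) and picked[i] - x < K:
            continue
        picked.insert(i, x)
    return picked
```

Note that per-element rejection like this does not by itself guarantee a uniform distribution over the valid solution sets, which is the point debated in the comments below.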

A comprehensive way to solve this would be to:

  1. Determine all of the n-choose-m combinations that satisfy the constraint; let's call this set X
  2. Select a uniformly random integer i in the range [0,|X|).
  3. Select the i'th combination from X as the result.

This solution is problematic when n-choose-m is large, as enumerating and storing all possible combinations is extremely costly. Hence an efficient online generation method is sought.
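For very small N and M, steps 1-3 above can be carried out literally. A brute-force Python sketch (the function name is mine; this is only a baseline for checking other methods, not for real inputs):

```python
import random
from itertools import combinations

def enumerate_and_pick(N, M, K):
    """Brute-force version of steps 1-3; only feasible for tiny N and M."""
    # step 1: enumerate every M-combination of [0, N] whose adjacent gaps are >= K
    X = [c for c in combinations(range(N + 1), M)
         if all(b - a >= K for a, b in zip(c, c[1:]))]
    # steps 2-3: pick one combination uniformly at random
    return random.choice(X) if X else None
```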

Note: The following is a C++ implementation of the solution provided by pentadecagon

#include <algorithm>
#include <random>
#include <vector>

std::vector<int> generate_random(const int n, const int m, const int k)
{
   // no valid result exists unless n >= (m - 1) * k
   if (n < (m - 1) * k)
      return std::vector<int>();

   std::random_device source;
   std::mt19937 generator(source());
   std::uniform_int_distribution<> distribution(0, n - (m - 1) * k);

   // draw m values from the shrunken range
   std::vector<int> result_list;
   result_list.reserve(m);

   for (int i = 0; i < m; ++i)
   {
      result_list.push_back(distribution(generator));
   }

   std::sort(std::begin(result_list), std::end(result_list));

   // shift the i-th value up by i*k to restore the gaps
   for (int i = 0; i < m; ++i)
   {
      result_list[i] += (i * k);
   }

   return result_list;
}

http://ideone.com/KOeR4R


Soda Coader
  • How should the distribution be? There is a fixed number of possible outcomes. Should all of these have an equal probability? – Vincent van der Weele Feb 24 '14 at 06:30
  • @Heuster: 'How should the distribution be?' A uniform distribution. – Soda Coader Feb 24 '14 at 06:33
  • I don't think your example is valid, since 1000 >> 1000 isn't true. – Douglas Leeder Feb 24 '14 at 06:42
  • In your example, M = 1000, N = 100Million and K = 1000. You say M >> K. I agree with Douglas Leeder. – Scott Mermelstein Feb 24 '14 at 06:56
  • @Scott Douglas I agree my mistake, I've corrected it. – Soda Coader Feb 24 '14 at 06:59
  • You could generate M numbers randomly and start over if they were not a solution, but this is efficient only when M^2 K doesn't significantly exceed N. Unfortunately, every answer posted right now, including yours, does not yield a uniform distribution on outcomes. (This is self-evident for the two answers; for yours, observe that, for 2 random integers in the range 0..4 with no difference less than 2, it over-samples the outcome {1, 3} at the expense of {0, 4}.) – David Eisenstat Feb 24 '14 at 14:05
  • (@DavidEisenstat Please correct me if I'm wrong, I think this should work:) Do a Fisher-Yates shuffle of an array containing the numbers between `0` and `N - (M - 1)*K + K`, take the last `K` numbers of the resulting array. This gives you a uniformly random subset of size `K` of the above interval. You can use this to build a `K + 1`-composition of the integer `N - (M - 1)*K` by using the `K`-subset as commas in a unary representation of `N - (M - 1)*K` (see [here](http://en.wikipedia.org/wiki/Composition_%28number_theory%29) for illustration). – G. Bach Feb 24 '14 at 19:16
  • To get the `i`th of your M values, you sum up the composition up to comma `i` and add `(i - 1) * K`; ignore what remains after the Kth comma. Since Fisher-Yates gives you a `K`-subset with uniform distribution and we get a bijection between the valid `(K + 1)`-compositions and solution sets, all solutions should be uniformly distributed as well. This will take `O(N)`. – G. Bach Feb 24 '14 at 19:17
  • I am not even sure that "Uniform Distribution" can be validly defined for a set of constraints like this. – RBarryYoung Feb 24 '14 at 19:53
  • @RBarryYoung The set of valid solutions is well-defined for all choices of N,M,K. You can always define a uniform distribution over a given (finite) set. – G. Bach Feb 24 '14 at 19:58
  • @G.Bach It might work, but then again there might be issues with the boundary conditions. Could you write it up carefully and post it as an answer? (By the way, there are more space-efficient ways to generate a random subset.) – David Eisenstat Feb 24 '14 at 20:28
  • @DavidEisenstat I described the construction slightly wrong, see my answer. – G. Bach Feb 24 '14 at 22:54

3 Answers


EDIT: I adapted the text for the requirement to create ordered sequences, each with the same probability.

Create random numbers a_i for i=0..M-1 without duplicates. Sort them. Then create numbers

b_i = a_i + i*(K-1)

Given the construction, those numbers b_i have the required gaps, because the a_i already have gaps of at least 1. To make sure the b values cover exactly the required range [1..N], the a_i must be picked from the range [1..N-(M-1)*(K-1)]. This way you get truly independent numbers; well, as independent as possible given the required gap. Because of the sorting you again get O(M log M) performance, but this shouldn't be too bad: sorting is typically very fast. In Python it looks like this:

import random
def random_list( N, M, K ):
    # draw M distinct values from the shrunken range
    s = set()
    while len(s) < M:
        s.add( random.randint( 1, N-(M-1)*(K-1) ) )

    res = sorted( s )

    # shift the i-th value up to restore gaps of at least K
    for i in range(M):
        res[i] += i * (K-1)

    return res
pentadecagon
  • Sorry for slandering this answer above. It looks correct to me now that I've read it carefully. – David Eisenstat Feb 24 '14 at 23:17
  • Thinking about it, I'm not so sure anymore that this produces a uniform distribution. This approach maps each sorted sequence (a_0,...,a_(M-1)) to a solution set. In order to get the solution set (0,K,2K,...,(M-1)K) you need to draw the sequence (0,...,0), which has probability (N-(M-1)*K)^(-M). Now take for example the result of the sequence (1,1,2,3,...,M-1). The probability of getting that sequence is at least twice as big as that for (0,...,0) since you can draw (1,2,...,M-1,1) and (1,1,2,...,M-1) before sorting, for example. Should this not give something more like a normal distribution? – G. Bach Feb 25 '14 at 01:07
  • That's the thing about sorted random numbers. When you roll two dice and sort the results, getting 1,2 is twice as likely as 1,1. It's still a uniform distribution. If you want unsorted numbers you can remember the original order and restore it after adding **i*K**. Or you just create a random permutation. – pentadecagon Feb 25 '14 at 03:04
  • For the algorithm you give to produce a uniform distribution over the set of valid solutions (which is what I take it Soda Coader is looking for), you'd have to draw sorted sequences uniformly; otherwise there is a bias to those solutions that are generated from sorted sequences with higher probability, no? You're saying your algorithm produces a uniform distribution; but over what set? – G. Bach Feb 25 '14 at 03:17
  • Guys, I ran a simulation using this method and the distribution doesn't look uniform. Please change 'rounds' to something like eg: 100000000 http://ideone.com/psfgDu details: range: 1-10, difference no less than 2 select 3 – Soda Coader Feb 25 '14 at 03:45
  • @SodaCoader Over what set do you want the distribution to be uniform? Over the set of all valid solution sets? pentadecagon's algorithm does not produce a uniform distribution over that set. – G. Bach Feb 25 '14 at 03:56
  • @SodaCoder You should consider K=0, N=2, M=2 to make the point even more clear. You will find (0,1) twice as often as (0,0) and (1,1), just as you would from rolling dice. We create a uniformly distributed *unsorted* set; as a result, the probabilities of multiple identical numbers in the sorted set are different. – pentadecagon Feb 25 '14 at 04:06
  • @G.Bach It should be uniform over the combinations. In the example posted, the distribution, given a large enough "rounds", I think should result in all the combinations having an equal (or near equal) probability of being chosen. – Soda Coader Feb 25 '14 at 04:06
  • @pentadecagon If thats the thinking then perhaps I misworded the question, as my original intent was always to have the combinations be equally likely when selected. – Soda Coader Feb 25 '14 at 04:09
  • @SodaCoader In that case you may want to have another look at my answer; that one draws uniformly from the valid solutions. I'm not sure how pentadecagon's answer could be adjusted to draw uniformly from the set of sorted sequences. – G. Bach Feb 25 '14 at 04:10
  • @SodaCoder The fix is to avoid duplicate values entirely. As a result we already have a gap of 1, so we can reduce the number to be added by 1. http://ideone.com/K8X0z6 – pentadecagon Feb 25 '14 at 05:04
  • @pentadecagon That looks like it's fixed. The only thing to consider now is the time complexity; can you give an analysis of its expected running time? Worst case is that it doesn't terminate, but that's to be expected for some randomized algorithms. – G. Bach Feb 25 '14 at 05:17
  • @G.Bach As long as there are not too many collisions, that means N>M*(K+1), the sorting is still the bottleneck, so it's O(M log M). – pentadecagon Feb 25 '14 at 05:26
  • @pentadecagon Hm, I would expect a lot of collisions as soon as M is roughly sqrt(N) due to the birthday problem. – G. Bach Feb 25 '14 at 14:31
  • @G.Bach Nope. The birthday paradox is about finding any collisions at all, not about their frequency. As long as less than 50% of all numbers are occupied, the probability of a collision stays below 50%. And this is the case for N>M*(K+1). – pentadecagon Feb 25 '14 at 15:06

First off: this will be an attempt to show that there is a bijection between the (M+1)-compositions (with the slight modification that addends are allowed to be 0) of the value N - (M-1)*K and the valid solutions to your problem. After that, we only have to pick one of those compositions uniformly at random and apply the bijection.


Bijection:

Let

x_0 + x_1 + ... + x_M = N - (M-1)*K,    where every x_i >= 0.

Then the x_i form an (M+1)-composition (with 0 addends allowed) of the value N - (M-1)*K (notice that the x_i do not have to be monotonically increasing!).

From this we get a valid solution

m_1 < m_2 < ... < m_M

by setting the values m_i as follows:

m_i = x_0 + x_1 + ... + x_(i-1) + (i-1)*K,    for i = 1..M.

We see that the distance between m_i and m_(i+1) is x_i + K, hence at least K, and m_M = N - x_M is at most N (compare the choice of the composition we started out with). This means that every (M+1)-composition that fulfills the conditions above defines exactly one valid solution to your problem. (You'll notice that we only use x_M as a way to make the sum turn out right; we don't use it for the construction of the m_i.)

To see that this gives a bijection, we need to see that the construction can be reversed; for this purpose, let

m_1 < m_2 < ... < m_M

be a given solution fulfilling your conditions. To get the composition this is constructed from, define the x_i as follows:

x_0 = m_1,    x_i = m_(i+1) - m_i - K for 1 <= i <= M-1,    x_M = N - m_M.

Now first, all x_i are at least 0, so that's alright. To see that they form a valid composition (again, every x_i is allowed to be 0) of the value given above, consider:

x_0 + x_1 + ... + x_M = m_1 + sum_{i=1}^{M-1} (m_(i+1) - m_i - K) + (N - m_M)
                      = m_1 + (m_M - m_1) - (M-1)*K + N - m_M
                      = N - (M-1)*K

The second equality follows since we have a telescoping sum that cancels out almost all of the m_i.

So we've seen that the described construction gives a bijection between the described compositions of N - (M-1)*K and the valid solutions to your problem. All we have to do now is pick one of those compositions uniformly at random and apply the construction to get a solution.


Picking a composition uniformly at random

Each of the described compositions can be uniquely identified in the following way (compare this for illustration): reserve N - (M-1)*K spaces for the unary notation of that value, and another M spaces for M commas. We get an (M+1)-composition of N - (M-1)*K by choosing M of the N - (M-1)*K + M spaces, putting commas there, and filling the rest with |. Then let x_0 be the number of | before the first comma, x_M the number of | after the last comma, and every other x_i the number of | between commas i and i+1. So all we have to do is pick an M-element subset of the integer interval [1; N - (M-1)*K + M] uniformly at random, which we can do for example with the Fisher-Yates shuffle in O(N + M log M) (we need to sort the M delimiters to build the composition), since M*K needs to be in O(N) for any solutions to exist. So if N is bigger than M by at least a logarithmic factor, then this is linear in N.
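A rough Python sketch of this comma construction (names and structure are mine; `random.sample` stands in for the Fisher-Yates subset step):

```python
import random

def compose_and_sample(N, M, K):
    """Sketch of the comma construction above.

    Picks M comma positions out of N - (M-1)*K + M slots uniformly,
    reads off the composition x_0..x_M, and applies the bijection
    m_i = x_0 + ... + x_(i-1) + (i-1)*K.
    """
    slots = N - (M - 1) * K + M
    if slots < M:
        return None  # no valid solution exists
    # uniformly random M-element subset of [1; slots] = the comma positions
    commas = sorted(random.sample(range(1, slots + 1), M))
    result = []
    prev = 0      # position of the previous comma
    total = 0     # running sum x_0 + ... + x_i
    for i, c in enumerate(commas):
        total += c - prev - 1   # number of | strictly between the commas
        result.append(total + i * K)
        prev = c
    return result  # M sorted values in [0, N] with pairwise gaps >= K
```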


Note: @DavidEisenstat suggested that there are more space efficient ways of picking the M-element subset of that interval; I'm not aware of any, I'm afraid.


You can get an error-proof algorithm out of this by doing the simple input validation we get from the construction above that N ≥ (M-1) * K and that all three values are at least 1 (or 0, if you define the empty set as a valid solution for that case).

G. Bach
  • [Sampling a random subset](http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-random-combination-of-values). I believe that this answer correctly pulls a uniform sample. – David Eisenstat Feb 24 '14 at 23:14
  • It was rather verbose, but nonetheless a very interesting and comprehensive explanation. Thank you. – Soda Coader Feb 24 '14 at 23:49
  • @G. Bach For a given N,M,K, taking into account all the viable combinations, if one were to determine the (M-1) differences between consecutive elements in each combination, would the distribution of consecutive differences be uniformly distributed? – Soda Coader Feb 25 '14 at 04:54
  • @SodaCoader I'm not sure I understand the question. What we draw are solution sets consisting of M elements where in sorted order, each element has distance at least K to both its predecessor and successor. For any (N,M,K) where a solution set exists, the possible difference sequences have a known number of occurrences in the set of valid solutions, and they differ from delta sequence to delta sequence. For example, the difference sequence (K,K,...,K) occurs in exactly N-(M-1)*K valid solutions, while the difference sequence (N-(M-1)*K,K,K,...,K) only occurs for one valid solution set. – G. Bach Feb 25 '14 at 05:24

Why not do this:

for (int i = 0; i < M; ++i) {
  pick a random number between K and N/M
  add this number to (N/M) * i
}

Now you have M random numbers, distributed evenly along N, all of which have a difference of at least K. It's in O(n) time. As an added bonus, it's already sorted. :-)

EDIT:

Actually, the "pick a random number" part shouldn't be between K and N/M, but between max(0, K - (N/M * i - previous value)) and N/M. That would ensure that the differences are still at least K, without excluding values that should not be missed.

Second EDIT:

Well, the first case shouldn't be between K and N/M - it should be between 0 and N/M. Just like you need special casing for when you get close to the N/M*i border, we need special initial casing.

Aside from that, the issue you brought up in your comments was fair representation, and you're right. As my pseudocode is presented, it currently completely misses the excess between N/M*M and N. It's another edge case; simply change the random values of your last range.

Now, in this case, your distribution will be different for the last range. Since you have more numbers, you have slightly less chance for each number than you do for all the other ranges. My understanding is that because you're using ">>", this shouldn't really impact the distribution, i.e. the difference in size in the sample set should be nominal. But if you want to make it more fair, you divide the excess equally among each range. This makes your initial range calculation more complex - you'll have to augment each range based on how much remainder there is divided by M.

There are lots of special cases to look out for, but they're all able to be handled. I kept the pseudocode very basic just to make sure that the general concept came through clearly. If nothing else, it should be a good starting point.

Third and Final EDIT:

For those worried that the distribution has a forced evenness, I still claim that there's nothing saying it can't. The selection is uniformly distributed in each segment. There is a linear way to keep it uneven, but that also has a trade-off: if one value is selected extremely high (which should be unlikely given a very large N), then all the other values are constrained:

int prevValue = 0;
for (int i = 0; i < M; ++i) {
    // leave room for the remaining (M - 1 - i) gaps of size K
    int maxRange = N - ((M - 1 - i) * K) - prevValue;
    int nextValue = random(0, maxRange);
    prevValue += nextValue;
    store prevValue as the i-th result;
    prevValue += K;
}

This is still linear and random and allows unevenness, but the bigger prevValue gets, the more constrained the other numbers become. Personally, I prefer my second edit answer, but this is an available option that given a large enough N is likely to satisfy all the posted requirements.

Come to think of it, here's one other idea. It requires a lot more data maintenance, but it still depends only on M and is probably the fairest distribution:

What you need to do is maintain a vector of your valid data ranges and a vector of probability scales. A valid data range is just the list of high-low values where K is still valid. The idea is you first use the scaled probability to pick a random data range, then you randomly pick a value within that range. You remove the old valid data range and replace it with 0, 1 or 2 new data ranges in the same position, depending on how many are still valid. All of these actions are constant time other than handling the weighted probability, which is O(M), done in a loop M times, so the total should be O(M^2), which should be much better than O(NlogN) because N >> M.

Rather than pseudocode, let me work an example using OP's original example:

  • 0th iteration: valid data ranges are from [0...100Mill], and the weight for this range is 1.0.
  • 1st iteration: Randomly pick one element in the one element vector, then randomly pick one element in that range.
    • If the element is, e.g. 12345678, then we remove the [0...100Mill] and replace it with [0...12344678] and [12346678...100Mill]
    • If the element is, e.g. 500, then we remove the [0...100Mill] and replace it with just [1500...100Mill], since [0...500] is no longer a valid range. The only time we will replace it with 0 ranges is in the unlikely event that you have a range with only one number in it and it gets picked. (In that case, you'll have 3 numbers in a row that are exactly K apart from each other.)
    • The weight for the ranges are their length over the total length, e.g. 12344678/(12344678 + (100Mill - 12346678)) and (100Mill - 12346678)/(12344678 + (100Mill - 12346678))

In the next iterations, you do the same thing: randomly pick a number between 0 and 1 and determine which of the ranges that scale falls into. Then randomly pick a number in that range, and replace your ranges and scales.
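A rough Python sketch of this range-maintenance idea (names are mine; it uses plain lists, so it stays quadratic in M as described above):

```python
import random

def segmented_sample(N, M, K):
    """Sketch of the weighted-range idea described above."""
    ranges = [(0, N)]          # inclusive ranges of still-valid values
    picks = []
    for _ in range(M):
        if not ranges:
            return None        # got stuck; a real implementation would retry
        # pick a range weighted by its length, then a value within it
        lengths = [hi - lo + 1 for lo, hi in ranges]
        idx = random.choices(range(len(ranges)), weights=lengths)[0]
        lo, hi = ranges.pop(idx)
        v = random.randint(lo, hi)
        picks.append(v)
        # replace the range with 0, 1 or 2 new ranges,
        # removing everything closer than K to v
        if v - K >= lo:
            ranges.insert(idx, (lo, v - K))
        if v + K <= hi:
            ranges.insert(idx, (v + K, hi))
    return sorted(picks)
```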

By the time it's done, we're no longer at O(M), but the running time still depends only on M rather than N. And this actually gives both a uniform and fair distribution.

Hope one of these ideas works for you!

Scott Mermelstein
  • That's an interesting solution, but does it guarantee every possible combination has the same probability of generation? – Soda Coader Feb 24 '14 at 06:41
  • 'pick a random number between K and N/M' - When N is not evenly divisible by M, won't that cause a bias against the last element? – Soda Coader Feb 24 '14 at 06:44
  • @SodaCoader I can't say that it won't, though I would expect you could vary between floor(N/M) and ceil(N/M) enough so that you can still pretty fairly divide the sections. But with all the ">>" in the question, will you honestly have enough samples to see that there is a bias? – Scott Mermelstein Feb 24 '14 at 06:48
  • @Scott Consider the simple example of 1-10 with M and K of 3 and 2. (1 3 9) and (1 3 10) are valid; I'm not sure how the 2nd one will be generated. – Soda Coader Feb 24 '14 at 06:53
  • Since M = 3, I have to assume you mean (1 5 9) and (1 5 10). Note, that in this example, I don't think we're being true to ">>" (this matters, since distribution issues can even out over large differences), but your example does bring up some edge cases I'll address in my answer. – Scott Mermelstein Feb 24 '14 at 07:02
  • I don't see how this algorithm guarantees a uniform distribution of all possibilities. I think it has a bias towards low values. In particular, how would it generate the "highest possible" set (`n_(i+1) - n_i = K` and `n_M = N`)? – Vincent van der Weele Feb 24 '14 at 07:10
  • These solutions guarantee a certain evenness to the distribution that a random distribution wouldn't have; you'd expect there to be a few gaps of a few times N/M in the results. Uniformity is normally wanted in probability of selection for the random numbers, not in their distribution. – Tony Delroy Feb 24 '14 at 07:13
  • @TonyD I can't disagree with you - this is more even than truly random. But as far as I understand, this doesn't make it less uniformly distributed. I.e. lack of unevenness was not a design constraint, right? – Scott Mermelstein Feb 24 '14 at 07:17
  • This solution can't generate every possible solution set for some choices of N, M, K. For example, take N=100, M=10000, K=10, then a possible solution would be all multiples of 10 up to 1000, but this approach can never generate that since it only generates one number in the range [1; 100], for example. – G. Bach Feb 24 '14 at 11:25