4

What is the fastest way to generate a specific nr of integers of random value uniformly distributed within a specific range and with a minimum distance between each element?

For example, given a sequence range between 0 and 20, we want to create 5 elements with at least 3 points distance between each element, the result could be something like this [0,5,11,14,19] or [2,5,9,13,18]

I created a loop that achieves this but it is very slow when i want to create ranges in the order of millions.

RaduS
  • 2,465
  • 9
  • 44
  • 65
  • what is the condition for starting point and how you define the step between them?!! – Charif DZ Apr 20 '18 at 19:39
  • starting point can be anything close to the min distance, in this case, it could be, 0,1,2,3,4,5,6. The step between them is min distance plus random distance, but not big enough so that the other numbers won't fit in¨ – RaduS Apr 20 '18 at 19:41
  • why this value exactly, how did you calculate the min distance? – Charif DZ Apr 20 '18 at 19:42
  • well min distance is just a part of the condition, it is given. For example, the min distance will never be bigger so that the elements won't fit in the given sequence – RaduS Apr 20 '18 at 19:44
  • Do you want uniformly distributed, or do you want to enforce the minimum distance? You can't have both. – Mark Dickinson Apr 20 '18 at 19:47
  • min distance is just so that the numbers wont be to closer to each other. The real ranges i am working on are like this: 7 elements between the range of 0 and 10000 with a min distance of 300 between each – RaduS Apr 20 '18 at 19:50
  • Sure, but the minimum distance condition is inconsistent with "uniformly distributed within a specific range". For example, suppose you want 2 samples from `[0, 6]` with a minimum distance of 4. Then neither sample can be equal to 3, so there's no hope of a uniform distribution. – Mark Dickinson Apr 20 '18 at 19:54
  • but there will never be a request of 2 samples between 0 and 6 with min dis of 4. Like i said the given nr of elements, the range and the min distance are precalculated so that everything fits in – RaduS Apr 20 '18 at 19:56
  • You can fit 2 samples in `[0, 6]` with a minimum distance of 4, so "everything fits in" is true. What exactly are your constraints? The question doesn't make sense as it stands. – Mark Dickinson Apr 20 '18 at 19:59

3 Answers3

3

How about the following recipe: If you want a gap of 3 between your 5 adjacent elements, but want a total range of 20, then you effectively have 20 - (5-1)*3 steps of "slack" that you can randomly distribute in the gaps between your elements. Suppose we generate a number in that range, and scatter it between the elements, then we end up with code something like the following:

import numpy, random

n = 5
limit = 20
mingap = 3

slack = 20 - mingap * (n - 1)

def generate():
    steps = random.randint(0, slack)

    increments = numpy.hstack([numpy.ones((steps,)), numpy.zeros((n,))])
    numpy.random.shuffle(increments)

    locs = numpy.argwhere(increments == 0).flatten()
    return numpy.cumsum(increments)[locs] + mingap * numpy.arange(0, n)

If you then invoke this generate() function ten times, you get a collection of vectors something like the following:

[  0.   3.   6.   9.  12.]
[  0.   3.   6.  10.  13.]
[  2.   5.   8.  12.  15.]
[  1.   4.   7.  12.  16.]
[  0.   4.   7.  10.  13.]
[  0.   3.   6.   9.  12.]
[  1.   4.   9.  12.  16.]
[  0.   7.  10.  13.  16.]
[  0.   5.   8.  11.  14.]
[  1.   4.   8.  11.  17.]
rwp
  • 1,786
  • 2
  • 17
  • 28
  • well, thank you @rwp, this looks really interesting, i will do a bit more testing with this, but so far looks like what i was looking for – RaduS Apr 20 '18 at 19:46
  • with a bit of tweaking this is what i was looking for. I just added some filters on top and it works perfectly. Thank you – RaduS Apr 20 '18 at 20:09
2

This:

np.cumsum(np.ones((5,), np.int) * 3 + np.random.randint(0, maxn, (5,))) - 3

will give you 5 random numbers spaced by at least 3 points.

You have to tweak maxn to get the correct maximum value of the random numbers. Perhaps you may want to have a slightly bigger maxn and the reject samples whose elements exceed your maximum value (20).

Note: you didn't say what final distribution you are looking for, e.g. if you want the resulting samples uniformly distributed over the space of the valid samples, or anything else, if that matters.

fferri
  • 18,285
  • 5
  • 46
  • 95
  • It's not obvious that you can choose a fixed `maxn` that will guarantee that the largest value in the sequence is no bigger than 20. Setting a larger value of `maxn` will probably skew the distribution of values away from zero, which may not be what the OP intended. – rwp Apr 20 '18 at 19:12
  • hmmm, i tried it now with different settings and i don't see how i can get the sequence i am looking for – RaduS Apr 20 '18 at 19:15
  • @rwp: please clarify. do you mean if doing that *and* rejection sampling? because with high value of `maxn` you would have to reject lots of samples, which is what the OP wanted to avoid in the first place, if I understood correctly. – fferri Apr 20 '18 at 19:19
  • @RaduS: with `maxn`=3 you get more or less what you are looking for. with `maxn`=10 is more "fair" altough you have to reject samples whose last number exceeds 20. Is the final distribution of the numbers relevant in some way? – fferri Apr 20 '18 at 19:22
  • @rwp: I actually think it's the opposite: setting `maxn` to a low value will skew the distribution. in fact, take for example `maxn`=3, it will never generate the sample `[0,10,13,16,19]`, altough this is a perfectly valid sample according to OP's rule. – fferri Apr 20 '18 at 19:26
  • Yes, i will put that into the question, the samples must be uniformly distributed. – RaduS Apr 20 '18 at 19:31
  • 2
    you can't have the resulting numbers *uniformly distributed* and at the same time respect the condition of the minimum distance, as that is going to remove some combinations, such as [0,0,0,0,0] or [0,2,4,6,8], hence each distribution cannot be uniformly random. But, if you enumerate the valid configurations, then you can uniformly sample from that. I'll add another answer to provide an example of this, if that works for you... – fferri Apr 20 '18 at 19:53
1

This answer is a followup to the comments on my previous answer.

You said you want uniformly distributed numbers, but that of course is not possible to have while respecting the condition that the numbers must be spaced by at least 3 points.

So, I provide you a different definition of uniformly randomness: suppose you can enumerate all the valid vectors respecting your condition. I wrote a function to do so:

def space_gen(xmin, xmax, len, min_dist, result=[]):
    if len:
        for x in range(xmin, xmax - (len - 1) * min_dist):
            yield from space_gen(x + min_dist, xmax, len - 1, min_dist, result + [x])
    else:
        yield result

Let's consider a smaller instance of the problem. Suppose you want vectors of 3 random numbers from 0 to 10 (excluded), spaced by at least 4 points:

>>> list(space_gen(0,10,3,4))
[[0, 4, 8], [0, 4, 9], [0, 5, 9], [1, 5, 9]]

that list is the complete enumeration of all valid results according to that rule.

Now you can uniformly sample from this list (see for example random.choice).

Now it's possible that your problem size (i.e. the range, or the vector size) make the problem instance too big to be exhaustively enumerated.

But you can still use this "brute force" enumeration to check if a method generates truly uniformly distributed samples.

For the problem instance of your question (0-20 range, 5 length, 3 min. dist) it's still doable:

>>> len(list(space_gen(0,21,5,3)))
1287

For example, we can check if rwp's recipe generates uniformly distributed samples (according to this definition):

space = list(space_gen(0, 21, 5, 3))
counter = {tuple(x): 0 for x in space}
for _ in range(200000):
    x = tuple(map(int,generate()))
    counter[x] += 1
import matplotlib.pyplot as plt
a = np.array(sorted(counter.values()))
plt.hist(a, bins=len(space))
plt.show()

and we observe this distribution of counts:

histogram plot

Clearly there are some vectors occurring way more often than other vectors.

We can also check the first solution I proposed:

def generate1():
    maxn=15
    while 1:
        x = np.cumsum(np.ones((5,), np.int) * 3 + np.random.randint(0, maxn, (5,))) - 3
        if x[-1] <= 20:
            return x

even with maxn=15 and using rejection sampling, it's still a bit skew and not perfectly uniform. Using the same benchmark/plot code as before:

histogram plot

fferri
  • 18,285
  • 5
  • 46
  • 95