
Just been looking at a code golf question about generating a sorted list of 100 random integers. What popped into my head, however, was the idea that you could generate instead a list of positive deltas, and just keep adding them to a running total, thus:

deltas: 1 3 2  7  2
ints:   1 4 6 13 15

In fact, you would use floats, then normalise to fit some upper limit, and round, but the effect is the same.

Although it wouldn't make for shorter code, it would certainly be faster without the sort step. But the thing I have no real handle on is this: Would the resulting distribution of integers be the same as generating 100 random integers from a uniformly distributed probability density function?

Edit: A sample script:

import random

running = 0.0
limit = 1000
# generate 11 deltas: 10 for the list, plus one extra to define the upper bound
deltas = [random.random() for i in range(11)]
floats = []
for d in deltas:
    running += d
    floats.append(running)
upper = floats.pop()  # the extra 11th running total, used only for scaling
ints = [int(round(f / upper * limit)) for f in floats]
print(ints)

Its output (fair dice roll) was:

[24, 71, 133, 261, 308, 347, 499, 543, 722, 852]

UPDATE: Alok's answer and Dan Dyer's comment point out that using an exponential distribution for the deltas would give a uniform distribution of integers.

Phil H

8 Answers


So you are asking if the numbers generated in this way are going to be uniformly distributed.

You are generating a series:

y_j = Σ_{i=0}^{j} (x_i / A)

where A is the sum of all the x_i, and the x_i are the (positive) deltas.

The y_j will be distributed like a sorted list of uniform random numbers if and only if the x_i are exponentially distributed (with any fixed mean). So, if the x_i are uniformly distributed, the resulting y_j will not be uniformly distributed.

Having said that, it's fairly easy to generate exponential x_i values.

One example would be:

sum := 0
for I = 1 to N do:
    # -ln(RAND) is an exponential deviate (inverse-CDF method)
    X[I] = sum = sum - ln(RAND)
# one extra deviate so that X[N] != 1 after normalisation
sum = sum - ln(RAND)
for I = 1 to N do:
    X[I] = X[I] / sum

and you will have your random numbers sorted in the range [0, 1).

Reference: Generating Sorted Lists of Random Numbers. The paper has other (faster) algorithms as well.

Of course, this generates floating-point numbers. For a uniform distribution of integers, replace sum above by sum/RANGE in the last step (i.e., the R.H.S. becomes X[I]*RANGE/sum), and then round the numbers to the nearest integer.
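
For concreteness, here is a minimal Python sketch of the above, with random.random() standing in for RAND (the function name sorted_random_ints is my own):

import math
import random

def sorted_random_ints(n, upper):
    # n sorted integers in [0, upper], via exponential deltas
    total = 0.0
    sums = []
    for _ in range(n):
        # -ln(RAND) is an exponential deviate (inverse-CDF method)
        total -= math.log(random.random())
        sums.append(total)
    # one extra deviate so the last value lands strictly below the range
    total -= math.log(random.random())
    return [int(round(s * upper / total)) for s in sums]

print(sorted_random_ints(10, 1000))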

Alok Singhal
  • Excellent! It's always been done before! I don't understand the line of code saying X[I] = sum = sum - ln(RAND). Why is it subtracting? BTW, maybe format your equation in HTML as: y_j = Σ_{i=0}^{j} (x_i / A) – Phil H Dec 08 '09 at 11:25
  • This is a much more _useful_ answer than mine! – Rupert Nash Dec 08 '09 at 11:58
  • Thanks Phil, I have edited the reply using HTML. About why the subtraction: we're adding -ln(RAND), which is the inverse cumulative distribution function for the exponential distribution. The subtraction right after the first loop makes sure that X[N] != 1. – Alok Singhal Dec 08 '09 at 15:10
  • Cheers, the notation threw me a bit, and I forgot that RAND would be < 1. In languages other than Python I would put that in the code, but it has its own exponential variate function. Top stuff. – Phil H Dec 08 '09 at 16:13
  • When I tried implementing this with the `RANGE/sum` change, the last item is always equal to `RANGE`. – efritz Nov 12 '12 at 00:27
  • @efritz, I just tried this and I am not seeing this behavior. Are you sure you're doing everything in the sample code? For example, there is a `sum = sum - ln(RAND)` right after the first loop. – Alok Singhal Nov 19 '12 at 16:49
  • Here is a version of the paper that's not behind a paywall: http://repository.cmu.edu/cgi/viewcontent.cgi?article=3483&context=compsci – Thomas Ahle Feb 20 '14 at 18:53

A uniform distribution has an upper and a lower bound. If you use your proposed method, and your deltas happen to be chosen large enough that you run into the upper bound before you have generated all your numbers, what would your algorithm do next?

Having said that, you may want to investigate the Poisson distribution, which is the distribution of interval times between random events occurring with a given average frequency.

Greg Hewgill
  • I think he's already answered this, hasn't he? If you're using floats, you calculate the maximum upper bound from a multiple of the maximum delta size. Then at the end you normalise so that the data range exactly matches the upper bound. – Benj Dec 08 '09 at 10:28
  • Yes, although you'd generate a final extra number to throw away so that the last number isn't always just the upper limit. Added a code sample for clarity. – Phil H Dec 08 '09 at 11:06
  • The Poisson distribution is interesting, and probably what is needed here, but it gives you the probability of a number of events occurring in a given time, rather than the probability distribution of time between single events. Any idea how to modify it to get that? – Phil H Dec 08 '09 at 11:12
  • @Phil H: Use an exponential distribution for time between events. – Dan Dyer Dec 08 '09 at 11:21
  • @Dan: Thanks for that, and also to Alok who pointed this out in his answer. – Phil H Dec 08 '09 at 11:28

If you take the number range as being 1 to 1000, and you have to use 100 of these numbers, the average delta will have to be at least 10, otherwise you cannot reach the 1000 mark. Here is some working to demonstrate it in action...

The chance of any given number appearing in an evenly distributed random selection is 100/1000, i.e. 1/10 - no shock there; take that as the basis.

Assume the deltas are drawn uniformly from 1 to 10.

The odds of getting the number 1 are 1/10 - seems fine. The odds of getting the number 2 are 1/10 + (1/10 × 1/10), because the first delta could be 2, or you could hit two deltas of 1 in a row. The odds of getting the number 3 are 1/10 + (1/10 × 1/10 × 1/10) + (1/10 × 1/10) + (1/10 × 1/10): the first case is a first delta of 3, the second is three deltas of 1 in a row, the third is a delta of 1 followed by a 2, and the fourth is a delta of 2 followed by a 1.

For the sake of my typing fingers, we won't enumerate the combinations that hit 5.

Immediately, the first few numbers have a greater chance of appearing than they would under a straight random selection.

This could be altered by changing the delta value so the fractions are all different, but I do not believe you could find a delta that produced identical odds.

To give an analogy that might make it sink in: if you consider your delta as just 1 to 6 and you run that twice, it is the equivalent of throwing 2 dice - each of the deltas is independent, but you know that 7 has a higher chance of being selected than 2.
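
A quick simulation (the sample count is my own choice) shows the dice analogy in numbers: the total of two uniform deltas piles up in the middle instead of staying flat:

import random
from collections import Counter

# total of two uniform deltas in 1..6, i.e. throwing two dice
counts = Counter(random.randint(1, 6) + random.randint(1, 6)
                 for _ in range(100000))
for total in sorted(counts):
    print(total, counts[total])  # 7 turns up far more often than 2 or 12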

Andrew
  • What you're saying is that you can't use a *uniform* distribution for the delta values. That's correct, and that's the advantage the Poisson distribution offers in this situation. – Greg Hewgill Dec 08 '09 at 10:56

I think it will be extremely similar, but the extremes will differ because of the normalisation. For example, 100 numbers chosen at random between 1 and 100 could all be 1. However, 100 numbers created using your system could all have deltas of 0.01, and when you normalise them you'll scale them up into the range 1 to 100, which means you'll never get that strange possibility of a set of very low numbers.

Benj
  • I would just generate one final delta to get my upper limit, which would be thrown away. Thus this 101st number could be v. large, enabling the circumstance you describe. – Phil H Dec 08 '09 at 10:51
  • OK, so let's say that you want to get 1 to 100 and you randomly generate your upper bound (let's say 67). That means your values will naturally tend to be evenly distributed between 1 and 67, rather than between 1 and 100 with 67 happening to be the highest number. It won't look quite the same... – Benj Dec 08 '09 at 10:58

Alok's answer and Dan Dyer's comment point out that using an exponential distribution for the deltas would give a uniform distribution of integers.

So the new version of the code sample in the question would be:

import random

running = 0.0
limit = 1000
# exponential deltas: 10 for the list, plus one extra to define the upper bound
deltas = [random.expovariate(1.0) for i in range(11)]
floats = []
for d in deltas:
    running += d
    floats.append(running)
upper = floats.pop()  # the extra 11th running total, used only for scaling
ints = [int(round(f / upper * limit)) for f in floats]
print(ints)

Note the use of random.expovariate(1.0), Python's exponential-distribution random number generator (very useful!). Here it's called with rate λ = 1.0, giving a mean of 1.0, but since the script normalises against the last number in the sequence, the choice of mean doesn't matter.

Output (fair dice roll):

[11, 43, 148, 212, 249, 458, 539, 725, 779, 871]
Phil H

Q: Would the resulting distribution of integers be the same as generating 100 random integers from a uniformly distributed probability density function?

A: Each delta will be uniformly distributed. The central limit theorem tells us that the distribution of a sum of a large number of such deviates (since they have a finite mean and variance) will tend to the normal distribution. Hence the later deviates in your sequence will not be uniformly distributed.

So the short answer is "no". Afraid I cannot give a simple solution without doing algebra I don't have time to do today!
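
To illustrate that central limit effect, a small sketch (the sample sizes are my own choice) showing that the sum of 100 uniform deltas clusters tightly around its mean rather than spreading uniformly:

import random
import statistics

# distribution of the sum of 100 uniform [0, 1) deltas
sums = [sum(random.random() for _ in range(100)) for _ in range(10000)]
print(statistics.mean(sums))   # close to the theoretical mean n/2 = 50
print(statistics.stdev(sums))  # close to sqrt(n/12) ~ 2.89, i.e. strongly peaked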

Rupert Nash
  • Do you mean that if I have a uniform (unsorted) sequence of random numbers in the `[1..n]` range, the deltas won't be uniformly distributed in the `[0..n-1]` range? – Andreas Brinck Dec 08 '09 at 11:24
  • I mean that the sum of the uniformly distributed deviates (i.e. the deltas) will not be uniformly distributed, as required. What you say is also trivially true, since on average half of the differences of the unsorted numbers you mention will be negative. – Rupert Nash Dec 08 '09 at 11:57
  • @Andreas: No, They would have a triangular distribution: http://en.wikipedia.org/wiki/Triangular_distribution – Ofri Raviv Dec 08 '09 at 12:23

The reference (1979) in Alok's answer is interesting. It gives an algorithm for generating the uniform order statistics not by addition but by successive multiplication:

max = 1.
for i = N downto 1 do
   out[i] = max = max * RAND^(1/i)

where RAND is uniform on [0,1). This way you don't have to normalize at the end, and in fact don't even have to store the numbers in an array; you could use this as an iterator.
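
As an illustration, a minimal Python generator for this method, with random.random() standing in for RAND (the name descending_uniforms is my own):

import random

def descending_uniforms(n):
    # yields n uniform [0, 1) order statistics, largest first
    current = 1.0
    for i in range(n, 0, -1):
        current *= random.random() ** (1.0 / i)
        yield current

print(list(descending_uniforms(10)))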

The Exponential Distribution: Theory, Methods and Applications by N. Balakrishnan and Asit P. Basu gives another derivation of this algorithm on page 22 and credits Malmquist (1950).

Stanislav
  • At first I thought that this Malmquist was the same one after whom the "Malmquist bias" in astronomy is named, but it turned out not to be so. So there have been at least two famous Malmquist statisticians :-) – Alok Singhal Dec 13 '09 at 02:52
  • How does this avoid the normalisation? For integers between 1 and 255, your code would just produce values that are increasingly small, and all less than 1. The set must be produced in one go, if there is a max int supplied, because otherwise the mean etc change. – Phil H Dec 14 '09 at 14:55
  • Start with max = 255 and quantize the output with Int(*). Add one to get [1,255]. – Stanislav Dec 15 '09 at 04:29

You can do it in two passes:

In the first pass, generate deltas between 0 and (MAX_RAND/n).

In the second pass, normalise the random numbers to be within bounds.

Still O(n), with good locality of reference.
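
A minimal Python sketch of the two passes as described, with uniform deltas (the names are my own):

import random

def two_pass_sorted(n, bound):
    # first pass: running sums of deltas in [0, bound/n)
    running = 0.0
    values = []
    for _ in range(n):
        running += random.uniform(0, bound / n)
        values.append(running)
    # second pass: normalise so the values fit within [0, bound]
    scale = bound / running
    return [v * scale for v in values]

print(two_pass_sorted(100, 1000))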

Will