19

I need to generate n percentages (integers between 0 and 100) such that the n numbers together sum to 100.

If I just call nextInt() n times, each time ensuring that the argument is 100 minus the previously accumulated sum, then my percentages are biased (i.e., the first generated number will usually be the largest, and so on). How do I do this in an unbiased way?

Nikita Rybak
erw
  • Interesting question, but I think the answer will not be Java-specific. – Joachim Sauer Jun 09 '10 at 16:56
  • 2
    You could generate randomly until the sum would exceed 100 and then 'cap' the final number. This way all numbers except maybe the final number are random. I don't see how you can 'randomly' arrive at a pre-set sum defined by a non-random constraint. – Alex Jun 09 '10 at 16:58
  • Randomize the order you do them in. Let's say you have 5 numbers then you might do #3 first, then next time #4, next #1 etc ... – Romain Hippeau Jun 09 '10 at 17:02
  • 2
    I think this question can't be answered without first considering what you expect the distribution of these random numbers to look like. If you want a "normal distribution" (bell-curve), then @LanceH's answer should work. If you expect a "uniform distribution", I suspect that's impossible. What distribution you want totally drives the nature of the solution. – Kevin Bourrillion Jun 09 '10 at 18:20
  • 2
    @Kevin. If people omit the probability distribution, I would say it is reasonable to assume uniform. And, getting a uniform distribution is very much possible, why do you say it is not? –  Jun 09 '10 at 19:23
  • It just doesn't seem possible to me. Let's say you want five integers that sum to 100. And you want each integer to be uniformly distributed over some range, be it 1-39 or 5-35 or whatever. For the sake of argument, let's say you want the range 10-30 inclusive. And you generate many such five-integer lists. For the distribution to be uniform, 10s would have to appear with just as much likelihood as 20s. Likewise a list containing three 20s should be about as common as a list containing three 10s, except wait! The latter is impossible. See the problem? – Kevin Bourrillion Jun 10 '10 at 16:30
  • @kevin. The method proposed by eruonna will give an even distribution of combinations. That is, (20,20,20,20,20) would be just as frequent as (100,0,0,0,0). Whether that is realistic behavior for a group of %'s or not depends on what they are %'s of. The bell curve approach would map very well to a lot of real world situations like this. eruonna's solution, however, might be good for the selection of an arbitrary set of %'s, say for the start of a simulation or game. – LanceH Jun 12 '10 at 10:25
  • @erw put them in a List and use Collections.shuffle() – josefx Jun 12 '10 at 10:36

12 Answers

12

A couple of answers suggest picking random percents and taking the differences between them. As Nikita Rybak points out, this will not give the uniform distribution over all possibilities; in particular, zeroes will be less frequent than expected.

To fix this, think of starting with 100 'percents' and inserting dividers. I will show an example with 10:

 % % % % % % % % % % 

There are eleven places we could insert a divider: between any two percents or at the beginning or end. So insert one:

 % % % % / % % % % % % 

This represents choosing four and six. Now insert another divider. This time, there are twelve places, because the divider already inserted creates an extra one. In particular, there are two ways to get

 % % % % / / % % % % % % 

either inserting before or after the previous divider. You can continue the process until you have as many dividers as you need (one fewer than the number of percentages you want).

 % % / % / % / / % % % / % % % / 

This corresponds to 2,1,1,0,3,3,0.

We can prove that this gives the uniform distribution. The number of compositions of 100 into k parts is the binomial coefficient (100+k-1 choose k-1), that is

(100+k-1)*(100+k-2)*...*101 / ((k-1)*(k-2)*...*2*1)

so the probability of choosing any particular composition is the reciprocal of this.

As we insert dividers one at a time, first we choose from 101 positions, then 102, then 103, and so on up to 100+k-1. So the probability of any particular sequence of insertions is 1 / ((100+k-1)*...*101). How many insertion sequences give rise to the same composition? The final composition contains k-1 dividers, and they could have been inserted in any order, so there are (k-1)! sequences that give rise to a given composition. Multiplying, the probability of any particular composition is (k-1)! / ((100+k-1)*...*101), which is exactly the reciprocal of the binomial coefficient above, as it should be.

In actual code, you probably wouldn't represent your steps like this. You should be able to just hold on to numbers, rather than sequences of percents and dividers. I haven't thought about the complexity of this algorithm.
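
For what it's worth, here is one way the insertion process could be sketched in Java, keeping the literal sequence of percents and dividers as a list. The class and method names are mine, not from this answer, and no attempt is made at efficiency; list insertion makes this roughly O(n*(100+n)), which is fine at these sizes.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class UniformComposition {
    // Split `total` into `parts` non-negative integers, uniformly over all compositions.
    static int[] randomComposition(int total, int parts, Random rnd) {
        // Start with `total` percent symbols; true = '%', false = divider.
        List<Boolean> symbols = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            symbols.add(true);
        }
        // Insert parts-1 dividers, one at a time, each into a uniformly random gap
        // (the gaps include the two ends, so there are size+1 of them).
        for (int d = 0; d < parts - 1; d++) {
            int gaps = symbols.size() + 1;
            symbols.add(rnd.nextInt(gaps), false);
        }
        // Read off the run lengths of '%' between consecutive dividers.
        int[] result = new int[parts];
        int part = 0;
        for (boolean isPercent : symbols) {
            if (isPercent) {
                result[part]++;
            } else {
                part++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(randomComposition(100, 7, new Random())));
    }
}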

eruonna
  • +1 This should be equivalent to randomly picking a number that has already been drawn (with multiplicity) or a new one. – starblue Jun 09 '10 at 19:31
  • 1
    It is not, because there is the possibility of choosing the space before or after a divider. Essentially, there are two ways to pick a number already drawn. (Or n+1 if that number has been drawn n times.) – eruonna Jun 09 '10 at 19:35
  • Oh, if you mean either picking from the list of already drawn numbers or any number from 0-100 (not necessarily new), then they are the same. You just have to make sure to give the same probability to any choice. – eruonna Jun 09 '10 at 19:38
  • Yes, that's the way to do it. – starblue Jun 10 '10 at 11:04
  • There's something fishy about this to me, although I don't understand it deeply. For a given result, there were multiple possible sequences of divider insertions that would lead to it. When the result contains no zeroes, there were only n! ways to get that result (corresponding to the different orders in which the dividers could have been placed). But when there are zeroes, there's more duplication than that... or is there? Combinatorics is hard. – Kevin Bourrillion Jun 10 '10 at 16:37
  • *"I haven't thought about the complexity of this algorithm."* A simple O(n) implementation would be to generate a number from 1-100, then another from 1-101, another from 1-102, ..., last from 1-(100+n-1). Note that when a new number is generated, all previous numbers larger than it need to be incremented by 1 (which signifies all larger elements being "bumped" to the right in your diagram above). The final set of numbers is the positions of all the dividers. – BlueRaja - Danny Pflughoeft Jun 10 '10 at 20:39
  • Come to think of it, that's `O(n^2)` - whoops! – BlueRaja - Danny Pflughoeft Jun 10 '10 at 20:48
  • @Kevin: There are always exactly (k-1)! ways to get a particular result. To see this, consider that once you know the result, you know the final configuration of dividers, you just don't know what order they were put down in. Any order is possible; you can derive the sequence of insertions by adding dividers one at a time in order. – eruonna Jun 11 '10 at 16:27
  • As far as I know you're right. Did I mention combinatorics is hard? And I even used to be good at it once upon a time. :) – Kevin Bourrillion Jun 11 '10 at 16:43
7

Generate n random integers with any range (call them a[1]..a[n]). Sum up your integers and call that b. Your percentages will be [a[1]/b, ..., a[n]/b].

Edit: good points, rounding the results to total exactly 100 is non-trivial. One approach would be to take the floor of a[x]/b for x in 1..n as your integers, then distribute the remaining units 100-(sum of integers) randomly. I'm not sure if this would introduce any bias into the result.
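
A rough sketch of that idea, assuming a source range of 1..1000 and handing the leftover units to randomly chosen entries (both of those choices are mine, not part of this answer):

import java.util.Random;

public class NormalizedPercentages {
    static int[] percentages(int n, Random rnd) {
        int[] a = new int[n];
        int b = 0;
        for (int i = 0; i < n; i++) {
            a[i] = 1 + rnd.nextInt(1000);   // any positive range works
            b += a[i];
        }
        int[] p = new int[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            p[i] = a[i] * 100 / b;          // integer division, i.e. floor(100 * a[i] / b)
            assigned += p[i];
        }
        // Hand out the remaining 100 - assigned percentage points at random.
        for (int r = assigned; r < 100; r++) {
            p[rnd.nextInt(n)]++;
        }
        return p;
    }
}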

ataylor
  • 1
    The only problem is that a[0]/b may be not integer. – Nikita Rybak Jun 09 '10 at 17:05
  • @Nikita: So what? Just round so that after rounding the sum = 100. – Ben S Jun 09 '10 at 17:07
  • this seems like a good solution. say the ratio turns out not to be an integer, could I then (at random) pick one of the numbers and add to it the difference between the total sum and 100? This seems to minimize the distribution bias. – erw Jun 09 '10 at 17:08
  • 2
    @Ben That tinkering will make distribution not so even (some combinations will happen more often than others). But I agree that it's pretty close if "any range" is big enough. – Nikita Rybak Jun 09 '10 at 17:11
  • 1
    -1: Sorry, this is useless and making it right will make it unnecessarily complicated. I believe mikera's solution is much better. –  Jun 09 '10 at 18:14
  • After doing this, you'll have a bunch of floats with fractional percentages that sum to 1. Do one more random() and assign that last % point based on the chance that each one gets it. So if you have (33.3,33.3 and 33.4), then the first two would have a 30% chance and the last one a 40% chance of getting that last % point. – LanceH Jun 10 '10 at 08:32
6

You possibly need to define what you really mean by "biased" - but if all you care about is that the distribution of the numbers is independent of their position, then you can simply create the numbers in a "biased" way and then randomise their positions.

Another "unbiased" method would be to create n-1 random percentages, sort them (call them x1, x2, x3, ...) and then define your final percentages to be:

x1
x2 - x1
x3 - x2
...
100 - x(n-1)

That way you will get n random numbers that add to 100.
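
A minimal sketch of that recipe; the helper name and the trick of appending 0 and 100 as sentinels before sorting are my additions, and Adamski's answer below works out essentially the same idea in full:

import java.util.Arrays;
import java.util.Random;

public class SortedMarkers {
    static int[] percentages(int n, Random rnd) {
        int[] x = new int[n + 1];
        for (int i = 0; i < n - 1; i++) {
            x[i] = rnd.nextInt(101);    // n-1 markers in [0, 100]
        }
        x[n - 1] = 0;                   // sentinel for the low end
        x[n] = 100;                     // sentinel for the high end
        Arrays.sort(x);
        int[] p = new int[n];
        for (int i = 0; i < n; i++) {
            p[i] = x[i + 1] - x[i];     // consecutive differences, summing to 100
        }
        return p;
    }
}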

mikera
  • @erw: This solution is much better than the one currently at the top (in number of votes)! It even has one random number call less and involves no rounding etc. Also, I think we can _prove_ that it generates each combination with the same probability. –  Jun 09 '10 at 17:36
  • +1 Same as Adamski's solution, and has same problem with zeroes. But perfect if all elements are > 0. – Nikita Rybak Jun 09 '10 at 17:55
  • @Nikita: If you view that as a generating a multi-set of n-1 numbers, then it gives an unbiased distribution (I think, but haven't tried proving it), as each distribution we need corresponds exactly to one multi-set. My statement about the one random number call less is not really accurate as generating random numbers one by one to create the multiset is not unbiased, as you say. In fact, I believe it is possible to do this with just _one_ random number call if n is small enough. See: http://stackoverflow.com/questions/3003192/five-unique-random-numbers-from-a-subset/3003253#3003253. –  Jun 09 '10 at 18:33
    To add to previous comment. Since 100 choose 50 < 2^128, we can do this with just two 64 bit random number calls! (I am assuming n <= 100). –  Jun 09 '10 at 18:47
  • 2
    @Moron why would we want to limit ourselves by one random call? And the problem I mentioned (you were arguing with that, right?) is in the fact that he's not generating multisets correctly: there'll be two 'paths' to generate (1, 2) multiset (from sequence [1, 2] and [2, 1] and only one way for (1, 1). So, first one has twice as big probability. – Nikita Rybak Jun 09 '10 at 18:48
  • @Nikita: I agree completely. He is not generating multisets correctly. What I am saying is that a solution to generate the multiset (not a sequence of n-1 numbers) but a _multiset of n-1 numbers_, which we need for uniform distribution, can be done using _one_ random number call. Maybe I should add that as an answer! –  Jun 09 '10 at 18:51
5

This problem is known as uniform sampling from a simplex and Wikipedia gives two algorithms:

http://en.wikipedia.org/wiki/Simplex#Random_sampling

See also these related questions:

dreeves
  • 1
    Unfortunately the wikipedia article is a little inaccessible to most of this audience. I need it dumbed down. :) – Kevin Bourrillion Jun 10 '10 at 16:43
  • @Kevin: The first algorithm is the same as [ataylor's](http://stackoverflow.com/questions/3007975/java-random-percentages/3008067#3008067), the second is the same as [mikera's](http://stackoverflow.com/questions/3007975/java-random-percentages/3008078#3008078) (which unfortunately suffers from the problem Nikita brought up - it's mentioned in the article there's a known problem, but they don't specify what). [eruonna's solution](http://stackoverflow.com/questions/3007975/java-random-percentages/3009154#3009154) solves that problem - interesting that Wikipedia doesn't know about it yet :) – BlueRaja - Danny Pflughoeft Jun 10 '10 at 20:32
  • @BlueRaja: Not true about ataylor's solution. You have to generate from an exponential distribution and then normalize. – dreeves Jun 10 '10 at 22:27
  • @BlueRaja: eruonna's solution has the same (great!) idea as the proof that the total number of sets is 100+N-1 choose N. It uses too many random number calls though. Imagine generating such numbers a million times. Similar ideas can give us the set with just _one_ (or two) random number calls for n<=100. I have described that in my answer. –  Jun 12 '10 at 13:47
3

The key is to generate N random numbers between 0 and 100 but to use these as "markers" rather than the final sequence of numbers to output. Then you iterate through your list of markers in ascending order, calculating each percentage to output as (current marker - previous marker).

This will give a much more even distribution than simply generating and outputting each number one at a time.

Example

import java.util.Random;
import java.util.TreeSet;
import java.util.SortedSet;

public class Main {
  public static void main(String[] args) {
    Random rnd = new Random();
    SortedSet<Integer> set = new TreeSet<Integer>();

    for (int i=0; i<9; ++i) {
      set.add(rnd.nextInt(101));
    }

    if (set.last() < 100) {
      set.add(100);
    }    

    int prev = 0;
    int total = 0;    
    int output;

    for (int j : set) {
      output = j - prev;
      total += output;
      System.err.println(String.format("Value: %d, Output: %d, Total So Far: %d", j, output, total));
      prev = j;
    }
  }
}

Output

$ java Main
Value: 0, Output: 0, Total So Far: 0
Value: 2, Output: 2, Total So Far: 2
Value: 55, Output: 53, Total So Far: 55
Value: 56, Output: 1, Total So Far: 56
Value: 57, Output: 1, Total So Far: 57
Value: 69, Output: 12, Total So Far: 69
Value: 71, Output: 2, Total So Far: 71
Value: 80, Output: 9, Total So Far: 80
Value: 92, Output: 12, Total So Far: 92
Value: 100, Output: 8, Total So Far: 100
Adamski
  • I suspect the distributions will be same as ataylor or my solutions. Yours however involves inserting into a BST so O(NlogN) rather than O(N). Also you need to add a last entry to get sum to 100 – Il-Bhima Jun 09 '10 at 17:21
  • +1, I like the idea. Each valid distribution translates into a set of (n - 1) markers, so we could generate markers as well. The code looks quite difficult, though, I'm sure it can be made simpler :) – Nikita Rybak Jun 09 '10 at 17:46
  • It does have a problem with zeroes, though. Set of markers (1, 2, .., n - 1) will be generated in n! cases (considering order), while set of markers (100, 100, .., 100) (leading to distribution 100, 0, .., 0) will be generated only in one case. – Nikita Rybak Jun 09 '10 at 17:53
3

Make an array with n buckets. Randomly drop 100 %'s, one at a time, into the buckets of that array. The example shows n=7.

import java.util.Random;

public class random100 {
    public static void main(String[] args) {
        Random rnd = new Random();
        int[] percents = new int[7];
        for (int i = 0; i < 100; i++) {
            int bucket = rnd.nextInt(7);
            percents[bucket] = percents[bucket] + 1;
        }
        for (int i = 0; i < 7; i++) {
            System.out.println("bucket " + i + ": " + percents[i]);
        }
    }
}
LanceH
  • 2
    Nope :) Your program for n == 2 has only one way of reaching distribution (100, 0): rnd.nextInt == 0 on each step. But there're lots of ways to reach (50, 50): the binomial coefficient (100 choose 50). And both distributions should have equal probability. – Nikita Rybak Jun 09 '10 at 17:33
  • rnd.nextInt(2) yields 0 or 1, seems to work just fine. I'm getting results of (46, 54), (48, 52), (56,44), etc... – LanceH Jun 09 '10 at 17:45
  • @LanceH notice how your examples are clustered close to (50,50). Those closer to (50,50) have a much higher probability of occurring than e.g. (100, 0). – ataylor Jun 09 '10 at 17:50
  • 1
    Misread what Nikita wrote. Yes, it tends toward the middle, which looks particularly bad for n=2. But with n=2 the solution is trivial: a simple random draw for the first number, and the second number is determined. For n=3, no expectation is mentioned. If distribution is equal between all 3, then distribution for a single bucket is not going to be uniform from 0 to 100...which *could* imply the same thing for n=2. At the least it implies that n=2 is an entirely different beast than n>2. There is no mention of expected distribution. – LanceH Jun 09 '10 at 18:04
  • Exactly true that this answer is perfectly valid since no expected distribution was mentioned. – Kevin Bourrillion Jun 10 '10 at 16:44
2

To be precise, it depends on exactly how you want the samples to be unbiased. Here is a rough method which will give you a reasonably good result.

  1. Generate n-1 integers from 0,..100, say a[i] for i = 0, to n-2.
  2. Let total be the sum of these numbers
  3. Compute b[i] = floor(100*a[i]/total) for i = 0, to n-2
  4. Set b[n-1] = 100 - (b[0] + ... b[n-2]).

Then b is your resulting array of percentages.

The last one will be biased, but the rest should be uniform.
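
A direct sketch of steps 1-4; the helper name and the guard against an (unlikely) all-zero draw are mine:

import java.util.Random;

public class RoughPercentages {
    static int[] percentages(int n, Random rnd) {
        int[] a = new int[n - 1];
        int total = 0;
        for (int i = 0; i < n - 1; i++) {   // step 1
            a[i] = rnd.nextInt(101);
            total += a[i];
        }
        if (total == 0) {
            total = 1;                      // avoid dividing by zero if every a[i] was 0
        }
        int[] b = new int[n];
        int sumSoFar = 0;
        for (int i = 0; i < n - 1; i++) {   // step 3
            b[i] = 100 * a[i] / total;      // integer division, i.e. floor(100 * a[i] / total)
            sumSoFar += b[i];
        }
        b[n - 1] = 100 - sumSoFar;          // step 4
        return b;
    }
}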

Of course if you want to do this in a more accurate way you'll have to use Gibbs sampling or Metropolis-Hastings.

Il-Bhima
  • I think this is the way to go. If you could remove the integer restriction, you could get rid of the rounding which may lead to some biases, but the whole process of generating 100 numbers and then normalizing seems sound. – Grembo Jun 09 '10 at 19:54
0

Here is some code that I wrote for a program I am creating. I found this thread when trying to solve this exact problem, so hopefully this will help some others. The design was based on reading @eruonna's response above.

public static int[] randomNumbers(int numOfNumbers){

    int percentN = numOfNumbers;

    int[] intArray = new int[101];

    //set up the array with values
    for(int i = 0; i < intArray.length; i++){
        intArray[i] = i;
    }

    //set up an array to hold the selected values
    int[] selectionArray = new int[(percentN - 1)];

    //run a for loop to go through and select random numbers from the intArray
    for(int n = 0; n < selectionArray.length; n++){
        int randomNum = (int)(Math.random() * intArray.length); //index 0..100, so the marker value 100 can be chosen too
        selectionArray[n] = intArray[randomNum];
    }

    //bubble sort the items in the selectionArray (out must go down to 1 so the first pair is compared too)
    for(int out = (selectionArray.length - 1); out > 0; out--){
        for(int in = 0; in < out; in++){
            if(selectionArray[in] > selectionArray[in + 1]){
                int temp = selectionArray[in];
                selectionArray[in] = selectionArray[in + 1];
                selectionArray[in + 1] = temp;
            }
        }
    }

    //create an array to hold the calculated differences between each of the values to create random numbers
    int[] calculationArray = new int[percentN];

    //calculate the difference between the first item in the array and 0
    calculationArray[0] = (selectionArray[0] - 0);

    //calculate the difference between the other items in the array (except for the last value)
    for(int z = 1; z < (calculationArray.length - 1); z++){
        calculationArray[z] = (selectionArray[z] - selectionArray[z - 1]);
    }

    //calculate the difference for the last item in the array
    calculationArray[(calculationArray.length - 1)] = (100 - selectionArray[(selectionArray.length - 1)]);

    return calculationArray;

}
Declan
0

Once you pick the numbers with the method you describe, shuffle the order of the numbers. This way the final list of numbers has a more even distribution.

However, note that no matter what you do, you can't get a perfectly even distribution since once you start to choose numbers, your random trials are not independent. See ataylor's answer.

Note also that the algorithm you describe might not give you the required output. The last number cannot be random since it must make the sum equal 100.
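
A minimal sketch of what this suggests, combining the question's diminishing-range approach with a final shuffle (the method name is mine; Collections.shuffle is the shuffle josefx mentioned in the comments above):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffledRemainders {
    static List<Integer> percentages(int n, Random rnd) {
        List<Integer> result = new ArrayList<>();
        int remaining = 100;
        for (int i = 0; i < n - 1; i++) {
            int p = rnd.nextInt(remaining + 1);  // 0..remaining
            result.add(p);
            remaining -= p;
        }
        result.add(remaining);                   // the last value is forced, as noted above
        Collections.shuffle(result, rnd);        // even out the positional bias
        return result;
    }
}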

Ben S
0

First, the obvious solution.

int[] a = new int[n];
do {
    for (int i = 0; i < n; ++i) {
        a[i] = rnd.nextInt(101);            // random number between 0 and 100
    }
} while (Arrays.stream(a).sum() != 100);    // repeat until the numbers sum to exactly 100

It's not perfect in terms of complexity (number of iterations to reach sum 100 can be quite large), but distribution is surely 'unbiased'.

edit
Similar problem: how to generate random point in a circle with radius 1 and center in (0, 0)? Solution: continue generating random points in range (square) [-1..1,-1..1] until one of them fits the circle :)

Nikita Rybak
  • What if the first random number is 99 and the second one is 2? – Joachim Sauer Jun 09 '10 at 17:02
  • What if `n=3` and the first two numbers sum to > 100? – Ben S Jun 09 '10 at 17:03
  • @Joachim, @Ben Then final sum will be bigger than 100 and we'll have to repeat process. – Nikita Rybak Jun 09 '10 at 17:03
  • This won't work - The OP wants to generate a specific number of percentages but your code could potentially generate just 1 (if the first random number generated is 100). – Adamski Jun 09 '10 at 17:05
  • @Adamski Please check the code. It will _always_ generate n integers, even if first one is 100. – Nikita Rybak Jun 09 '10 at 17:06
  • @Nikita, That may be the case but it won't generate N integers that sum to 100 as requested. – Adamski Jun 09 '10 at 17:14
  • @Adamski You don't believe in power of random? :) Repeating 'do .. until' 1000 times should be enough for most values of n. – Nikita Rybak Jun 09 '10 at 17:18
  • By "most values of n" do you mean any integer between 1 and infinity? – Adamski Jun 09 '10 at 17:21
  • 1
    lols this is the equivalent of the bogosort. The algorithm is correct but will require an enormous amount of iterations to complete. I'm working out the number now! I hope you meant it as a joke :) – Il-Bhima Jun 09 '10 at 17:25
  • @Adamski Ok, smart pants, I mentioned complexity issue in my post :) – Nikita Rybak Jun 09 '10 at 17:28
  • @Il-Bhima yeah, except that success probability is pretty decent for not too big n (unlike 1/n! with bogosort) :) – Nikita Rybak Jun 09 '10 at 18:06
  • For two numbers the probability is 1/100, so you'd need 100 iterations for just two, and for n = 3 its much larger. I would say its at least exponential. – Il-Bhima Jun 09 '10 at 18:26
  • @Il-Bhima You're right, complexity will be big, even though it's not big at all for 3 :) With n == 3, maximum possible sum is 300. The most probable - 150 (p(150)>=1/300). We want 100, which is not so far away from 150 and not much less probable. Maybe, when I'm cured of laziness, I'll even test it to see how it works for n = 5, 10 or 15! – Nikita Rybak Jun 09 '10 at 19:04
0

I had a similar issue and ended up doing as you said: generating each random integer with an upper bound of the limit minus the sum of the integers generated so far, then randomizing the order of the integers. It worked pretty well. That was for a genetic algorithm.

Caleb Hearth
0

Imagine you have 100 stones and N buckets to place them in. You can take all 100 and place each one in a random bucket. This way the total will be the 100 you started with and there will be no bias between the buckets.

public static int[] randomBuckets(int total, int n_buckets) {
    int[] buckets = new int[n_buckets];
    Random rand = new Random();
    for(int i=0;i<total;i++)
        buckets[rand.nextInt(n_buckets)]++;
    return buckets;
}

public static void main(String... args) {
    for(int i=2; i<=10;i++)
        System.out.println(Arrays.toString(randomBuckets(100, i)));
}

Prints

[55, 45]
[38, 34, 28]
[22, 21, 32, 25]
[28, 24, 18, 15, 15]
[17, 14, 13, 21, 18, 17]
[17, 19, 14, 15, 6, 15, 14]
[11, 14, 14, 14, 4, 17, 9, 17]
[13, 12, 15, 12, 8, 10, 9, 11, 10]
[11, 13, 12, 6, 6, 11, 13, 3, 15, 10]

As the count increases, the distribution approaches uniform.

System.out.println(Arrays.toString(randomBuckets(100000000, 100)));

Prints

[1000076, 1000612, 999600, 999480, 998226, 998303, 1000528, 1000450, 999529, 
998480, 998903, 1002685, 999230, 1000631, 1001171, 997757, 1000349, 1000527, 
1002408, 1000852, 1000450, 999318, 999453, 1000099, 1000759, 1000426, 999404, 
1000758, 1000939, 999950, 1000493, 1001396, 1001007, 999258, 1001709, 1000593,
1000614, 1000667, 1000168, 999448, 999350, 1000479, 999991, 999778, 1000513, 
998812, 1001295, 999314, 1000738, 1000211, 999855, 999349, 999842, 999635, 
999301, 1001707, 998224, 1000577, 999405, 998760, 1000036, 1000110, 1002471, 
1000234, 1000975, 998688, 999434, 999660, 1001741, 999834, 998855, 1001009, 
999523, 1000207, 998885, 999598, 998375, 1000319, 1000660, 1001727, 1000546, 
1000438, 999815, 998121, 1001128, 1000191, 998609, 998535, 999617, 1001895, 
999230, 998968, 999844, 999392, 999669, 999407, 998380, 1000732, 998778, 1000522]
Peter Lawrey
  • By the central limit theorem this will tend to a normal distribution about the centre (total/2) as n_buckets->infinity. I think what the OP wants is to have each possible combination equally likely (i.e a uniform distribution on the set of all n integers adding up to 100). – Il-Bhima Jun 09 '10 at 20:28
  • @Il-Bhima Every bucket has an equal chance of being incremented, why would you suspect anything other than an equally flat distribution? – Peter Lawrey Jun 10 '10 at 19:57
  • @Kevin The result is not completely uniform, but the difference between the lowest and highest in the last example is much less than 1%. – Peter Lawrey Jun 10 '10 at 20:07