2

I struggle with this simple problem: I want to create some random poll numbers. I have 4 variables I need to fill with data (actually an array of integers). These numbers should represent random percentages, and all the percentages added together should be 100%. Sounds simple.

But I think it isn't that easy. My first attempt was to generate a random number between 10 and a base (base = 100) and subtract that number from the base. I did this 3 times, and the last value was assigned whatever remained of the base. Is there a more elegant way to do that?

My question in a few words:

How can I fill this array with random values that add up to 100?

int values[4];

LuMa

6 Answers

5

You need to write your code to emulate what you are simulating.

So if you have four choices, generate a sample's worth of random numbers in the range 0..3 (for example, a random value in [0, 1) scaled by 4 and truncated, so 4 itself won't be picked), and then tally up all the 0's, 1's, 2's, and 3's. Then divide the counts by the sample size.

for (each sample) {
   poll = random(choices);
   survey[poll] += 1;
}

It's easy to use a computer to simulate things, and simple simulations like this are very fast.

Keep in mind that you are working with integers, and integers don't divide nicely without converting them to floats or doubles. If you are missing a few percentage points, odds are it has to do with your integers dividing with remainders.
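For reference, a minimal C sketch of this simulation, folding in user295691's suggestion from the comments of using a sample size of exactly 100 so that the tallies are already percentages summing to 100 (the CHOICES/SAMPLES names and the srand seeding are illustrative choices, not part of the answer):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHOICES 4
#define SAMPLES 100   /* with exactly 100 samples the tallies are already percentages */

int main(void)
{
    int values[CHOICES] = {0};

    srand((unsigned)time(NULL));         /* seed the generator once */

    for (int i = 0; i < SAMPLES; i++) {
        int poll = rand() % CHOICES;     /* simulate one respondent picking a choice */
        values[poll] += 1;               /* tally that vote */
    }

    /* the tallies necessarily add up to SAMPLES, i.e. 100 */
    for (int i = 0; i < CHOICES; i++)
        printf("choice %d: %d%%\n", i, values[i]);

    return 0;
}

With a sample size other than 100 you would divide each tally by the sample size, which is exactly where the integer-division caveat above comes in.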

Edwin Buck
  • This is the easiest provably unbiased solution; unless performance is an issue, this is the one that makes the most sense. In C, this would be more like `for (int i = 0; i < 100; i++) { values[rand() % 4]++; }` – user295691 Jan 26 '16 at 20:46
  • Wow exactly what I searched for. Actually quite simple :) – LuMa Jan 26 '16 at 21:37
  • "unbiased" here is relative. This algorithm will produce a random sample of an equally distributed population. However, few populations are equally distributed. A distribution of, say, 40-30-15-15 is a reasonably likely polling result in the real world, but the probability of generating a partition of size 40 with this method is so vanishingly small as to be essentially unobservable. – rici Jan 26 '16 at 22:26
  • @rici True, and to simulate such a sampling, you need a distribution of the results. In such a case, you then determine "cut offs" of a distribution area by the sum of it and the areas already processed, then you pick a random number between it and the population's limit. But, if you had the distribution, then you wouldn't need to simulate the sampling to get the distribution. In this case, I think it is more of an "exercise" than a real world problem being solved. – Edwin Buck Jan 26 '16 at 22:54
  • @EdwinBuck: You might want to do the sampling anyway if you were trying to illustrate sampling theory. A classic Monte Carlo experiment is estimating confidence limits for distribution of parliamentary seats in a proportional system like d'Hondt, given polling results as a proxy for the true voting distribution. It's straight-forward to compute the confidence range for the raw sample numbers, but an analytic solution for the final seat distribution is difficult because of the discontinuities, and the experiment will rapidly produce the probabilities within the envelope. – rici Jan 26 '16 at 23:31
  • ... but there is also the problem of generating test cases for an application which does some sort of presentation or analysis of the partition; for that use case, you might want an unbiased sample of partitions from the universe of possible partitions. – rici Jan 26 '16 at 23:32
  • Since the code posted is not C, as the post is tagged, I recommend identifying it as such so as not to confuse C learners. Better yet, incorporate [@user295691's comment](http://stackoverflow.com/questions/35022824/generating-random-poll-numbers/35023096#comment57773681_35023008) – chux - Reinstate Monica Jan 27 '16 at 18:06
2

What you have here is the problem of splitting the number 100 into 4 random integers. In number theory, this is called partitioning.
This problem has been addressed here. The solution presented there does essentially the following:
It computes how many partitions of an integer n there are, in O(n^2) time. This produces a table of size O(n^2), which can then be used to generate the kth partition of n, for any integer k, in O(n) time.
In your case, n = 100, and k = 4.
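The linked solution isn't reproduced in the answer, but a rough C sketch of the same idea (build a table counting the partitions of n into at most k parts, then unrank a randomly chosen index) might look like the following; the particular recurrence, the unrank helper, and the rand()-based index draw are assumptions for illustration, not taken from the linked code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 100   /* the total being partitioned */
#define K 4     /* at most K parts (zero-sized parts allowed) */

/* count[n][k] = number of partitions of n into at most k parts */
static long long count[N + 1][K + 1];

static void build_table(void)
{
    for (int k = 0; k <= K; k++)
        count[0][k] = 1;                  /* the empty partition */
    for (int n = 1; n <= N; n++) {
        count[n][0] = 0;                  /* no parts available, nothing fits */
        for (int k = 1; k <= K; k++)
            count[n][k] = count[n][k - 1]
                        + (n >= k ? count[n - k][k] : 0);
    }
}

/* Write the partition with index r (0 <= r < count[n][k]) into out[0..k-1].
 * A partition either uses at most k-1 parts (first branch), or all k of its
 * parts are at least 1, in which case it is a partition of n-k into at most
 * k parts with 1 added to every part (second branch). */
static void unrank(int n, int k, long long r, int *out)
{
    if (k == 0)
        return;
    if (r < count[n][k - 1]) {
        unrank(n, k - 1, r, out);
        out[k - 1] = 0;
    } else {
        unrank(n - k, k, r - count[n][k - 1], out);
        for (int i = 0; i < k; i++)
            out[i] += 1;
    }
}

int main(void)
{
    int values[K];

    build_table();
    srand((unsigned)time(NULL));

    /* count[N][K] is only a few thousand here, so rand() can index it;
     * a larger n would need a wider random index (and rand() % m has a
     * slight modulo bias in any case) */
    long long r = rand() % count[N][K];
    unrank(N, K, r, values);

    printf("partition %lld of %lld: %d %d %d %d\n",
           r, count[N][K], values[0], values[1], values[2], values[3]);
    return 0;
}

Note that this produces an unordered partition (the parts come out in non-increasing order), so a random shuffle would still be needed if the four poll options are meant to be distinguishable.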

Reinhard Männer
1

Generate x1 in the range <0..1>, subtract it from 1, then generate x2 in the range <0..1-x1>, and so on. The last value should not be random; in your case it should simply equal 1 - x1 - x2 - x3.
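A minimal C sketch of this approach, using doubles in the range 0..1 and printing percentages at the end (the rand()/RAND_MAX construction is an illustrative choice, not part of the answer):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    double x[4];
    double remaining = 1.0;

    srand((unsigned)time(NULL));

    for (int i = 0; i < 3; i++) {
        /* random fraction of whatever is still left */
        x[i] = ((double)rand() / RAND_MAX) * remaining;
        remaining -= x[i];
    }
    x[3] = remaining;                    /* last value is 1 - x1 - x2 - x3, not random */

    for (int i = 0; i < 4; i++)
        printf("%.1f%%\n", x[i] * 100.0);

    return 0;
}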

jakubkrol
1

I don't think this is a whole lot prettier than what it sounds like you've already done, but it does work. (The only advantage is it's scalable if you want more than 4 elements).

Make sure you #include <stdlib.h>

int prev_sum = 0, j = 0;
for(j = 0; j < 3; ++j)
{
    values[j] = rand() % (100-prev_sum);  /* random share of whatever is left */
    prev_sum += values[j];
}
values[3] = 100 - prev_sum;               /* the last element takes the remainder */
Leejay Schmidt
1

It takes some work to get a truly unbiased solution to the "random partition" problem. But it's first necessary to understand what "unbiased" means in this context.

One line of reasoning is based on the intuition of a random coin toss. An unbiased coin will come up heads as often as it comes up tails, so we might think that we could produce an unbiased partition of 100 tosses into two parts (head-count and tail-count) by tossing the unbiased coin 100 times and counting. That's the essence of Edwin Buck's proposal, modified to produce a four-partition instead of a two-partition.

However, what we'll find is that many partitions never show up. There are 101 two-partitions of 100 -- {0, 100}, {1, 99} … {100, 0} but the coin sampling solution finds less than half of them in 10,000 tries. As might be expected, the partition {50, 50} is the most common (7.8%), while all of the partitions from {0, 100} to {39, 61} in total achieved less than 1.7% (and, in the trial I did, the partitions from {0, 100} to {31, 69} didn't show up at all.) [Note 1]

So that doesn't seem like an unbiased sample of possible partitions. An unbiased sample of partitions would return every partition with equal probability.

So another temptation would be to select the size of the first part of the partition from all the possible sizes, and then the size of the second part from whatever is left, and so on until we've reached one less than the size of the partition at which point anything left is in the last part. However, this will turn out to be biased as well, because the first part is much more likely to be large than any other part.

Finally, we could enumerate all the possible partitions, and then choose one of them at random. That will obviously be unbiased, but unfortunately there are a lot of possible partitions. For the case of 4-partitions of 100, for example, there are 176,851 possibilities. Perhaps that is feasible in this case, but it doesn't seem like it will lead to a general solution.

For a better algorithm, we can start with the observation that a partition

{p1, p2, p3, p4}

could be rewritten without bias as a cumulative distribution function (CDF):

{p1, p1+p2, p1+p2+p3, p1+p2+p3+p4}

where the last term is just the desired sum, in this case 100.

That is still a collection of four integers in the range [0, 100]; however, it is guaranteed to be in increasing order.

It's not easy to generate a random sorted sequence of four numbers ending in 100, but it is trivial to generate three random integers no greater than 100, sort them, and then find adjacent differences. And that leads to an almost unbiased solution, which is probably close enough for most practical purposes, particularly since the implementation is almost trivial:

(Python)

from random import randrange

def random_partition(n, k):
  d = sorted(randrange(n+1) for i in range(k-1))
  return [b - a for a, b in zip([0] + d, d + [n])]
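Since the question is tagged C, a rough C equivalent of the same sort-and-difference idea might look like the sketch below; the compare_ints helper, the qsort call, and the rand()-based draws are illustrative choices, and the bias discussion that follows applies to this version exactly as it does to the Python one.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int compare_ints(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Fill parts[0..k-1] with k non-negative integers that sum to n. */
static void random_partition(int n, int k, int *parts)
{
    int cuts[k + 1];                     /* C99 VLA: k-1 random cut points plus the ends 0 and n */

    cuts[0] = 0;
    for (int i = 1; i < k; i++)
        cuts[i] = rand() % (n + 1);      /* random integer in [0, n] */
    cuts[k] = n;

    qsort(cuts + 1, k - 1, sizeof cuts[0], compare_ints);   /* sort the interior cut points */

    for (int i = 0; i < k; i++)
        parts[i] = cuts[i + 1] - cuts[i];                    /* adjacent differences */
}

int main(void)
{
    int values[4];

    srand((unsigned)time(NULL));
    random_partition(100, 4, values);
    printf("%d %d %d %d\n", values[0], values[1], values[2], values[3]);
    return 0;
}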

Unfortunately, this is still biased because of the sort. The unsorted list is selected without bias from the universe of possible lists, but the sorting step is not a simple one-to-one mapping: lists with repeated elements have fewer permutations than lists without repeated elements, so the probability of a particular sorted list without repeats is much higher than the probability of a sorted list with repeats.

As n grows large with respect to k, the number of lists with repeats declines rapidly. (These correspond to final partitions in which one or more of the parts is 0.) In the asymptote, where we are selecting from a continuum and collisions have probability 0, the algorithm is unbiased. Even in the case of n=100, k=4, the bias is probably ignorable for many practical applications. Increasing n to 1000 or 10000 (and then scaling the resulting random partition) would reduce the bias.

There are algorithms which can produce unbiased integer partitions, but they are typically either hard to understand or slow. The slow one, which takes O(n) time, is similar to reservoir sampling; for a faster algorithm, see the work of Jeffrey Vitter.


Notes

  1. Here's the quick-and-dirty Python + shell test:

    $ python -c '
    from random import randrange
    n = 2
    for i in range(10000):
      d = n * [0]
      for j in range(100):
        d[randrange(n)] += 1
      print(" ".join(str(f) for f in d))
    ' | sort -n | uniq -c
    
      1 32 68
      2 34 66
      5 35 65
     15 36 64
     45 37 63
     40 38 62
     66 39 61
    110 40 60
    154 41 59
    219 42 58
    309 43 57
    385 44 56
    462 45 55
    610 46 54
    648 47 53
    717 48 52
    749 49 51
    779 50 50
    788 51 49
    723 52 48
    695 53 47
    591 54 46
    498 55 45
    366 56 44
    318 57 43
    234 58 42
    174 59 41
    118 60 40
     66 61 39
     45 62 38
     22 63 37
     21 64 36
     15 65 35
      2 66 34
      4 67 33
      2 68 32
      1 70 30
      1 71 29
    
rici
-2

You can brute force it by creating a function that adds up the numbers in your array. If they do not equal 100, regenerate the random values in the array and do the calculation again.
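A minimal C sketch of that brute-force idea, assuming each value is drawn from 0..100 and the whole set is regenerated until the sum is exactly 100 (on average this takes a few hundred attempts):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int values[4];
    int sum;

    srand((unsigned)time(NULL));

    do {
        sum = 0;
        for (int i = 0; i < 4; i++) {
            values[i] = rand() % 101;   /* random value in [0, 100] */
            sum += values[i];
        }
    } while (sum != 100);               /* reject and retry until the sum is exactly 100 */

    printf("%d %d %d %d\n", values[0], values[1], values[2], values[3]);
    return 0;
}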

camel-man