
I'm basically stuck on a rather simple problem:

Toss N coins and find out how many of them land heads.

The solution's performance must not depend on N, so we can't just call Math.random() < 0.5 N times. Obviously, the Gaussian distribution comes to the rescue.

I used the Box-Muller method (Marsaglia's polar variant) for it:

function gaussian_random(mean, variance) {
  var s;
  var x;
  var y;
  // Pick (x, y) uniformly in the unit disk, rejecting points outside it.
  do {
    x = Math.random() * 2.0 - 1.0;
    y = Math.random() * 2.0 - 1.0;
    s = x * x + y * y;
  } while (s > 1 || s === 0);

  // Transform to a standard normal deviate, then scale and shift.
  var gaussian = x * Math.sqrt(-2 * Math.log(s) / s);
  return mean + gaussian * Math.sqrt(variance);
}

Math says that the mean of N fair coin tosses is N/2 and the variance is N/4.
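For concreteness, here is how a single O(1) trial looks when the sampler is fed those parameters (the helper name `approximateHeads` is mine, not from the original source):

```javascript
// gaussian_random as defined above (polar Box-Muller).
function gaussian_random(mean, variance) {
  var s, x, y;
  do {
    x = Math.random() * 2 - 1;
    y = Math.random() * 2 - 1;
    s = x * x + y * y;
  } while (s > 1 || s === 0);
  return mean + x * Math.sqrt(-2 * Math.log(s) / s) * Math.sqrt(variance);
}

// One O(1) trial: approximate number of heads among N fair coin tosses.
// Note: the result is a real number and is NOT clamped to [0, N].
function approximateHeads(N) {
  return gaussian_random(N / 2, N / 4);
}
```

Note that the result is unbounded in principle, which is exactly what surfaces at small N below.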

Then I made a test which tosses N coins M times and reports the minimum, maximum, and average number of heads.

I compared the results of the naive approach (calling Math.random() many times) and the Gaussian Box-Muller approach.
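A minimal sketch of such a harness (the names `naiveHeads` and `runTrials` are mine, not from the linked source):

```javascript
// Naive O(N) method: literally toss N fair coins.
function naiveHeads(N) {
  var heads = 0;
  for (var i = 0; i < N; i++) {
    if (Math.random() < 0.5) heads++;
  }
  return heads;
}

// Run M trials of a sampler and report minimum, maximum, and average.
function runTrials(sampler, N, M) {
  var min = Infinity, max = -Infinity, sum = 0;
  for (var i = 0; i < M; i++) {
    var h = sampler(N);
    if (h < min) min = h;
    if (h > max) max = h;
    sum += h;
  }
  return { min: min, max: max, average: sum / M };
}
```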

Here is typical output of the tests:

Toss 1000 coins, 10000 times
Straight method: 
Elapsed time: 127.330 ms
Minimum: 434
Maximum: 558
Average: 500.0306
Box-Muller method: 
Elapsed time: 2.575 ms
Minimum: 438.0112461962819
Maximum: 562.9739632480057
Average: 499.96195358695064

Toss 10 coins, 10000 times
Straight method: 
Elapsed time: 2.100 ms
Minimum: 0
Maximum: 10
Average: 5.024
Box-Muller method: 
Elapsed time: 2.270 ms
Minimum: -1.1728354576573263
Maximum: 11.169478925333504
Average: 5.010078819562535

As we can see, at N = 1000 it fits almost perfectly.

BUT at N = 10 Box-Muller goes crazy: it allows test outcomes where I can (quite rarely, but possibly) get 11.17 heads from 10 coin tosses! :)

No doubt I am doing something wrong. But what exactly?

Here is the source of the test, and a link to launch it.

Update x2: it seems I didn't describe the problem well previously. Here is the general version:

How do you get the sample mean of N uniform random values (either discrete or continuous) in amortized constant time? The Gaussian approximation is efficient for large N, but is there a way to make it work well for small N? Or, at exactly which N should the solution switch from the Gaussian method to some other one (for example, the straightforward one)?
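For the continuous case, the central limit theorem gives the same shortcut: the sample mean of N uniform [0, 1) values has mean 1/2 and variance 1/(12N), since a single uniform value has variance 1/12. So, for large N, one Gaussian draw approximates the sample mean in O(1). A sketch under that assumption (helper names are mine):

```javascript
// Standard normal deviate via the polar method.
function standardNormal() {
  var s, x, y;
  do {
    x = Math.random() * 2 - 1;
    y = Math.random() * 2 - 1;
    s = x * x + y * y;
  } while (s > 1 || s === 0);
  return x * Math.sqrt(-2 * Math.log(s) / s);
}

// CLT approximation of the sample mean of N uniform [0, 1) values:
// E = 1/2, Var = (1/12) / N. Only reasonable for sufficiently large N.
function approxUniformSampleMean(N) {
  return 0.5 + standardNormal() * Math.sqrt(1 / (12 * N));
}
```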

augur
    I'd say you need a [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) with `p = 0.5`, not a Gaussian. If N is large enough, a normal is a reasonable approximation of a binomial distribution. But certainly not for `N = 10`. – Stefan Zobel Sep 30 '16 at 13:29
  • From the Box-Muller point of view there is nothing crazy. It has no notion of coins and produces a "spread" computed from the parameters you've provided. At low sample counts this spread breaks the boundaries of your sampling, because the formula knows nothing about them. – Sergey Grinev Sep 30 '16 at 14:23
  • @StefanZobel in truth, the `p = 0.5` case is planned as a first step; next I want to solve the problem for any `p` and `N`. That's why the binomial distribution doesn't fit; I am seeking a general solution. @SergeyGrinev I understand that Box-Muller does exactly what it is designed to do. However, its limitations were surprising. – augur Sep 30 '16 at 15:58
    But @pjs already [answered](http://stackoverflow.com/questions/39786686/dealing-with-normal-gaussian-distribution/39794806#39794806) your question. The binomial distribution is the correct one, regardless of the `p` you choose. – Stefan Zobel Sep 30 '16 at 16:17
    And if `p` itself is a random variable then you should consider the [beta-binomial](https://en.wikipedia.org/wiki/Beta-binomial_distribution) distribution. – Stefan Zobel Sep 30 '16 at 16:32

2 Answers


Math says that the mean of N coin tosses is N/2 and the variance is N/4.

Math only says that if it's a fair coin. And there's no way the solution doesn't depend on N.

Assuming all tosses are independent of each other, for exact behavior use a binomial distribution rather than a normal distribution. The binomial has two parameters: N, the number of coin tosses, and p, the probability of getting heads (or tails, if you prefer). In JavaScript:

function binomial(n, p) {
  var successes = 0;
  for (var counter = 0; counter < n; counter++) {
    // Math.random() is uniform on [0, 1), so this is true with probability p.
    if (Math.random() < p) {
      successes++;
    }
  }
  return successes;
}

There are faster algorithms for large N, but this is straightforward and mathematically correct.

pjs
  • I understand that the original problem belongs to the binomial distribution. Maybe it was my mistake to reduce the problem to discrete coin flipping. Here is the more general problem I planned to solve by dealing with coin flipping first: **how to get the mean value of N rolls of a uniform random value in [0, 1)**, without actually generating N random values. The straightforward solution's performance is O(N), which I am trying to avoid. – augur Sep 30 '16 at 16:15
    The best way to get answers is to ask the actual question you want the answer to. If you want the actual mean of N independent and identically distributed occurrences of *any* distribution it's identical to the mean of an occurrence. The *sample* mean a.k.a. average, on the other hand, has the same expected value but has variability, so trials will yield different outcomes but the long-run average will converge to the mean. – pjs Sep 30 '16 at 16:30
    If you want to generate empirical trials and don't like O(N), see http://stackoverflow.com/questions/23561551/a-efficient-binomial-random-number-generator-code-in-java/23574723#23574723 for an O(Np) algorithm. – pjs Sep 30 '16 at 16:50
  • Sorry for the misleading description. What I want is the average (*sample* mean) of N random values (a single trial). And since the sum of many uniform random values approaches a Gaussian distribution, I thought it reasonable to utilize this feature. It works in amortized constant time, completely independent of N. But there are limitations at low N. I hope somebody can clarify where exactly they start. – augur Sep 30 '16 at 17:28
    @augur Except your assumption about the Gaussian is wrong. The Gaussian is a decent approximation to the binomial distribution for large sample sizes (and with what's known as a "continuity correction" if you want probability of the outcomes falling in a range). How large a sample size is required? Opinions differ, but a reasonable rule of thumb is that both `n*p` and `n*(1-p)` should be at least 10 to prevent too much impact the edge effects of approximating a bounded range distribution using an infinite range one. – pjs Sep 30 '16 at 17:47
  • In your specific case with N = 10 and p = 0.5, you're well below that "safe" threshold. – pjs Sep 30 '16 at 17:47
  • I meant independent of N *growth*. http://stackoverflow.com/questions/23561551/a-efficient-binomial-random-number-generator-code-in-java/23574723#23574723 seems to work in amortized constant time, but can it handle random values from a continuous distribution? – augur Sep 30 '16 at 17:50
  • If I understand it right, https://en.wikipedia.org/wiki/Central_limit_theorem says that the Gaussian is a large-sample approximation not only to the binomial, but to all distributions that provide independent and identically distributed values. In fact, an approximation is acceptable for me, because in the current environment the estimated sample sizes are up to 10^7. In addition, this problem interests me, so I am in search of a universal solution too. – augur Sep 30 '16 at 18:04
    @augur You were asking about the "crazy" values you got for N = 10. There are two comments here that told you that the normal is **not** a good approximation for N = 10 (assuming p = 0.5). Now you are saying that you are ok with an (any?) approximation. What's your question? – apophis Sep 30 '16 at 18:23
  • @apophis I meant that an approximation is OK because my particular problem operates on large sample sizes. But I also seek a general, universal solution which will work well at various sample sizes. I updated the topic post and tried to clarify the question. – augur Sep 30 '16 at 18:34
    @augur This is known as "moving the goal posts." Not cool! The correct thing to do is post a different question. – pjs Sep 30 '16 at 18:40
    @augur See [here](https://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation) (that is what @pjs was talking about). If `n*p` and `n*(1-p)` are large enough go with the normal, otherwise use the binomial. – Stefan Zobel Sep 30 '16 at 18:41
    @augur I'm not sure that you do actually understand the CLT. It talks about the distributional behavior of sums of random variables when the quantity being summed gets large. It works for binomials because the binomial value is the sum of all the Bernoulli (0/1) trials. It works for sample means (if you have a sufficiently large sample) because you sum the outcomes and then scale the result by the sample size. It doesn't apply to anything other than sums, there have to be enough values being summed, and the rate of the sum's convergence to Gaussian depends on a myriad of factors. – pjs Sep 30 '16 at 18:51
  • My bad. However, our conversation was still quite helpful, thanks. @StefanZobel going to use it! – augur Sep 30 '16 at 18:52
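The O(Np) algorithm from the answer linked in the comments can be sketched like this (the geometric-gap / waiting-time form; `binomialFast` is my name for it, and it assumes 0 < p < 1):

```javascript
// Expected O(n*p) binomial sampler: instead of testing every toss,
// jump over geometrically distributed runs of failures between successes.
function binomialFast(n, p) {
  var logQ = Math.log(1 - p); // assumes 0 < p < 1
  var x = 0;
  var sum = 0;
  for (;;) {
    // Each term is a scaled exponential deviate (a "waiting time").
    sum += Math.log(Math.random()) / (n - x);
    if (sum < logQ) return x;
    x++;
  }
}
```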

Based on what was discussed in the accepted answer, I came up with this particular solution.

There is a rule of thumb, n*p >= 10 and n*(1-p) >= 10, but let's define a stricter one.

First of all, the Box-Muller output will be hard-capped at [-6, 6] standard deviations, to ensure a sane outcome (640 kB ought..., I mean, six sigmas ought to be enough for everybody).

Then, using that constant 6, we require that for Box-Muller to produce valid results, six standard deviations must not exceed the mean: Math.sqrt(variance) * 6 <= mean. Substituting N/2 for the mean and N/4 for the variance, and simplifying:

Math.sqrt(N / 4) * 6 <= N / 2

Math.sqrt(N) * 6 <= N

N >= 36

While this condition holds, we can safely use the capped Box-Muller Gaussian. At any lower sample size, stick to the binomial solution.

I checked this rule statistically: over a relatively large number (10 million) of tests, the minimum value stops falling outside the boundaries at sample sizes of 32 and above.
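Putting the rule together, a sketch of the dispatcher (the names are mine; the fallback is the straightforward O(N) binomial from the accepted answer):

```javascript
// Polar Box-Muller standard normal, hard-capped to [-6, 6] sigmas.
function cappedStandardNormal() {
  var s, x, y;
  do {
    x = Math.random() * 2 - 1;
    y = Math.random() * 2 - 1;
    s = x * x + y * y;
  } while (s > 1 || s === 0);
  var g = x * Math.sqrt(-2 * Math.log(s) / s);
  return Math.max(-6, Math.min(6, g));
}

// Straightforward O(N) binomial fallback.
function binomialDirect(n, p) {
  var successes = 0;
  for (var i = 0; i < n; i++) {
    if (Math.random() < p) successes++;
  }
  return successes;
}

// Heads in N fair coin tosses: capped Gaussian approximation for N >= 36
// (the cap plus N >= 36 keeps the rounded result inside [0, N]),
// exact binomial below that threshold.
function tossHeads(N) {
  if (N >= 36) {
    return Math.round(N / 2 + cappedStandardNormal() * Math.sqrt(N / 4));
  }
  return binomialDirect(N, 0.5);
}
```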

augur
  • Committed [mentioned changes](https://github.com/augur/augur.github.io/commit/c7c7c794fa37ee3f0a534159cc4e056358c0e8db) to the test example. – augur Oct 01 '16 at 07:10