2

I'm writing a short program to approximate the definite integral of the gaussian function f(x) = exp(-x^2/2), and my codes are as follows:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

double gaussian(double x) {
    return exp((-pow(x,2))/2);
}

int main(void) {
    srand(0);
    double valIntegral, yReal = 0, xRand, yRand, yBound;
    int xMin, xMax, numTrials, countY = 0;

    do {
        printf("Please enter the number of trials (n): ");
        scanf("%d", &numTrials);
        if (numTrials < 1) {
            printf("Exiting.\n");
            return 0;
        }  
        printf("Enter the interval of integration (a b): ");
        scanf("%d %d", &xMin, &xMax);      
        while (xMin > xMax) { //keeps looping until a valid interval is entered
            printf("Invalid interval!\n");
            printf("Enter the interval of integration (a b): ");
            scanf("%d %d", &xMin, &xMax);
        }
        //check real y upper bound
        if (gaussian((double)xMax) > gaussian((double)xMin))
            yBound = gaussian((double)xMax);
        else 
            yBound = gaussian((double)xMin);
        for (int i = 0; i < numTrials; i++) {
            xRand = (rand()% ((xMax-xMin)*1000 + 1))/1000.00 + xMin; //generate random x value between xMin and xMax to 3 decimal places             
            yRand = (rand()% (int)(yBound*1000 + 1))/1000.00; //generate random y value between 0 and yBound to 3 decimal places
            yReal = gaussian(xRand);
            if (yRand < yReal) 
                countY++;
        }
        valIntegral = (xMax-xMin)*((double)countY/numTrials);
        printf("Integral of exp(-x^2/2) on [%.3lf, %.3lf] with n = %d trials is: %.3lf\n\n", (double)xMin, (double)xMax, numTrials, valIntegral);

        countY = 0; //reset countY to 0 for the next run
    } while (numTrials >= 1);

    return 0;
}

However, the outputs from my code doesn't match the solutions. I tried to debug and print out all xRand, yRand and yReal values for 100 trials (and checked yReal value with particular xRand values with Matlab, in case I had any typos), and those values didn't seem to be out of range in any way... I don't know where my mistake is.

The correct output for # of trials = 100 on [0, 1] is 0.810, and mine is 0.880; correct output for # of trials = 50 on [-1, 0] is 0.900, and mine was 0.940. Can anyone find where I did wrong? Thanks a lot.

Another question is, I can't find a reference to the use of following code:

double randomNumber = rand() / (double) RAND MAX;

but it was provided by the instructor and he said it would generate a random number from 0 to 1. Why did he use '/' instead of '%' after "rand()"?

gouhaha
  • 137
  • 7
  • 2
    `rand()` is a poor random number generator, and you might want to add an option to run trials with different seed options to `srand()`. Also using: `rand() % ...` introduces a bias to the random value. Consider using a regular dice to find values from `1` .. `4` with `dice() % 4 + 1`. You will not get a random distribution - `2` and `3` will occur more frequently. – Brett Hale May 28 '17 at 14:01

2 Answers2

9

There's a few logical errors / discussion points in your code, both mathematics and programming-wise.

First of all, just to get it out of the way, we're talking about the standard gaussian here, i.e.

[latex] $$ \mathcal {N} (x ~|~ \mu = 0, \sigma^2 = 1) ~=~ \frac {1} {\sqrt {2 \pi}} \exp \Biggl \{ - \frac {x^2} {2} \Biggr \} $$ [/latex]

except, the definition of the gaussian on line 6, omits the [latex] $$ \sqrt{2\pi} $$ [/latex] normalising term. Given the outputs you seem to expect, this seems to have been done on purpose. Fair enough. But if you wanted to calculate the actual integral, such that a practically infinite range (e.g. [-1000, 1000]) would sum up to 1, then you would need that term.


Is my code logically correct?

No. Your code has two logical errors: one on line 29 (i.e. your if statement), and one on line 40 (i.e. the calculation of valIntegral), which is a direct consequence of the first logical error.

For the first error, consider the following plot to see why:

Your Monte Carlo process effectively considers a bounded box over a certain range, and then says "I will randomly place points inside this box, and then count the proportion of the total number of points that randomly fell under the curve; the integral estimate is then the area of the bounded box itself, times this proportion".

Now, if both [latex] $$ x \! _ { _ { m \! i \! n }} $$ [/latex] and [latex] $$ x \! _ { _ { m \! a \! x }} $$ [/latex] are to the left of the mean (i.e. 0), then your if statement correctly sets the box's upper bound (i.e. yBound) to [latex] $$ \mathcal {N} ( x \! _ { _ {m \! a \! x}} ) $$ [/latex] such that the topmost bound of the box contains the highest part of that curve. So, e.g., to estimate the integral for the range [-2,-1], you set the upper bound to [latex] $$ \mathcal {N} ( - 1 ) $$ [/latex] .

Similarly, if both [latex] $$ x \! _ { _ { m \! i \! n }} $$ [/latex] and [latex] $$ x \! _ { _ { m \! a \! x }} $$ [/latex] are to the right of the mean, then you correctly set yBound to [latex] $$ \mathcal {N} ( x \! _ { _ {m \! i \! n}} ) $$ [/latex]

However, if [latex] $$ x ! _ { _ {m ! i ! n}} < 0 < x ! _ { _ {m ! a ! x}} $$ [/latex] , you should be setting yBound to neither [latex] $$ x \! _ { _ { m \! i \! n }} $$ [/latex] nor [latex] $$ x \! _ { _ { m \! a \! x }} $$ [/latex] , since the 0 point is higher than both!. So in this case, your yBound should simply be at the peak of the Gaussian, i.e. [latex] $$ \mathcal {N} ( 0 ) $$ [/latex] (which in your case of an unnormalised Gaussian, this takes a value of '1').

Therefore, the correct if statement is as follows:

if (xMax < 0.0)
  { yBound = gaussian((double)xMax); }
else if (xMin > 0.0)
  { yBound = gaussian((double)xMin); }
else
  { yBound = gaussian(0.0); }

As for the second logical error, we already mentioned that the value of the integral is the "area of the bounding box" times the "proportion of successes". However, you seem to ignore the height of the box in your calculation. It is true that in the special case where [latex] $$ x ! _ { _ {m ! i ! n}} < 0 < x ! _ { _ {m ! a ! x}} $$ [/latex] , the height of your unnormalised Gaussian function defaults to '1', therefore this term can be omitted. I suspect that this is why it may have been missed. However, in the other two cases, the height of the bounding box is necessarily less than 1, and therefore needs to be included in the calculation. So the correct code for line 40 should be:

valIntegral = yBound * (xMax-xMin) * (((double)countY)/numTrials);

Why am I not getting the correct output?

Even despite the above logical errors, as we've discussed above, your output should have been correct for the specific intervals [0,1] and [-1,0] (since they include the mean and therefore the correct yBound of 1). So why are you still getting a 'wrong' output?

The answer is, you are not. Your output is "correct". Except, a Monte Carlo process involves randomness, and 100 trials is not a big enough number to lead to consistent results. If you run the same range for 100 trials again and again, you'll see you'll get very different results each time (though, overall, they'll be distributed around the right value). Run with 1000000 trials, and you'll see that the result becomes a lot more precise.


What's up with that randomNumber code?

The rand() function returns an integer in the range [0, RAND_MAX], where RAND_MAX is system-specific (have a look at man 3 rand).

The modulo approach (i.e. %) works as follows: consider the range [-0.1, 0.3]. This range spans 0.4 units. 0.4 * 1000 + 1 = 401. For a random number from 0 to RAND_MAX, doing rand() modulo 401 will always result in a random number in the range [0,400]. If you then divide this back by 1000, you get a random number in the range [0, 0.4]. Add this to your xmin offset (here: -0.1) and you get a random number in the range [-0.1, 0.3].

In theory, this makes sense. However, unfortunately, as already pointed out in the other answer here, as a method it is susceptible to modulo bias, because RAND_MAX isn't necessarily exactly divisible by 401, therefore the top part of that range leading up to RAND_MAX overrepresents some numbers compared to others.

By contrast, the approach given to you by your teacher is simply saying: divide the result of the rand() function with RAND_MAX. This effectively normalises the returned random number into the range [0,1]. This is a much more straightforward thing to do, and it avoids modulo bias.

Therefore, the way I would implement this would be to make it into a function:

double randomNumber(void) {
  return rand() / (double) RAND_MAX;
}

which then simplifies your computations as follows too:

xRand = randomNumber() * (xMax-xMin) + xMin;
yRand = randomNumber() * yBound;

You can see that this is a much more accurate thing to do, if you use a normalised gaussian, i.e.

double gaussian(double x) {
  return exp((-pow(x,2.0))/2.0) / sqrt(2.0 * M_PI);
}

and then compare the two methods. You will see that the randomNumber() method for an "effectively infinite" range (e.g. [-1000,1000]) gives the correct result of 1, whereas the modulo approach tends to give numbers that are larger than 1.

Tasos Papastylianou
  • 21,371
  • 2
  • 28
  • 57
  • Nice explanation. I would urge @Gouhaha to accept this as the answer. – John Coleman May 29 '17 at 01:37
  • Thank you so much for your detailed explanation! – gouhaha May 29 '17 at 22:42
  • I'd suggest one slight correction to an otherwise excellent answer. The bounding box technique works when `yBound` is >= the maximum in the range. Having a larger value than needed still yields answers with the correct expected value, but with larger variance than if max(y) for the range is used. – pjs May 30 '17 at 13:57
  • As an alternative when the bounding box is not "tight" throughout the range, you can get an even lower variance estimator by generating random `x` values in the specified range and estimating the average height of the resulting `gaussian(x)` values. Average height times interval width is an unbiased estimate of the area, and no fiddling around with determining upper bounds. – pjs May 30 '17 at 14:00
3

Your code has no obvious bug (though there is a bug in the upper bound calculation, as @TasosPapastylianou points out, though it isn't the issue in your test cases). On 100 trials, your answer of 0.880 is closer to the actual value of the integral (0.855624...) than 0.810, and neither of those numbers are so far from the true value to suggest an outright bug in the code. Seems to be within sampling error (though see below). Here is a histogram of 1000 runs of a Monte Carlo integration (done in R, but with the same algorithm) of e^(-x^2/2) on [0,1] with 100 trials:

enter image description here

Unless your instructor specified the algorithm and the seed in precise detail, you shouldn't expect the exact same answer.

As far as your second question about rand() / (double) RAND MAX: it is an attempt to avoid modulo bias. It is possible that such a bias is effecting your code (especially given the way you round to 3 decimal places), since it does seem to overestimate the integral (based on running it a dozen times or so). Perhaps you could use that in your code and see if you get better results.

John Coleman
  • 51,337
  • 7
  • 54
  • 119
  • 1
    and isn't it better to use `x*x` instead of `pow(x,2)` ? – Jean-François Fabre May 28 '17 at 12:51
  • @Jean-FrançoisFabre Good point. It would be more efficient to do so, but it is unlikely to be the issue here. – John Coleman May 28 '17 at 12:54
  • I disagree there's "no obvious bug" (run the range [-1,1] to see why). Of course there may be background to this exercise that I'm ignoring (e.g. the intervals [0,1] and [-1,0] are the only ones we care about), but I have no reason to think that from the question description alone. – Tasos Papastylianou May 29 '17 at 01:29
  • @TasosPapastylianou I never said that there was no bug, just that there was no obvious bug. You are correct that there is a bug with the upper bound calculation which shows up when the interval of integration straddles the origin. Since this didn't arise in the test cases I didn't pay too much attention to that part of the code. – John Coleman May 29 '17 at 01:40
  • Thanks; sorry, I didn't mean to sound belligerent. I was more wondering out loud if this is a known assignment whose specific assumptions I was ignoring. Thanks for clarifying! – Tasos Papastylianou May 29 '17 at 01:46
  • @TasosPapastylianou You didn't sound belligerent, you just sounded correct. I am embarrassed to have missed those bugs. – John Coleman May 29 '17 at 01:51