Correct OpenMP pragmas for pi monte carlo in C with not thread-safe random number generator

Question

I need some help to parallelize the pi calculation with the monte carlo method with openmp by a given random number generator, which is not thread safe.

First: This SO thread didn't help me.

My own try is the following #pragma omp statements. I thought the i, x and y vars should be init by each thread and should than be private. z ist the sum of all hits in the circle, so it should be summed after the implied barriere after the for loop.

Think the main problem ist the static state var of the random number generator. I made a critical section where the functions are called, so that only one thread per time could execute it. But the Pi solutions doesn't scale with more higher values.

Note: I should not use another RNG, but its okay to make little changes on it.

int main (int argc, char *argv[]) {

    int i, z = 0, threads = 8, iters = 100000;
    double x,y, pi;

    #pragma omp parallel firstprivate(i,x,y) reduction(+:z) num_threads(threads)
        for (i=0; i<iters; ++i) {
            #pragma omp critical
            {
                x = rng_doub(1.0);
                y = rng_doub(1.0);
            }
            if ((x*x+y*y) <= 1.0)
                z++;
        }

    pi = ((double) z / (double) (iters*threads))*4.0;
    printf("Pi: %lf\n", pi);;
    return 0;
}

This RNG is actually an included file, but as I'm not sure if I create the header file correct, I integrated it in the other program file, so I have only one .c file.

#define RNG_MOD 741025

int rng_int(void) {
    static int state = 0;

    return (state = (1366 * state + 150889) % RNG_MOD);
}

double rng_doub(double range) {
    return ((double) rng_int()) / (double) ((RNG_MOD - 1)/range);
}

I've also tried to make the static int state global, but it doesn't change my result, maybe I done it wrong. So please could you help me make the correct changes? Thank you very much!

Could you please clarify the meaning of the words: _"But the Pi solutions doesn't scale with more higher values."_ Apart from the fact that the critical section serialises your threads and you won't get any speed-up, the code looks correct to me. — Hristo Iliev, Dec 08 '13 at 14:15
Yes, sure. I mean that the calculated pi should be closer to the real pi value, if I run with more iterations. But with this random number generator i can't see this behaviour in general. And the docent says its because of the thread-unsafty of the state var. I should set it global and use one or more correct #pragma omp statements to handle it. But i've tried a lot of combinations and there changes nothing. Don't know, if I take the state var global, than as static or not? And is critical right at this spot? Need state to be shared()? Or better threadprivate(state)? Have already tried a lot. — Danny Archer, Dec 08 '13 at 14:29
The OpenMP construct that your docent probably refers to is `threadprivate`. See my semi answer below for an explanation why it won't improve the solution a lot. — Hristo Iliev, Dec 09 '13 at 05:31
@DannyArcher, I updated my answer using some of Hristo's suggestions. — Z boson, Dec 09 '13 at 12:57

Hristo Iliev · Answer 1 · 2013-12-09T08:56:06.567

Your original linear congruent PRNG has a cycle length of 49400, therefore you are only getting 29700 unique test points. This is a terrible generator to be used for any kind of Monte Carlo simulations. Even if you make 100000000 trials, you won't get any closer to the true value of Pi because you are simply repeating the same points over and over again and as a result both the final value of z and iters are simply multiplied by the same constant, which cancel in the end during the division.

The per-thread seed introduced by Z boson improves the situation a little bit with the number of unique points increasing with the total number of OpenMP threads. The increase is not linear since if the seed of one PRNG falls in the sequence of another PRNG, both PRNGs produce the same sequence shifted with no more than 49400 elements. Given the cycle length, each PRNG covers 49400/RNG_MOD = 6,7% of the total output range and that is the probability of two PRNGs being synchronised. There are a total of RNG_MOD/49400 = 15 unique sequences possible. It basically means that in the best seeding case scenario you won't be able to get past 30 threads as any other thread would simply repeat the result of some of the others. The multiplier 2 comes from the fact that each point uses two elements from the sequence and therefore it is possible to get a different set of points if you shift the sequence by one element.

The ultimate solution is to completely drop your PRNG and stick to something like Mersenne twister MT19937, which has a cycle length of 2¹⁹⁹³⁷ − 1 and a very strong seeding algorithm. If you are not able to use another PRNG as you state in your question, at least modify the constants of the LCG to match those used in rand():

int rng_int(void) {
   static int state = 1;

   // & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
   return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}

Note that rand() is not a good PRNG - it is still bad. It is just a little better than the one used in your code.

This is excellent information Hristo (+1). This explains why even with `rand_r` it gives a bad result. I would say it's your fault since I just copied you [openmp-program-is-slower-than-sequential-one](http://stackoverflow.com/questions/10624755/openmp-program-is-slower-than-sequential-one) but then I see in another answer you write for `rand_r` "note that this is a very bad generator for Monte Carlo simulations, erand48 is better and the best would be to use the "Mersenne Twister" type generator from GNU Scientific Library if available". I should have read up more. — Z boson, Dec 09 '13 at 07:27
If you are interested in parallel random number generation you should also look at http://www.thesalmons.org/john/random123/papers/random123sc11.pdf which won the best paper award at SC11. — Jim Cownie, Dec 09 '13 at 09:57
Thanks for the paper. I printed it out and will read it. The broadwell processor comes out in a few months and will have the RDSEED instruction which will generate true random numbers. I wonder what impact this will have. — Z boson, Dec 09 '13 at 13:06

score 0 · Accepted Answer · edited May 23 '17 at 12:02

Try the code below. It makes a private state for each thread. I did something similar with the at rand_r function Why does calculation with OpenMP take 100x more time than with a single thread?

Edit: I updated my code using some of Hristo's suggestions. I used threadprivate (for the first time). I also used a better rand function which gives a better estimate of pi but it's still not good enough.

One strange things was I had to define the function rng_int after threadprivate otherwise I got an error "error: 'state' declared 'threadprivate' after first use". I should probably ask a question about this.

//gcc -O3 -Wall -pedantic -fopenmp main.c
#include <omp.h>
#include <stdio.h>

#define RNG_MOD 0x80000000
int state;

int rng_int(void);
double rng_doub(double range);

int main() {
    int i, numIn, n;
    double x, y, pi;

    n = 1<<30;
    numIn = 0;

    #pragma omp threadprivate(state)
    #pragma omp parallel private(x, y) reduction(+:numIn) 
    {

        state = 25234 + 17 * omp_get_thread_num();
        #pragma omp for
        for (i = 0; i <= n; i++) {
            x = (double)rng_doub(1.0);
            y = (double)rng_doub(1.0);
            if (x*x + y*y <= 1) numIn++;
        }
    }
    pi = 4.*numIn / n;
    printf("asdf pi %f\n", pi);
    return 0;
}

int rng_int(void) {
   // & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
   return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}

double rng_doub(double range) {
    return ((double)rng_int()) / (((double)RNG_MOD)/range);
}

You can see the results (and edit and run the code) at http://coliru.stacked-crooked.com/a/23c1753a1b7d1b0d

Thank you very much. This works so far. But if I increase the n, than the result did not became closer to pi. Tried up to 10000000. And also if I integrate the num of thread, which i should increase too, the result also differs, but not in the right way. Think it depends on rng functions itself, so not your fault. But now I think i understand how to handle the state var, so thats enough for me to know so far. Thank you very much again. — Danny Archer, Dec 08 '13 at 16:26
The could be a problem with using the custom random number generator. I don't know enough about pseudo random number generators. I feel more comfortable using `rand_r`. But the main point is that you want to have a different private seed for each thread and not a shared seed. — Z boson, Dec 08 '13 at 17:17
Also there might be an overflow which calculating the state. You might want to use 64 bit ints when calculating the state. — Z boson, Dec 08 '13 at 17:28
How about using a threadprivate global state instead of passing it as an argument to each call? — Hristo Iliev, Dec 09 '13 at 05:35
@HristoIliev, I edited the code using threadprivate (first time I have used it) and changed the rand function as you suggested. I got a strange error which I worked around ""error: 'state' declared 'threadprivate' after first use". Maybe I'll ask a SO question about it. — Z boson, Dec 09 '13 at 13:02

Correct OpenMP pragmas for pi monte carlo in C with not thread-safe random number generator

2 Answers2