
I was using the rand() function to generate pseudo-random numbers between 0 and 1 for a simulation, but when I decided to make my C++ code run in parallel (via OpenMP) I noticed that rand() isn't thread-safe and also isn't very uniform.

So I switched to a (supposedly) more uniform generator presented in many answers to other questions. It looks like this:

double rnd(const double & min, const double & max) {
    static thread_local mt19937* generator = nullptr;
    if (!generator) generator = new mt19937(clock() + omp_get_thread_num());
    uniform_real_distribution<double> distribution(min, max);
    return fabs(distribution(*generator));
}

But my simulation results now contained many scientific errors: results that contradicted both what rand() produced and common sense.

So I wrote a program that generates 500k random numbers with this function, computes their average, repeats this 200 times, and plots the results:

double SUM = 0;
const int w_max = 500000;
for (double r = 0; r <= 10; r += 0.05) {
    #pragma omp parallel for ordered schedule(static)
    for (int w = 1; w <= w_max; w++) {
        double a = rnd(0, 1);
        SUM = SUM + a;
    }
    SUM = SUM / w_max;
    ft << r << '\t' << SUM << '\n';
    SUM = 0;
}

We know that if, instead of 500k numbers, I could draw infinitely many, the plot would be a flat line at 0.5. With 500k we expect small fluctuations around 0.5.

When running the code with a single thread, the result is acceptable:

[plot: run averages, single thread]

But here is the result with 2 threads:

[plot: run averages, 2 threads]

3 threads:

[plot: run averages, 3 threads]

4 threads:

[plot: run averages, 4 threads]

I do not have my 8-thread CPU at hand right now, but the results were even worse there.

As you can see, the results are neither uniform nor stable; they fluctuate strongly around their average.

So is this pseudo-random generator thread-unsafe too?

Or am I making a mistake somewhere?

Alireza
  • Have you tried using better seeds for your random number generators? – François Andrieux May 15 '18 at 15:51
  • @FrançoisAndrieux What is your suggestion? I thought clock and thread number is both random and thread-safe. – Alireza May 15 '18 at 15:52
  • Although not directly related : _`rand()` does not **need** to be threadsafe (cfr. [POSIX `stdlib.h`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/rand.html) )_ – kvantour May 15 '18 at 15:54
  • Your testing code as written has a race condition on `SUM`... – Max Langhof May 15 '18 at 15:56

2 Answers


There are three observations about your test output I would make:

  • It has much stronger variance than a good random source's average should provide. You observed this yourself by comparing to the single thread results.

  • The calculated average decreases with thread count and never reaches the original 0.5 (i.e. it's not just higher variance but also changed mean).

  • There is a temporal relation in the data, particularly visible in the 4 thread case.

All this is explained by the race condition present in your code: you assign to SUM from multiple threads. Incrementing a double is not an atomic operation (even on x86, where reads and writes of aligned doubles are themselves atomic). Two threads may read the current value (e.g. 10), increment it (e.g. both add 0.5) and then write the value back to memory. Now both threads write back 10.5 instead of the correct 11.

The more threads try to write to SUM concurrently (without synchronization), the more of their changes are lost. This explains all observations:

  • How hard the threads race each other in each individual run determines how many results are lost, and this can vary from run to run.

  • The average is lower with more races (for example more threads) because more results are lost. You can never exceed the statistical 0.5 average because you only ever lose writes.

  • As the threads and scheduler "settle in", the variance is reduced. This is a similar reason to why you should "warm up" your tests when benchmarking.

Needless to say, this is undefined behavior. It just shows benign behavior on x86 CPUs, but this is not something the C++ standard guarantees you. For all you know, the individual bytes of a double might be written to by different threads at the same time resulting in complete garbage.

The proper solution would be adding the doubles thread-locally and then (with synchronization) adding them together in the end. OMP has reduction clauses for this specific purpose.

For integral types, you could use std::atomic<IntegralType>::fetch_add(). std::atomic<double> exists but (before C++20) the mentioned function (and others) are only available for integral types.

Max Langhof
  • Thanks. This seems logical, will declaring SUM as an array with size of number of threads and assign threads to write in their own SUM and summing them up when the loop is finished fix the problem? – Alireza May 15 '18 at 16:07
  • Yes, that would fix it, but you can only get the number of threads while in a parallel region so it's a bit awkward (although possible) to do this. But OMP also provides reduction clauses that can do this for you. – Max Langhof May 15 '18 at 16:09
  • @Alireza while that solution is correct, if unpadded it has terrible performance due to false sharing. This is a good example of why one should use the idiomatic solution (reduction clause). – Zulan May 15 '18 at 16:13
  • Tried the reduction clause. Worked. Thank you. – Alireza May 15 '18 at 16:15
  • Regarding atomic, I would recommend `#pragma omp atomic` rather than `std::atomic` (https://stackoverflow.com/q/41309299/620382) – Zulan May 15 '18 at 16:19

The problem is not in your RNG, but in your summation. There is simply a race condition on SUM. To fix this, use a reduction, e.g. change the pragma to:

#pragma omp parallel for ordered schedule(static) reduction(+:SUM)

Note that the behavior of thread_local with OpenMP is technically undefined. It will probably work in practice, but the interaction between OpenMP and C++11 threading concepts is not well specified (see also this question). So the safe, OpenMP-only alternative for you would be:

static mt19937* generator = nullptr;
#pragma omp threadprivate(generator)
Zulan