
I'm getting a "bus error" from an OpenMP parallel section of code. I recreated a simple version of my problem below. The code essentially makes many calls to the function uniform_distribution, which draws an integer between 0 and 20000 using Boost's uniform_int_distribution.

This post warns of two threads accessing the same object. I'm guessing that's eng in my case. (Unfortunately I don't know how to write "an appropriate mutex wrapper", as that post suggests).

A possible dirty solution I thought of was to create a local eng inside the #pragma for loop and to pass that as an argument to uniform_distribution. I don't like this idea because in my real code, I'm calling many functions, and passing a local eng would be cumbersome. Also, my concern is that different threads will generate the same random number sequence if I declare eng inside uniform_distribution. So I have two requirements: How do I parallelize in a way that

  1. Each thread is generating probabilistically independent draws from other threads?
  2. No race conditions occur on the RNG?

Thanks; any help is warmly appreciated.

#include <omp.h>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_int_distribution.hpp>

boost::random::mt19937  eng;

int uniform_distribution(int rangeLow, int rangeHigh) {
    boost::random::uniform_int_distribution<int> unirv(rangeLow, rangeHigh);
    return unirv(eng);
}
int main()
{  
    # pragma omp parallel for private(eng)
    for (int bb=0; bb<10000; bb++)
        for (int i=0; i<20000; i++)
             int a = uniform_distribution(0,20000);

    return 0;
}
covstat

3 Answers


When you parallelize code, you must consider shared resources, which can cause data races that may in turn break your program. (Note: not all data races will break your program.)

In your case, as you correctly suspected, eng is shared by two or more threads, which must be avoided for correct execution.

A solution for your case is privatization: making a per-thread copy of the shared resource. You need to create a separate copy of eng for each thread.

There are a number of ways to privatize eng:

(1) Try the threadprivate directive (link): for example, #pragma omp threadprivate(eng). However, some compilers may not support non-POD types with this directive.
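
For example, something along these lines (assuming your compiler accepts a non-POD type in threadprivate):

#include <omp.h>
#include <boost/random/mersenne_twister.hpp>

boost::random::mt19937 eng;
// Every thread gets its own copy of eng. Each copy should still be seeded
// differently (e.g., with omp_get_thread_num()) so the threads' streams differ.
#pragma omp threadprivate(eng)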

(2) If threadprivate is not available, use an array of engines and index it by thread id: declare something like eng[MAX_THREAD], then access eng[omp_get_thread_num()].

However, the second solution needs to consider false sharing, which can severely hurt performance. It's best to guarantee that each element of eng[MAX_THREAD] is allocated on a separate cache-line boundary, which is typically 64 bytes on modern desktop CPUs. There are several ways to avoid false sharing; the simplest would be padding, e.g., a char padding[x] member in a struct that holds eng.
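
A sketch of (2) with padding could look like the following (MAX_THREAD and the 64-byte line size are assumptions; each engine would still need its own seed so the threads' draws are independent):

#include <omp.h>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_int_distribution.hpp>

const int MAX_THREAD = 64;
const int CACHE_LINE = 64;

struct PaddedEngine {
    boost::random::mt19937 eng;
    // Pad each element up to a multiple of CACHE_LINE bytes (the array itself
    // should also be cache-line aligned, e.g., via a compiler alignment extension).
    char padding[CACHE_LINE - sizeof(boost::random::mt19937) % CACHE_LINE];
};

PaddedEngine engines[MAX_THREAD];

int uniform_distribution(int rangeLow, int rangeHigh) {
    boost::random::uniform_int_distribution<int> unirv(rangeLow, rangeHigh);
    // Index by the calling thread's id so no two threads touch the same engine.
    return unirv(engines[omp_get_thread_num()].eng);
}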

minjang
  • A private clause will in fact create a thread-local version of a variable. Unlike a variable to which a threadprivate directive has been applied, it will be unallocated when the threads join. However, this is not a problem in the OP's example. The reason the private clause won't work here is that the eng referenced in uniform_distribution is the global one, not the thread-specific one. – jerry Mar 19 '13 at 18:26
  • Yes, your point regarding `private` is correct. I removed a sentence. But the problem the OP has is essentially a privatization issue; making the engine thread-local is a solution. Your second solution is also a form of privatization. – minjang Mar 19 '13 at 19:23
  • @minjang: Thank you for your helpful answer. I tried (1), but threadprivate complained about the `eng` argument. Can you elaborate on (2), possibly with an example? I'm afraid I don't quite understand the comment on _false sharing_. – covstat Mar 19 '13 at 21:25
  • Yes, for some OpenMP implementations, `threadprivate` only accepts POD types. For the second solution, first just declare `eng[MAX_THREAD]` and see if it works well. Then you can attack the false sharing problem later. – minjang Mar 19 '13 at 21:45
  • @minjang: Your suggestion, coupled with a version of Mikael Persson's seeding, does the trick. I'm now interested in your false sharing comment. My cache line size is 64, and sizeof(eng[0]) is 2504. Does this mean false sharing won't be an issue, or should I still be worried? Thanks! – covstat Mar 26 '13 at 15:36
  • First, check that your speedup is good. Regarding false sharing, the element size is greater than the cache line, but it'd be better to allocate elements at multiples of the cache-line size. You could get false sharing if the last few bytes of eng[0] and the first few bytes of eng[1] share a cache line. So, pad each element to a multiple of 64; in this case that would be char[56]. Also, align the array using compiler extensions like __declspec(align(64)). – minjang Mar 26 '13 at 19:46

I think the most convenient solution would involve a thread_local RNG seeded with something unique to each thread, for example the system time XORed with the thread id. Something along the lines of (using C++11):

#include <omp.h>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_int_distribution.hpp>

#include <thread>
#include <functional>
#include <ctime>

boost::random::mt19937& get_rng_engine() {
  // Seed each thread's engine once, using the time XOR a hash of the thread id.
  thread_local boost::random::mt19937 eng(
    static_cast<unsigned int>(std::time(NULL)) ^
    static_cast<unsigned int>(std::hash<std::thread::id>()(std::this_thread::get_id())));
  return eng;
}
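
You could then route the OP's uniform_distribution through it, something like:

int uniform_distribution(int rangeLow, int rangeHigh) {
  boost::random::uniform_int_distribution<int> unirv(rangeLow, rangeHigh);
  // Each call draws from the calling thread's own engine.
  return unirv(get_rng_engine());
}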

(NOTE: you can also use <random> if you are going to use C++11)

If you can't use C++11, you can use boost::thread instead for similar behavior; see also the Boost page on thread-local storage.
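
For example, with boost::thread_specific_ptr the engine can be kept per thread like this (just a sketch; the seeding shown is only illustrative):

#include <omp.h>
#include <boost/thread/tss.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <ctime>

boost::thread_specific_ptr<boost::random::mt19937> tls_eng;

boost::random::mt19937& get_rng_engine() {
    if (tls_eng.get() == 0) {
        tls_eng.reset(new boost::random::mt19937());
        // Illustrative seeding: time XOR the OpenMP thread number, so that
        // each thread gets a different stream.
        tls_eng->seed(static_cast<unsigned int>(std::time(NULL)) ^
                      static_cast<unsigned int>(omp_get_thread_num()));
    }
    return *tls_eng;
}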

Mikael Persson

You have two options:

  • have individual random number generators for each thread and seed them differently
  • use mutual exclusion

First, an example of mutual exclusion:

# pragma omp parallel for
for (int bb=0; bb<10000; bb++)
{
    for (int i=0; i<20000; i++)
    {
        // enter critical region, disallowing simultaneous access to eng
        #pragma omp critical
        {
            int a = uniform_distribution(0,20000);
        }
        // presumably some more code...
    }
    // presumably some more code...
}

Next, an example of thread-local storage with seeding:

# pragma omp parallel
{
    // declare and seed thread-specific generator
    boost::random::mt19937 eng(omp_get_thread_num());
    #pragma omp for
    for (int bb=0; bb<10000; bb++)
    {
        for (int i=0; i<20000; i++)
        {
            int a = uniform_distribution(0,20000, eng);
            // presumably some more code...
        }
        // presumably some more code...
    }
}
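
The second snippet assumes uniform_distribution has been changed to take the generator by reference, along these lines (reusing the headers from the question):

// Assumed modification: uniform_distribution takes the generator by reference.
int uniform_distribution(int rangeLow, int rangeHigh, boost::random::mt19937& eng)
{
    boost::random::uniform_int_distribution<int> unirv(rangeLow, rangeHigh);
    return unirv(eng);
}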

Both of these snippets are just illustrative; depending on your requirements (say, security-related vs. a game vs. modelling) you may want to pick one over the other. You will probably also want to change the exact implementation to suit your usage. For instance, how you seed the generator matters if you want the output to be either reproducible or closer to truly random (whether that's possible is system-specific). This applies to both solutions equally (though getting reproducibility in the mutual-exclusion case is harder).

The thread-local generator may run faster while the mutual exclusion case should use less memory.

EDIT: To be clear, the mutual-exclusion solution only makes sense if generating the random numbers is not the bulk of each thread's work (that is, the // presumably some more code... in the example exists and takes a non-trivial amount of time). The critical section only needs to encompass the access to the shared variable; restructuring your code a little would give you finer control over that (and, in the thread-local storage case, could also let you avoid passing an eng reference around).
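
For instance, a sketch that keeps the original global eng but narrows the critical section to just the draw:

int uniform_distribution(int rangeLow, int rangeHigh)
{
    boost::random::uniform_int_distribution<int> unirv(rangeLow, rangeHigh);
    int result;
    // Only the draw from the shared engine is serialized; any surrounding
    // per-iteration work stays parallel.
    #pragma omp critical(rng_draw)
    {
        result = unirv(eng);
    }
    return result;
}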

jerry
  • While using a critical section provides correctness, it effectively serializes the code, so you get no performance improvement. So I don't think a critical section is a good solution for such an easily parallelized case. – minjang Mar 19 '13 at 19:15
  • @jerry: Thank you for your answer. I tried both; the first one resolved the race condition, but resulted in no performance gain (but it may have been because my test code was too simple), which is what minjang predicted. Your second answer is similar to minjang's, but it appears to be behaving strangely. I will continue experimenting with your second method. – covstat Mar 19 '13 at 21:41
  • @covstat, regarding jerry's second solution, try putting `private(eng)` on the `omp for`. This code will still share `eng`. – minjang Mar 19 '13 at 21:53
  • @minjang In this example code, that's definitely true. However, the OP stated that the example is a simplified version, I'm assuming the original code does more than just calculate the random numbers in the loops (that's what I meant by `// presumably some more code...`). Depending on how much that is, the effect of the mutual exclusion could be anywhere from unmeasurable to effectively serializing. – jerry Mar 20 '13 at 02:30
  • @covstat What sort of strange results are you seeing? I believe `eng` should be private as it's declared within the parallel structure, but it sounds like your implementation might have a problem with non-POD thread-local data. To be honest, this isn't something I've run into before, but clearly minjang has. What compiler/version are you using? What version of the OpenMP standard does it implement? Also, out of curiosity, how exactly did you modify uniform_distribution to take the generator? – jerry Mar 20 '13 at 03:07