
I am running a simulation in which many random numbers are generated. The RNG is implemented as a C++ object with a public method that returns the next random number. To use it with OpenMP parallelization, I simply create an array of pointers to such RNG objects, one for every thread. Each thread then generates its own random numbers by calling its own RNG. E.g.:

  for (int i = 0; i < iTotThreads; i++) {
    aRNG[i] = new RNG();
  }
  // ... stuff here
#pragma omp parallel
  {
    int iT = omp_get_thread_num();  // declared inside the region, hence private
#pragma omp for
    for ( /* big loop */ ) {
      // more stuff
      aRNG[iT]->getRandomNumber();
      // more stuff
    }
  }

Even though each RNG works only on its own member variables, and two such RNGs do not fit within a single cache line (I also tried explicitly aligning each of them at creation), there seems to be some false sharing going on, as the code does not scale at all.
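For concreteness, one way to do such explicit alignment is sketched below; the 64-byte line size is an assumption about the target CPU, and note that pre-C++17 `operator new` is not guaranteed to honour over-alignment:

  // Sketch: pad each generator out to a whole cache line so that two RNGs
  // can never share one. alignas is C++11; g++ also accepts the equivalent
  // __attribute__((aligned(64))) syntax.
  struct alignas(64) PaddedRNG {
    RNG rng;  // alignas rounds sizeof(PaddedRNG) up to a multiple of 64
  };

  PaddedRNG* aRNG = new PaddedRNG[iTotThreads];  // contiguous, line-aligned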

If I instantiate the objects within an omp parallel region:

#pragma omp parallel
  {
    int i = omp_get_thread_num();  // again private to each thread
    aRNG[i] = new RNG();
  }

the code scales perfectly. Do you have any idea of what I am missing here?

EDIT: by the way, in the second case (the one that scales well), the parallel region in which I create the RNGs is not the same as the one in which I use them. I'm counting on the fact that when I enter the second parallel region every pointer in aRNG[] will still point to one of my objects, but I guess this is bad practice...

AstralCar
  • Are you using any global variables (or static variables) in your random number generator? – jcxz Jan 09 '14 at 10:10
  • The result is normalized to a `static const unsigned long MY_MAX_RAND` before being returned, but otherwise each RNG only writes to its own private member variables and arrays. – AstralCar Jan 09 '14 at 10:14
  • Unrelated, but why are you using pointers here?! – Konrad Rudolph Jan 09 '14 at 13:22
  • The constructor of the RNG class actually takes some arguments so that the RNG of each parallel thread is initialized with a different seed. It was just more convenient for me to have an array of pointers, and then call `new RNG()` for each of them. – AstralCar Jan 09 '14 at 13:52
  • Most memory allocators nowadays are thread-aware and use separate per-thread memory arenas. Try adding a dummy padding variable the size of a cache line to your PRNG state, and make sure the compiler does not optimise it out. – Hristo Iliev Jan 09 '14 at 21:58
  • @Hristo No padding or aligning of the member variables seems to help. I'm not sure of what was going on (I'll investigate more when I have time), but the solution you proposed below works perfectly! – AstralCar Jan 10 '14 at 10:21
  • That makes no sense. What kind of system are you running your code on? – Hristo Iliev Jan 10 '14 at 10:31
  • I'm on a Linux 8-core workstation and I use g++. I tried with `__attribute__ (( aligned(64) ))` on the state vector of the RNG and a couple other member variables, as well as on the RNG class itself, to no avail. Padding the RNG by 64 or 128 bytes also does not change the scaling in any way. – AstralCar Jan 10 '14 at 10:45
  • I forgot: it's a Xeon E5607 with cache alignment 64, 8192 KB of cache. – AstralCar Jan 10 '14 at 11:18
  • You are running on a NUMA system. When you allocate the PRNGs in the main thread, all their state vectors end up on the NUMA node where the main thread executes, and for half the threads those are in remote memory (therefore slower to access). When you have each thread allocate its own PRNG, the memory is allocated on the same node where the thread executes and therefore the access is local. This becomes important with large data sets since your CPU has only 8 MB of last-level cache. Also, are you binding your threads? (See the sketch after these comments.) – Hristo Iliev Jan 13 '14 at 08:59
  • Wow, thank you for the explanation. This makes sense. I'm not binding the threads at the moment, but it's a good option for me. I'd prefer parallel allocation + thread binding, rather than having the PRNG as threadprivate (I need to declare it as static in order to use it as threadprivate - it is a member of another class - and I just noticed that this option slows down things a little). – AstralCar Jan 13 '14 at 11:42
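To illustrate the first-touch placement and thread binding discussed in the comments above, a minimal sketch follows; the affinity mechanisms named are GNU OpenMP's and OpenMP 4.0's, and the core list is an assumption about this particular machine:

  // Sketch: allocate each PRNG from the thread that will use it, so the
  // first-touch policy places its pages on that thread's local NUMA node.
  // To keep threads from migrating away afterwards, bind them, e.g.:
  //   GNU OpenMP:  export GOMP_CPU_AFFINITY="0-7"  (core list is machine-specific)
  //   OpenMP 4.0:  export OMP_PROC_BIND=true
#pragma omp parallel
  {
    int i = omp_get_thread_num();
    aRNG[i] = new RNG();  // allocated (and first-touched) locally
  }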

1 Answer


Although I doubt from your description that false sharing is the cause of your problem, why don't you simplify the code in this way:

  // ... stuff here
#pragma omp parallel
  {
    RNG rng;  // one automatic-storage generator per thread
#pragma omp for
    for ( /* big loop */ ) {
      // more stuff
      rng.getRandomNumber();
      // more stuff
    }
  }

Being declared inside a parallel region, `rng` will be a private variable with automatic storage duration, so:

  • each thread will have its own private random number generator (no false sharing possible here)
  • you don't have to manage allocation/deallocation of a resource

In case this approach is unfeasible, and following the suggestion of @HristoIliev, you can always declare a threadprivate variable to hold the pointer to the random number generator:

static std::shared_ptr<RNG> rng;
#pragma omp threadprivate(rng)

and allocate it in the first parallel region:

rng.reset( new RNG );

In this case, though, there are a few caveats to ensure that the value of `rng` will be preserved across parallel regions (quoting from the OpenMP 4.0 standard):

The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active parallel regions only if all the following conditions hold:

  • Neither parallel region is nested inside another explicit parallel region.
  • The number of threads used to execute both parallel regions is the same.
  • The thread affinity policies used to execute both parallel regions are the same.
  • The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions.

If these conditions all hold, and if a threadprivate variable is referenced in both regions, then threads with the same thread number in their respective regions will reference the same copy of that variable.
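Putting the pieces together, here is a minimal sketch of the raw-pointer variant suggested in the comments (the function names are placeholders; it assumes, as in the question, that the `RNG` constructor takes a seed and that the thread count stays the same across regions):

#include <omp.h>

static RNG* rng = nullptr;  // one copy of this pointer per thread
#pragma omp threadprivate(rng)

void setupGenerators() {
#pragma omp parallel
  {
    // Each thread allocates (and first-touches) its own generator,
    // seeded e.g. with its thread number.
    rng = new RNG(omp_get_thread_num());
  }
}

void runSimulation(int iSteps) {
#pragma omp parallel
  {
#pragma omp for
    for (int i = 0; i < iSteps; i++) {
      rng->getRandomNumber();  // strictly thread-local state
    }
  }
}

void teardownGenerators() {
#pragma omp parallel
  {
    delete rng;  // freed by the thread that owns it
    rng = nullptr;
  }
}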

Massimiliano
  • It behaves like false sharing, but I agree that it is an unlikely explanation... I just have no idea of what could be the issue. There should be no conflict between the threads. Regarding your suggestion: the code is actually much more structured than in my example. Unfortunately, I cannot declare the RNG within the same parallel region in which I am using it; it would mean creating a new RNG at every simulation step. – AstralCar Jan 09 '14 at 14:01
  • @AstralCar, declare `static RNG* rng; #pragma omp threadprivate(rng)`; allocate it in the first parallel region; delete it in the last parallel region. – Hristo Iliev Jan 09 '14 at 21:22
  • I was looking into using `threadprivate` but I did not know that it works with static pointers. Works perfectly now even without resorting to `std::shared_ptr`. Thanks guys! – AstralCar Jan 10 '14 at 10:12
  • @AstralCar Resorting to `shared_ptr` is just for resource management. In fact, you allocate the thing in the first parallel region and can forget about its deallocation: `shared_ptr` will take care of that in its destructor. – Massimiliano Jan 10 '14 at 10:21
  • @Massimiliano Ok thanks! I'll try it out... Since we have >80,000 lines of code, I usually prefer to write out explicitly all creations and destructions. – AstralCar Jan 10 '14 at 10:42
  • @HristoIliev Just a question: in this case, does `threadprivate` produce the same effect as `firstprivate`? – lorniper Aug 31 '15 at 15:34
  • @lorniper, `firstprivate` variables do not persist across parallel regions. `threadprivate` variables do. Also, `threadprivate` variables are not initialised like `firstprivate` since there is no "parent" variable to get their value from. – Hristo Iliev Aug 31 '15 at 20:05
  • @HristoIliev Thanks for answering. Perhaps I should ask in an independent question, but could you be more specific about "firstprivate variables do not persist across parallel regions"? – lorniper Sep 01 '15 at 14:53
  • @lorniper, think of `threadprivate` as global (or static) variables that exist outside the parallel region and therefore have their value preserved (though you can only access the value in the master thread when not in a region), and of `firstprivate` as automatic variables that get created on entry into the parallel region and destroyed on exit. Also, `firstprivate` variables get initialised by copying a variable from the master thread, therefore all threads start with the same value of a given `firstprivate` variable. (See the sketch after these comments.) – Hristo Iliev Sep 02 '15 at 06:56
  • @HristoIliev I'd appreciate it if you could answer this question: http://stackoverflow.com/questions/32347008/confused-about-firstprivate-and-threadprivate-in-openmp-context – lorniper Sep 02 '15 at 07:28
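As a concrete illustration of the `threadprivate` vs. `firstprivate` distinction described in the comments above, here is a small self-contained sketch (all variable names are hypothetical):

#include <cstdio>
#include <omp.h>

static int counter = 0;
#pragma omp threadprivate(counter)  // one persistent copy per thread

int main() {
  int x = 42;  // the master's "parent" variable

#pragma omp parallel firstprivate(x)
  {
    // Every thread starts with its own x == 42, copied from the master;
    // these copies are destroyed when the region ends.
    x += omp_get_thread_num();
    counter = x;  // stash the value in the persistent per-thread copy
  }

#pragma omp parallel
  {
    // counter still holds each thread's value from the previous region
    // (subject to the persistence conditions quoted in the answer).
    std::printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
  }
  return 0;
}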