I am running a simulation that generates many random numbers. The RNG is implemented as a C++ object with a public method that returns the next random number. To use it with OpenMP parallelization, I simply create an array of pointers to such RNG objects, one per thread. Each thread then generates its own random numbers by calling its own RNG, e.g.:
for (int i = 0; i < iTotThreads; i++) {
  aRNG[i] = new RNG();
}
// ... stuff here
#pragma omp parallel
{
  int iT = omp_get_thread_num();  // declared inside the region, so private to each thread
  #pragma omp for
  for ( /* big loop */ ) {
    // more stuff
    aRNG[iT]->getRandomNumber();
    // more stuff
  }
}
Even though each RNG works only on its own member variables, and no two of them fit within a single cache line (I also tried explicitly aligning each of them at creation), there seems to be some false sharing going on, as the code does not scale at all.
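For reference, the alignment attempt looked roughly like this (a sketch, assuming a 64-byte cache line; the PaddedRNG wrapper name is illustrative, and C++17 is needed for plain new to honor the extended alignment):

#include <vector>

// Sketch: pad each RNG out to its own cache line (64 bytes assumed).
// alignas(64) also rounds sizeof(PaddedRNG) up to a multiple of 64,
// so no two wrappers can ever share a cache line.
struct alignas(64) PaddedRNG {
  RNG rng;
};

std::vector<PaddedRNG*> aRNG(iTotThreads);
for (int i = 0; i < iTotThreads; i++) {
  aRNG[i] = new PaddedRNG();  // C++17: operator new respects alignas(64)
}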
If I instantiate the objects within an omp parallel region:
#pragma omp parallel
{
  int i = omp_get_thread_num();  // again private by construction
  aRNG[i] = new RNG();
}
the code scales perfectly. Do you have any idea what I am missing here?
EDIT: by the way, in the second case (the one that scales well), the parallel region in which I create the RNGs is not the same as the one in which I use them. I'm counting on the fact that when I enter the second parallel region, every pointer in aRNG[] will still point to one of my objects, but I guess this is bad practice...
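To make the EDIT concrete, the scaling version looks like this overall (a sketch; iBigN is a placeholder loop bound, and I am assuming omp_get_thread_num() gives the same thread-to-index mapping in both regions):

#pragma omp parallel
{
  int i = omp_get_thread_num();
  aRNG[i] = new RNG();  // each thread creates (and first writes) its own RNG
}

// ... stuff here, outside any parallel region

#pragma omp parallel
{
  int iT = omp_get_thread_num();
  #pragma omp for
  for (int n = 0; n < iBigN; n++) {  // iBigN: placeholder loop bound
    aRNG[iT]->getRandomNumber();
  }
}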