
I am trying to generate random numbers with the PCG method. I have tested two different implementations, given by blocks 1 and 2 in the following code. Block 1 is correct and scales as expected with the number of threads. Block 2 does not scale the right way, and I do not understand what is wrong with it.

#include <chrono>
#include <iostream>
#include <omp.h>

#include "../include_random/pcg_basic.hpp"

int main()
{
    // Declarations elided from the larger code; placeholder values for a minimal example.
    const int threadSNumber = 4;    // number of OpenMP threads
    const int N = 100000000;        // number of random draws
    double sum = 0.0;
    std::chrono::time_point<std::chrono::system_clock> startingTime;

    // /*Block 1*/
    // omp_set_num_threads (threadSNumber);
    // startingTime = std::chrono::system_clock::now();
    // #pragma omp parallel
    // {
    //     int threadID = omp_get_thread_num();
    //     pcg32_random_t rng;
    //     pcg32_srandom_r(&rng, time(NULL) ^ (intptr_t)&printf,(intptr_t)&threadID);
    //     // uint32_t bound =1;
    //     #pragma omp for reduction (+:sum)
    //     for (int step = 0; step < N; step++)
    //     {
    //         // sum += 0.5 - (double)pcg32_boundedrand_r(&rng,bound);
    //         sum += 0.5 -((double)pcg32_random_r(&rng)/(double)UINT32_MAX);
    //     }
    // }
    /**Block 2**/
    omp_set_num_threads (threadSNumber);
    pcg32_random_t *rng;
    rng = new pcg32_random_t[threadSNumber];
    #pragma omp parallel
    {
        int threadID = omp_get_thread_num();
        pcg32_srandom_r(&rng[threadID], time(NULL) ^ (intptr_t)&printf,(intptr_t)&threadID);
    }
    startingTime = std::chrono::system_clock::now();
    #pragma omp parallel
    {
        int threadID = omp_get_thread_num();
        #pragma omp for reduction (+:sum)
        for (int step = 0; step < N; step++)
        {
            sum += 0.5 -((double)pcg32_random_r(&rng[threadID])/(double)UINT32_MAX);
        }
    }
    delete[] rng;
    /****/
    auto end = std::chrono::system_clock::now();
    auto diff = end - startingTime;
    double total_time = std::chrono::duration<double, std::ratio<1>>(diff).count();
    std::cout << "The result of the sum is " << sum / N << "\n" << std::endl;
    std::cout << "# Total time:  " << (int)total_time / 3600 << "h    "
              << ((int)total_time % 3600) / 60 << "m    " << (int)total_time % 60
              << "s        (" << total_time << " s)" << std::endl;
    return 0;
}

Block 1 scales as expected with the number of threads, but block 2 does not:

# threads     1      2      3      4
block1 (s)   3.27   1.64   1.12   0.83
block2 (s)   4.60  13.7    8.28  10.9

These are minimal examples to reproduce the issue; they are extracted from a larger function in a bigger code base.

I want to initialize the seed only once; at every time step I compute a bunch of random numbers which are used in another function (not doing the sum as here, which is only done to record something). It is possible to use block 1, but that means initializing the seed at each time step instead of once. Moreover, I do not understand the scaling of block 2.

What is wrong in block 2? Why do I get this scaling? The threads are not using the same rng, so I should avoid a data race, or am I misunderstanding something?

Hunken
  • I haven't really read through your code, but seeing the array of rngs there is probably false sharing (more than one rng on the same cache line). See [Understanding `std::hardware_destructive_interference_size` and `std::hardware_constructive_interference_size`](https://stackoverflow.com/q/39680206/10107454) – paleonix Jan 07 '23 at 23:58
  • I'm guessing the `time` function is not thread-safe. Since you want to write C++, why not use `chrono`, which *is* thread-safe. (Also, don't use `new`.) – Victor Eijkhout Jan 08 '23 at 01:40
  • Use RAII instead of `new`, i.e. `std::vector` or at least `std::make_unique`. – paleonix Jan 08 '23 at 01:49
  • If I replace the pointer by a `vector`, I get shorter time but the same scaling with the thread number. `(s) 3.40 6.95 4.53 5.27`. I read the linked topic but I do not understand how to solve the cache line issue in my case. Should I modify the `pcg32_random_t` structure? – Hunken Jan 08 '23 at 20:08
  • Using RAII is about memory safety, not performance. The linked Standard Library feature gives you information about the distance that two objects in memory need to have to avoid false sharing (being on the same cache line). See the usage of `alignas` for `cache_int` in the code block of the first answer. – paleonix Jan 08 '23 at 22:19

1 Answer


@paleonix Indeed, the issue comes from false sharing. The following code solves the weird scaling with the thread number; however, the first block is still the fastest one.

#include <new>     // std::hardware_destructive_interference_size (C++17)
#include <vector>

#ifdef __cpp_lib_hardware_interference_size
    using std::hardware_constructive_interference_size;
    using std::hardware_destructive_interference_size;
#else
    // 64 bytes on x86-64 │ L1_CACHE_BYTES │ L1_CACHE_SHIFT │ __cacheline_aligned │ ...
    constexpr std::size_t hardware_constructive_interference_size = 64;
    constexpr std::size_t hardware_destructive_interference_size = 64;
#endif

struct cache_pcg32 {
    alignas(hardware_destructive_interference_size) pcg32_random_t rng;
};

inline uint64_t timeForSeed()
{
    return std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now().time_since_epoch()).count();
}
int main()
{
    // Declarations elided from the larger code; placeholder values for a minimal example.
    const int threadSNumber = 4;    // number of OpenMP threads
    const int N = 1000;             // random draws per outer iteration
    double sum = 0.0;
    std::chrono::time_point<std::chrono::system_clock> startingTime;

    // /*Block 1*/
    // omp_set_num_threads (threadSNumber);
    // startingTime = std::chrono::system_clock::now();
    // for (int i = 0; i < 200000; i++)
    // {
    //     #pragma omp parallel
    //     {
    //         int threadID = omp_get_thread_num();
    //         // int threadID = 1;
    //         pcg32_random_t rng;
    //         pcg32_srandom_r(&rng, timeForSeed() ^ (intptr_t)&printf,(intptr_t)&threadID);
    //         #pragma omp for reduction (+:sum)
    //         for (int step = 0; step < N; step++)
    //         {
    //             sum += 0.5 -((double)pcg32_random_r(&rng)/(double)UINT32_MAX);
    //         }
    //     }
    // } 
    
    /**Block 2**/
    startingTime = std::chrono::system_clock::now();
    omp_set_num_threads (threadSNumber);
    std::vector<cache_pcg32> rngVector(threadSNumber);

    #pragma omp parallel
    {
        int threadID = omp_get_thread_num();
        pcg32_srandom_r(&rngVector[threadID].rng, timeForSeed() ^ (intptr_t)&printf,(intptr_t)&threadID);
    }
    for (int i = 0; i < 200000; i++)
    {
        #pragma omp parallel
        {
            int threadID = omp_get_thread_num();
            #pragma omp for reduction (+:sum)
            for (int step = 0; step < N; step++)
            {
                sum += 0.5 -((double)pcg32_random_r(&rngVector[threadID].rng)/(double)UINT32_MAX);
            }
        }
    }
    return 0;
}

@Victor I replaced `time(NULL)` with a `timeForSeed()` function to use `std::chrono`, but it does not seem to change anything (performance or result of the sum).

Hunken