
I am trying to introduce OpenMP into my C++ code to improve performance, using the simple case shown below:

#include <omp.h>
#include <chrono>
#include <iostream>
#include <cmath>

using std::cout;
using std::endl;

#define NUM 100000

int main()
{
    double data[NUM] __attribute__ ((aligned (128)));

    #ifdef _OPENMP
        auto t1 = omp_get_wtime();
    #else
        auto t1 = std::chrono::steady_clock::now();
    #endif

    for(long int k=0; k<100000; ++k)
    {

        #pragma omp parallel for schedule(static, 16) num_threads(4)
        for(long int i=0; i<NUM; ++i)
        {
            data[i] = cos(sin(i*i + k*k));
        }
    }

    #ifdef _OPENMP
        auto t2 = omp_get_wtime();
        auto duration = t2 - t1;
        cout<<"OpenMP Elapsed time (second): "<<duration<<endl;
    #else
        auto t2 = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        cout<<"No OpenMP Elapsed time (second): "<<duration/1e6<<endl;
    #endif

    double tempsum = 0.;
    for(long int i=0; i<NUM; ++i)
    {
        int nextind = (i == 0 ? 0 : i-1);
        tempsum += i + sin(data[i]) + cos(data[nextind]);
    }
    cout<<"Raw data sum: "<<tempsum<<endl;
    return 0;    
}

The inner loop accesses the data array (NUM doubles) in a tight loop and updates its elements in either a parallel or a non-parallel way.

Build as

g++ -o test test.cpp 

or

g++ -o test test.cpp -fopenmp

The program reported the following results:

No OpenMP Elapsed time (second): 427.44
Raw data sum: 5.00009e+09

OpenMP Elapsed time (second): 113.017
Raw data sum: 5.00009e+09

Intel 10th-gen CPU, Ubuntu 18.04, GCC 7.5, OpenMP 4.5.

I suspect that false sharing within a cache line leads to the poor performance of the OpenMP version of the code.
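For context, here is a minimal sketch (not my actual code, and assuming 64-byte cache lines) of the pattern I have in mind: four threads writing to adjacent doubles that share a cache line, next to a padded variant where each thread owns its own line.

// Illustration only: adjacent writes share one cache line (false sharing),
// while the alignas(64) counters each occupy their own line (assuming 64-byte lines).
#include <omp.h>
#include <iostream>

struct alignas(64) PaddedCounter { double value = 0.0; };

int main()
{
    double shared[4] = {};      // adjacent elements, likely on a single cache line
    PaddedCounter padded[4];    // one cache line per element

    #pragma omp parallel num_threads(4)
    {
        const int t = omp_get_thread_num();
        for (long i = 0; i < 10000000; ++i)
        {
            shared[t] += 1.0;          // each write invalidates the neighbours' copy of the line
            padded[t].value += 1.0;    // no invalidation traffic between threads
        }
    }

    double check = 0.0;
    for (int t = 0; t < 4; ++t) check += shared[t] + padded[t].value;
    std::cout << check << std::endl;   // use the results so the loops are not optimized away
    return 0;
}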

Update: I have updated the test results above after increasing the loop size; the OpenMP version now runs faster, as expected.

Thank you!

Mangoccc
  • how did you compile? Optimizations turned on? – 463035818_is_not_an_ai Mar 28 '22 at 10:40
  • `0.5 / 10000` is almost nothing. Try to increase the workload. Almost nothing + overhead distributed on 10 cores is almost nothing + a lot more overhead. – 463035818_is_not_an_ai Mar 28 '22 at 10:42
  • I did not set -O3 or -O0; the optimization settings were kept at the default. – Mangoccc Mar 28 '22 at 10:44
  • default optimization is no optimization. Without optimizations it can be arbitrarily slow. If you care about runtime, always turn on optimizations – 463035818_is_not_an_ai Mar 28 '22 at 10:45
  • It seems that -O3 optimization does not help much. The difference between -O0 and -O3 under OpenMP is 0.1 s. You are right, maybe the workload does not cover the additional overhead. – Mangoccc Mar 28 '22 at 10:50
  • I have 2 comments: 1) If you remove `schedule(static, 16)` the overheads will be smaller, 2) not only the workload is very small, but the code is memory bound. – Laci Mar 28 '22 at 12:00
  • Does memory-bound mean that fetching and flushing memory data is time-consuming, especially for a CPU with more than one core (implying potential CPU-CPU communication)? – Mangoccc Mar 28 '22 at 12:36
  • Does this answer your question? [How to generate random numbers in parallel?](https://stackoverflow.com/questions/4287531/how-to-generate-random-numbers-in-parallel) – paleonix Mar 28 '22 at 13:06
  • Your performance problems might have nothing to do with false sharing, but with using `rand` in parallel. – paleonix Mar 28 '22 at 13:07
  • rand() just mimics a complex calculation process. This case is a minimal example; its prototype involves a very complex simulation, which I simplified to a rand() call. I did not realize that rand() might not work well in multi-threaded code. – Mangoccc Mar 28 '22 at 13:15

2 Answers

  1. Since you're writing C++, use the C++ <random> generators (one engine per thread) instead of the legacy C rand() you're using, which is not thread-safe (a short sketch follows this list).
  2. Also, you're not using your data array, so the compiler is actually at liberty to remove your loop completely.
  3. You should touch all your data once before the timed loop. That way you ensure that the pages are instantiated and the data is in (or out of) cache, depending on what you want to measure.
  4. Your loop is pretty short.
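As a minimal sketch of points 1–3, assuming the same array size as the question and a build with -fopenmp (the seeds are illustrative):

// Sketch of points 1-3: per-thread <random> engines, a warm-up pass before
// timing, and a final sum so the compiler cannot discard the work.
#include <omp.h>
#include <random>
#include <iostream>

#define NUM 100000

int main()
{
    static double data[NUM];

    for (long i = 0; i < NUM; ++i) data[i] = 0.0;   // touch the pages once before timing

    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        // one generator per thread, seeded differently, instead of rand()
        std::mt19937 gen(12345 + omp_get_thread_num());
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        #pragma omp for schedule(static)
        for (long i = 0; i < NUM; ++i)
            data[i] = dist(gen);
    }
    double t2 = omp_get_wtime();

    double sum = 0.0;
    for (long i = 0; i < NUM; ++i) sum += data[i];
    std::cout << "time " << (t2 - t1) << " s, sum " << sum << "\n";
    return 0;
}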
Victor Eijkhout
  1. rand() is not thread-safe (see here). Use an array of C++ random-number generators instead, one for each thread; see std::uniform_int_distribution for details (there is a sketch after this list).
  2. You can drop the #ifdef _OPENMP variations from your code. In a Bash terminal, you can call your application as OMP_NUM_THREADS=1 ./test. See here for details.
  3. You can then remove num_threads(4) as well, because the amount of parallelism is specified explicitly at run time.
  4. Use Google Benchmark or command-line parameters so you can parameterize the number of threads and the array size.
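A rough sketch of points 1–4 combined (the binary name ./test, the seeds, and the use of std::uniform_real_distribution, since the data array holds doubles, are my own illustrative choices):

// Sketch: array size from the command line, thread count from OMP_NUM_THREADS,
// and one generator per thread stored in an array indexed by the thread id.
#include <omp.h>
#include <random>
#include <vector>
#include <iostream>
#include <cstdlib>

int main(int argc, char** argv)
{
    const long n = (argc > 1) ? std::atol(argv[1]) : 100000;   // parameterized array size
    const int  nthreads = omp_get_max_threads();               // set via OMP_NUM_THREADS

    std::vector<std::mt19937> engines;                         // one engine per thread
    for (int t = 0; t < nthreads; ++t)
        engines.emplace_back(1000 + t);

    std::vector<double> data(n);

    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        std::mt19937& gen = engines[omp_get_thread_num()];
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        #pragma omp for schedule(static)
        for (long i = 0; i < n; ++i)
            data[i] = dist(gen);
    }
    double t2 = omp_get_wtime();

    double sum = 0.0;                                          // use the data so it is not optimized away
    for (long i = 0; i < n; ++i) sum += data[i];
    std::cout << nthreads << " thread(s), n=" << n << ": " << (t2 - t1) << " s, sum " << sum << "\n";
    return 0;
}

Run it as, for example, OMP_NUM_THREADS=1 ./test 100000 and then OMP_NUM_THREADS=4 ./test 100000 to compare.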

From here, I expect you will see:

  1. The performance when you call OMP_NUM_THREADS=1 ./test is close to your non-OpenMP version.
  2. The array of C++ random-number generators is faster than calling rand() from multiple threads.
  3. The multi-threaded version is still slower than the single-threaded version when using a 10,000-element array.
Daniel Dearlove
  • Thank you for your suggestions. Referring to point 3, you said that OpenMP is slower than the single-threaded version with 10,000 elements. I just tested it on my PC: the OpenMP version takes about 10.69 s compared to 41.503 s for a single thread. The data length is 10,000 and the outer loop k ends at 100,000. – Mangoccc Mar 28 '22 at 16:16
  • I see you changed the "work" to `cos(sin(i*i+ k*k))` which will be independent calculations and, to me, looks like it is more effort than `rand()`. You also increased the loop length to 100,000 items which means the calculation should dominate over the thread setup/teardown costs. If, like your original code, your multi-threaded test used 4 threads then 10.69s for multiple threads versus 41.503s for a single thread sounds correct. Therefore, point 3 is no longer relevant. – Daniel Dearlove Mar 29 '22 at 07:12