
I was trying to observe basic OpenMP-based parallelism with the following code:

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <time.h>

int main(){
  long i;
  long x[] = {0,0,0,0};               // one counter per thread
  omp_set_num_threads(4);
  clock_t time = clock();             // clock() measures processor time
  #pragma omp parallel for
  for(i=0;i<100000000;i++){
    x[omp_get_thread_num()]++;        // each thread increments its own counter
  }
  double time_taken = (double)(clock() - time) / CLOCKS_PER_SEC;
  printf("%ld %ld %ld %ld %lf\n",x[0],x[1],x[2],x[3],time_taken);
  return 0;
}

I am using a quad-core i5 processor. I checked 4 different thread counts and got the following results:

Set: omp_set_num_threads(1);
Out: 100000000 0 0 0 0.203921

Set: omp_set_num_threads(2);
Out: 50000000 50000000 0 0 0.826322

Set: omp_set_num_threads(3);
Out: 33333334 33333333 33333333 0 1.448936

Set: omp_set_num_threads(4);
Out: 25000000 25000000 25000000 25000000 1.919655

The x array values are accurate, but the time surprisingly increases with the number of threads. I cannot find any explanation for this behaviour. Is the omp_get_thread_num() function somehow atomic in nature? Or is there something else I am missing?

Compiling with: gcc -o test test.c -fopenmp

UPDATE

As per the suggestion in the accepted answer, I have modified the code as follows:

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

int main(){
  long i, t_id, fact = 1096;
  long x[fact*4];                     // counters spread fact longs apart
  x[0] = x[fact] = x[2*fact] = x[3*fact] = 0;
  omp_set_num_threads(4);
  double time = omp_get_wtime();      // wall-clock time instead of clock()
  #pragma omp parallel for private(t_id)
  for(i=0;i<100000000;i++){
    t_id = omp_get_thread_num();
    x[t_id*fact]++;                   // each thread now writes to its own cache line
  }
  double time_taken = omp_get_wtime() - time;
  printf("%ld %ld %ld %ld %lf\n",x[0],x[fact],x[2*fact],x[3*fact],time_taken);
  return 0;
}

Now the results make sense:

Set: omp_set_num_threads(1)
Out: 100000000 0 0 0 0.250205

Set: omp_set_num_threads(2)
Out: 50000000 50000000 0 0 0.154980

Set: omp_set_num_threads(3)
Out: 33333334 33333333 33333333 0 0.078874

Set: omp_set_num_threads(4)
Out: 25000000 25000000 25000000 25000000 0.061155

So it was indeed about the cache line size, as explained in the accepted answer. Have a look there for the details.

  • Does this answer your question? [Measure execution time in C++ OpenMP code](https://stackoverflow.com/questions/10874214/measure-execution-time-in-c-openmp-code) The second answer is likely to be more useful to you than the first. – High Performance Mark Aug 09 '20 at 11:41
  • You should also *always* compile with optimizations fully enabled if you want to benchmark execution time. – danielschemmel Aug 09 '20 at 11:46
  • @HighPerformanceMark, yes, I have also tried omp_get_wtime() and got more or less similar results. But practically one should use the latter to get accurate timings. – pritam_coder Aug 09 '20 at 12:04
  • @gha.st yes, I should use O3 optimization, but I chose not to because my test code is very small and the optimizer might simplify the for loop in unwanted ways. – pritam_coder Aug 09 '20 at 12:05
  • You are also timing thread creation, which is expensive but normally only required once, so this code won't be representative of the performance of real code which runs longer and has many parallel regions. (And... DECLARE YOUR VARIABLES IN THEIR MINIMAL SCOPE! We're not writing C in 1979 any longer...) – Jim Cownie Aug 10 '20 at 07:38
  • @JimCownie can you please elaborate on the 'MINIMAL SCOPE' point you mentioned? – pritam_coder Aug 12 '20 at 11:59
  • You declare the variables "i", and "t_id" at the top level (the outermost scope) as soon as you enter the function, but they are only used inside the parallel region and want to be thread private. If you declare them in the smallest scope in which they are referenced (i.e. inside the parallel region and inside the loop), there is no need to have a private statement on the OMP pragma (which you can forget...). (Admittedly "i" is handled silently by OpenMP rules, but it is still always cleaner to give variables the minimum scope they need.). HTH – Jim Cownie Aug 13 '20 at 12:18
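To illustrate the comment above, here is a minimal sketch of the updated code with the temporaries declared in their smallest scope, which makes the private clause unnecessary (the behaviour is otherwise the same as the updated code above):

#include <stdio.h>
#include <omp.h>

int main(){
  long fact = 1096;
  long x[fact*4];
  x[0] = x[fact] = x[2*fact] = x[3*fact] = 0;
  omp_set_num_threads(4);
  double time = omp_get_wtime();
  #pragma omp parallel for
  for(long i = 0; i < 100000000; i++){   // i declared in the loop header: private by the OpenMP rules
    long t_id = omp_get_thread_num();    // t_id declared inside the loop body: private by scope
    x[t_id*fact]++;
  }
  double time_taken = omp_get_wtime() - time;
  printf("%ld %ld %ld %ld %lf\n", x[0], x[fact], x[2*fact], x[3*fact], time_taken);
  return 0;
}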

2 Answers


Note that the 4 counters you are operating on lie very close together, probably on one cache line. Since cache lines are loaded into the CPU cache in one go, each thread needs to ensure that it has the latest version of that cache line. Since all threads want to modify (and not just read) that one cache line, they are constantly invalidating one another's copy. Welcome to false sharing!

To solve this problem, ensure that the integers are (physically) far enough apart from one another, e.g., by allocating structures that fill (at least) one full cache line for each thread to work with.
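A minimal sketch of that padding approach (assuming a 64-byte cache line, which is typical on x86 but not guaranteed everywhere) could look like this:

#include <stdio.h>
#include <omp.h>

// Assumed cache line size; 64 bytes is typical on x86, but check your CPU.
#define CACHE_LINE 64

// Each counter occupies (at least) one full cache line, so the threads
// no longer invalidate each other's lines when incrementing their counter.
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

int main(void){
    struct padded_counter x[4] = {{0}};
    omp_set_num_threads(4);
    double start = omp_get_wtime();
    #pragma omp parallel for
    for(long i = 0; i < 100000000; i++){
        x[omp_get_thread_num()].value++;
    }
    printf("%ld %ld %ld %ld %lf\n",
           x[0].value, x[1].value, x[2].value, x[3].value,
           omp_get_wtime() - start);
    return 0;
}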

When executing your sample program using 4 threads on one of my machines, I got the following result:

25000000 25000000 25000000 25000000 5.049694

When modifying the program so that the array has 4096 elements, and using elements 0, 1024, 2048 and 3072 (which ensures enough distance), the program runs a lot faster:

25000000 25000000 25000000 25000000 1.617231

Note that although you are counting the processor time used by the whole process, without false sharing the time should not increase significantly with more threads, but rather stay more or less constant (there is some additional locking involved, but it should not usually be on the order of a 10x increase). In fact, the performance gain shown above also translates into wall-clock time (~1.25 seconds down to ~500 ms).
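To see both effects separately, one can time the loop with both clock() (processor time accumulated by the process) and omp_get_wtime() (wall-clock time). A minimal sketch, using a loop body without false sharing:

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main(void){
    clock_t c0 = clock();            // CPU time, accumulated over all threads
    double  w0 = omp_get_wtime();    // wall-clock time

    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for(long i = 0; i < 100000000; i++){
        sum++;
    }

    double cpu_s  = (double)(clock() - c0) / CLOCKS_PER_SEC;
    double wall_s = omp_get_wtime() - w0;
    // With good scaling, cpu_s stays roughly constant while wall_s drops
    // as threads are added.
    printf("sum=%ld cpu=%lf wall=%lf\n", sum, cpu_s, wall_s);
    return 0;
}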

danielschemmel
  • Thanks. I will update the question with the new output, as I got similar results to yours. So it is all about the cache line size in the end. – pritam_coder Aug 09 '20 at 12:06
  • Is there any way to know the cache line size of a processor in a Windows or Linux environment? I am using an i5-6500 processor, but the official Intel page does not describe the cache layout; it only says that the L2 cache is 6MB. – pritam_coder Aug 09 '20 at 12:14
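Regarding the comment above: on Linux with glibc, one way to query the cache line size is sysconf; a minimal sketch (the _SC_LEVEL1_DCACHE_LINESIZE name is a glibc extension and may report 0 on some systems). The shell command getconf LEVEL1_DCACHE_LINESIZE or the file /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size gives the same information.

#include <stdio.h>
#include <unistd.h>

int main(void){
    // glibc extension: size of a level-1 data cache line in bytes (0 if unknown)
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line);
    return 0;
}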

The reasons for your observation, as noted by gha.st, are false sharing and the properties of the clock function.

For this reason, x[omp_get_thread_num()] is an anti-pattern. Sure, you can leverage your new knowledge by adding a stride in memory, but that also encodes hardware-specific properties (i.e. the cache line size) into your data structures. It can lead to nasty code that is difficult to understand and still has poor performance portability.

The idiomatic solution is to use either of the following:

  • If you are only interested in an aggregate, use a reduction clause, i.e.:

    long x = 0;
    #pragma omp parallel for reduction(+:x)
    for(long i = 0; i < 100000000; i++){
        x++;    // each thread accumulates into its own private copy of x
    }
    // the total sum is now in x
    
  • If you need individual values within the thread, just use a private variable, preferably implicitly by scope. Or, if you need particular initialization from outside the construct, use firstprivate (a short firstprivate sketch follows after this list).

    #pragma omp parallel
    {
        long local_x = 0; // implicitly private by scope!
        #pragma omp for
        for(long i = 0; i < 100000000; i++) {
            local_x++;
        }
        // can now do something with the sum of the current thread
    }
    
  • And if you need the per-thread results outside the parallel region, you can just use the second form and write each result once:

    #pragma omp parallel
    {
        long local_x = 0; // implicitly private by scope!
        #pragma omp for
        for(long i = 0; i < 100000000; i++) {
            local_x++;
        }
        x[omp_get_thread_num()] = local_x; // a single write per thread, so false sharing is negligible
    }
    

That's not to say that you never need to design a data structure with false sharing in mind, but it's not as common as you might think.
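For completeness, since firstprivate was mentioned in the second bullet but not shown: here is a minimal sketch of how it initializes each thread's private copy from the value outside the construct (the variable name offset and the value 42 are just for illustration):

long offset = 42;                      // set outside the parallel region
#pragma omp parallel firstprivate(offset)
{
    // each thread starts with its own copy of offset, initialized to 42;
    // changes made here are not visible to other threads or after the region
    offset += omp_get_thread_num();
}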

Zulan
  • Thanks for your answer. Definitely, false sharing is not that common. I was implementing a CPU-intensive multi-cluster code and OpenMP was not giving me the full potential of parallelism, which is why I tried this small code segment to understand the reason. False sharing turned out to be the villain in my case. But yes, in a normal scenario false sharing can be ignored. – pritam_coder Aug 12 '20 at 12:05