#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000; double step;
#define PAD 8
#define NUM_THREADS 6
void main(){
int i, nthreads; double pi=0, sum[NUM_THREADS][PAD]={0};
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);

//Starting Timer
double time_start = omp_get_wtime();

#pragma omp parallel
{
    int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if(id==0) nthreads = nthrds;
    for(i=id;i<num_steps;i=i+nthrds){
        x = (i+0.5)*step;
        sum[id][0] += 4.0/(1.0+x*x);
    }
}
for(i=0; i<nthreads; i++)pi +=sum[i][0]*step;

//Ending Timer
double time_end = omp_get_wtime();

double timepass = time_end-time_start;

//New Run, how many threads
printf("Integration Program runs with %d threads\n", nthreads);

//Print Result of Integral
printf("Integration Result: %lf\n", pi);

//Print How much Time has passed
printf("%lf Time passed for Integration...\n", timepass);

//Print Effective Time
printf("Effective Total Time: %lf\n\n", timepass*nthreads);
}

This snippet of code is taken from an OpenMP tutorial by Tim Mattson. The code integrates the function 4.0/(1+x*x) but holds each partial result in a 2D array named sum. I use a Linux machine and have checked that I have the standard 64-byte cache lines on L1, L2, and L3. I compiled with gcc, no optimizations, and was expecting the runtime to decrease as I added threads. This is what I got for the runtime:

1 threads: 0.356362
2 threads: 0.541903
3 threads: 0.416097
4 threads: 0.346139
5 threads: 0.286879
6 threads: 0.315139

It seems that false sharing still occurs even with the padding, and I am confused why. I have changed the padding to larger sizes and the scalability is similarly poor. The only thing that seems to fix the poor scalability is turning on compiler optimizations; even just -O1 makes the code scale well. I am not sure why this is the case, though.

  • Your problem is that the `parallel` construct creates 6 threads, but all six threads are executing the same code/work. Please have a look at the `for` construct that will distribute the work of the loop in your code amongst the threads. – Michael Klemm May 16 '22 at 19:39
  • Why are you doing any performance measurements at all on un-optimized code? – John Bollinger May 16 '22 at 19:39
  • @MichaelKlemm The OP already splits the loop iterations among the threads manually – dreamcrash May 16 '22 at 20:54
  • The work is properly distributed between threads. In fact, the code scales well on my machine (i5-9600KF processor) and I cannot reproduce the problem (on Linux with GCC 11.2.0, from `-O0` to `-O3`). The same holds on Windows with GCC 9, so I cannot reproduce your problem. What exact processor do you use? (see `/proc/cpuinfo`). What time is reported? I guess it is `timepass`. – Jérôme Richard May 16 '22 at 21:03
  • @JérômeRichard if you say you can not reproduce the problem, does that mean that you do see false sharing abatement? – Victor Eijkhout May 16 '22 at 21:20
  • @VictorEijkhout I see no significant effect of false sharing: the value of `timepass` is about 5.5x~5.6x smaller with 6 threads (on 6 cores). This is very good since frequency scaling bounds the optimal speedup to about 5.7x. – Jérôme Richard May 16 '22 at 21:31
  • @JérômeRichard That's what I expected. Processors these days are very good at preventing the effects of false sharing by keeping separate accumulators that are only written back when strictly necessary. – Victor Eijkhout May 16 '22 at 21:37
  • @VictorEijkhout I am not aware of any optimizations preventing false sharing on x86-64 processors. But there should be no false sharing, as the OP pointed out, since `PAD` is set to 8 so each row of `sum` takes 64 bytes (i.e. 1 cache line). Each thread should operate on a different cache line. In fact, I can see a false sharing effect with `PAD=4` (x3.5 slower) and not with `PAD=8`, which is expected. But the OP's platform appears to behave unexpectedly. – Jérôme Richard May 16 '22 at 22:13
  • Thank you all for answering. I will proceed in order: @MichaelKlemm, I believe that, as others have pointed out, the work is distributed correctly; please do note that the iterations advance by ```nthrds``` on each loop iteration. @JohnBollinger, I did not plan on any optimisations, but, as I have stated, that is the ONLY WAY I have observed to avoid the false sharing problem. I too would like the code to run well without any optimisations – GingerGengar123 May 16 '22 at 23:49
  • @JérômeRichard Thank you for that comment. I checked the contents of ```/proc/cpuinfo``` and I saw that I have an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, cpu cores : 6, clflush size : 64, cache_alignment : 64, cpu MHz : 2200.000 – GingerGengar123 May 16 '22 at 23:52

2 Answers


I wonder if the story about false sharing needs to be revisited. I've adapted the code to

#ifndef PAD
#define PAD 8
#endif

#ifndef NTHREADS
#define NTHREADS 6
#endif

void main(){
  int i, nthreads; double pi=0, sum[NTHREADS][PAD]={0};
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NTHREADS);

also:

  printf("Integration Program runs with %d threads, padding=%d\n", nthreads,PAD);

so that I can run a quick shell loop:

for p in 1 2 3 4 5 6 7 8 ; do
    ## compile with -DPAD=$p -DNTHREADS=whatever
done

and this is what I get:

Integration Program runs with 56 threads, padding=1
Integration Result: 3.141593
0.006488 Time passed for Integration...
Effective Total Time: 0.363319

Integration Program runs with 56 threads, padding=2
Integration Result: 3.141593
0.006484 Time passed for Integration...
Effective Total Time: 0.363106

Integration Program runs with 56 threads, padding=3
Integration Result: 3.141593
0.006213 Time passed for Integration...
Effective Total Time: 0.347925

Integration Program runs with 56 threads, padding=4
Integration Result: 3.141593
0.006125 Time passed for Integration...
Effective Total Time: 0.342999

Integration Program runs with 56 threads, padding=5
Integration Result: 3.141593
0.006641 Time passed for Integration...
Effective Total Time: 0.371904

Integration Program runs with 56 threads, padding=6
Integration Result: 3.141593
0.006988 Time passed for Integration...
Effective Total Time: 0.391317

Integration Program runs with 56 threads, padding=7
Integration Result: 3.141593
0.006617 Time passed for Integration...
Effective Total Time: 0.370556

Integration Program runs with 56 threads, padding=8
Integration Result: 3.141593
0.006138 Time passed for Integration...
Effective Total Time: 0.343719

In other words: with modern processors false sharing is no longer a problem. The processor keeps a separate accumulator on each core and does not write to the falsely shared locations until it's absolutely necessary.

EDIT: Since there was a suggestion that this only works because of the static bounds, I've made a version of the code with

#define TPINDEX(t,p) ((t)*PAD+(p))

void main(){
  //  int i, nthreads;
  omp_set_num_threads(NTHREADS);
  double pi=0,
    *sum = (double*) malloc( NTHREADS*PAD*sizeof(double) );
#pragma omp parallel for
  for (int t=0; t<NTHREADS; t++)
    for (int p=0; p<PAD; p++)
      sum[ TPINDEX(t,p) ] = 0;

and

  int nthreads;
#pragma omp parallel
  {
    int id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if(id==0) nthreads = nthrds;
    for(int i=id;i<num_steps;i=i+nthrds){
      x = (i+0.5)*step;
      sum[ TPINDEX(id,0) ] += 4.0/(1.0+x*x);
    }
  }
  for (int i=0; i<nthreads; i++)
    pi += sum[ TPINDEX(i,0) ]*step;

and I get basically the same:

[c202-001 c:7] make run_mattmal ECHO=0
Integration Program runs with 56 threads, padding=1
Integration Result: 3.141593
0.001773 Time passed for Integration...
Effective Total Time: 0.099295

Integration Program runs with 56 threads, padding=2
Integration Result: 3.141593
0.001569 Time passed for Integration...
Effective Total Time: 0.087866

Integration Program runs with 56 threads, padding=3
Integration Result: 3.141593
0.002002 Time passed for Integration...
Effective Total Time: 0.112112

Integration Program runs with 56 threads, padding=4
Integration Result: 3.141593
0.001569 Time passed for Integration...
Effective Total Time: 0.087852

Integration Program runs with 56 threads, padding=5
Integration Result: 3.141593
0.001550 Time passed for Integration...
Effective Total Time: 0.086798

Integration Program runs with 56 threads, padding=6
Integration Result: 3.141593
0.001598 Time passed for Integration...
Effective Total Time: 0.089481

Integration Program runs with 56 threads, padding=7
Integration Result: 3.141593
0.001582 Time passed for Integration...
Effective Total Time: 0.088587

Integration Program runs with 56 threads, padding=8
Integration Result: 3.141593
0.001573 Time passed for Integration...
Effective Total Time: 0.088093
Victor Eijkhout
  • Looking at the assembly code, it looks like the compiler is able to figure this out for this particular case. If I change the array to be dynamically allocated, the compiler no longer accumulates in a register, but spills to memory. – Michael Klemm May 17 '22 at 07:02
  • "*with modern processors false sharing is no longer a problem*" is I think completely wrong. It is a problem on my (2-year-old) machine when `PAD` is small. I can also reproduce the problem on other (recent) machines too. Intel processors work at the 64-byte cache-line granularity and I found nothing about such an optimization on more recent processors in the Intel documentation. Without cache-line splitting, and assuming the register is written back to memory (eg. in `-O0`), processors *have to* share full cache lines and false sharing cannot be avoided due to the cache coherence protocol. – Jérôme Richard May 17 '22 at 12:30
  • Still, the provided results are very interesting (and surprising) assuming the compiler optimizations did not cause the program to operate in registers. Can you please provide the optimization flags used to build the program, the version of GCC and the target processor used for the experiment? – Jérôme Richard May 17 '22 at 12:37
  • @JérômeRichard "assuming the register is written back in memory" That's a big assumption. In this particular code it's pretty easy for the compiler to figure out that there is no such need. My flags were straight `-O2 -qopenmp`, compiler is Intel 2019, processor is dual 28-core skylake. – Victor Eijkhout May 17 '22 at 19:15
  • @VictorEijkhout Ok. Thank you. Thus, with `-O2`, ICC puts the array in registers (and vectorizes the code), so that there is no memory access and thus no false sharing at all. In the end, it does not mean false sharing is "no longer a problem", just that there is no need to use any padding with most optimizing compilers in this case. False sharing is still an issue in general. Skylake & Coffee Lake processors do not optimize that because they operate on full cache lines and the cache coherence protocol used prevents it (AFAIK none of the protocols used in mainstream processors support that). – Jérôme Richard May 17 '22 at 20:31
  • @JérômeRichard How do you read assembly? Where can I learn to see that C code has been "unrolled" and "vectorized" by the compiler? I also want to learn about how memory works, the registers, and so on. Where can I read more about this? Thank you. – GingerGengar123 May 18 '22 at 01:40
  • @GingerGengar123 These days many people use "godbolt", the Compiler Explorer as an easy way to generate & read assembly. Or you could tell your compiler to spit out the assembly. Often there are comments indicating what source line corresponds to what assembly line. Unrolling is usually easy to detect because some crucial instruction (for instance a multiply) appears multiple times. – Victor Eijkhout May 18 '22 at 01:52
  • @JérômeRichard Dynamic allocation added. Btw, it was an Ice Lake. Let me try a Sky Lake. (I have too many of those lakes!) – Victor Eijkhout May 18 '22 at 12:10
  • @MichaelKlemm I've tested dynamic allocation and again padding is unnecessary. Which compiler were you using? Note that gcc is in general very bad at OpenMP. That said, I only get 40% more run time, but padding still doesn't matter. – Victor Eijkhout May 18 '22 at 12:17
  • @JérômeRichard Sky lake (dual 24 core, same `-O2`) same story. This time gcc is even worse: double runtime of Intel. But I see no effect of padding. – Victor Eijkhout May 18 '22 at 12:29
  • @VictorEijkhout The result using `-O2` (on GCC/Clang) is independent of the processor architecture unless you use `-march=native` or something similar (not mentioned). Thus, it is normal that you get the same results (ie. no false sharing) due to the register-based optimization. If you use `-O0`, then I expect a poor scalability on Skylake and Icelake. Alternatively, you could use a `volatile` sum array (which should give the same result independently of compilers and optimizations). Note that GCC is slower because it does not use SIMD instructions (but this is an independent problem). – Jérôme Richard May 18 '22 at 21:41

TL;DR: compiler optimizations and hyper-threading play a huge role in the observed effect. Frequency scaling can impact the scalability too. In fact, the provided results are not sufficient evidence to claim that false sharing is the main issue.


Compiler optimizations

First of all, optimizations have a huge impact on the benchmark since they prevent any false sharing effect from being observed. Indeed, with -O1, GCC 12 is able to keep many variables in registers (but not sum). With -O2 and -O3, GCC 12 is able to keep the sum array entirely in registers, so no false sharing effect can be seen at all. This is why optimizations must be disabled so as not to introduce any bias in this benchmark. Alternatively, one can use the volatile keyword to prevent the compiler from optimizing the memory accesses away (and thus still be able to enable optimizations).
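
As an illustration, here is a minimal sketch of the volatile approach (adapted from the question's code, not the code used for the assembly listings below): marking sum as volatile forces the compiler to emit a load and a store for every update, so the accumulator stays in memory even at -O2 and false-sharing effects should become observable again when the padding is too small.

#include <stdio.h>
#include <omp.h>

#define PAD 8                    /* 8 doubles = 64 bytes = 1 cache line */
#define NUM_THREADS 6

static long num_steps = 100000000;

int main(void){
    double step = 1.0/(double)num_steps, pi = 0.0;
    /* volatile: every "+=" below really goes through memory, not a register */
    static volatile double sum[NUM_THREADS][PAD];
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        for (long i = id; i < num_steps; i += nthrds){
            double x = (i + 0.5) * step;
            sum[id][0] += 4.0/(1.0 + x*x);   /* memory access on every iteration */
        }
    }
    for (int i = 0; i < NUM_THREADS; i++) pi += sum[i][0] * step;
    printf("pi = %f\n", pi);
    return 0;
}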

Here is the assembly code of the hot loop in -O0 with GCC 12.1:

.L8:
        mov     eax, DWORD PTR [rbp-4]
        movsx   rdx, eax
        mov     rax, QWORD PTR num_steps[rip]
        cmp     rdx, rax
        jge     .L11
        pxor    xmm1, xmm1
        cvtsi2sd        xmm1, DWORD PTR [rbp-4]
        movsd   xmm0, QWORD PTR .LC6[rip]
        addsd   xmm1, xmm0
        movsd   xmm0, QWORD PTR step[rip]
        mulsd   xmm0, xmm1
        movsd   QWORD PTR [rbp-24], xmm0
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax]
        mov     edx, DWORD PTR [rbp-8]
        movsx   rdx, edx
        sal     rdx, 6
        add     rax, rdx
        movsd   xmm1, QWORD PTR [rax]
        movsd   xmm0, QWORD PTR [rbp-24]
        movapd  xmm2, xmm0
        mulsd   xmm2, xmm0
        movsd   xmm0, QWORD PTR .LC1[rip]
        addsd   xmm2, xmm0
        movsd   xmm0, QWORD PTR .LC7[rip]
        divsd   xmm0, xmm2
        addsd   xmm0, xmm1
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax]
        mov     edx, DWORD PTR [rbp-8]
        movsx   rdx, edx
        sal     rdx, 6
        add     rax, rdx
        movsd   QWORD PTR [rax], xmm0
        mov     eax, DWORD PTR [rbp-12]
        add     DWORD PTR [rbp-4], eax
        jmp     .L8

Here is the same code and the same compiler with -O1:

.L4:
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, edx
        addsd   xmm0, xmm4
        mulsd   xmm0, QWORD PTR step[rip]
        mulsd   xmm0, xmm0
        addsd   xmm0, xmm3
        movapd  xmm1, xmm2
        divsd   xmm1, xmm0
        addsd   xmm1, QWORD PTR [rcx]
        movsd   QWORD PTR [rcx], xmm1
        add     edx, eax
        cmp     edx, 99999999
        jle     .L4

Here is the same code and the same compiler with -O2:

.L4:
        pxor    xmm0, xmm0
        movapd  xmm2, xmm3
        cvtsi2sd        xmm0, edx
        add     edx, eax
        addsd   xmm0, xmm5
        mulsd   xmm0, xmm6
        mulsd   xmm0, xmm0
        addsd   xmm0, xmm4
        divsd   xmm2, xmm0
        addsd   xmm1, xmm2
        cmp     edx, 99999999
        jle     .L4

One can see that no load/store operations are used with -O2 in the hot computing loop with GCC 12. This can also be seen on Godbolt. Results may change from one version of GCC to another.
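
To reproduce this locally rather than on Godbolt, you can ask GCC to emit the assembly directly (pi.c is just a placeholder name for the source file; the output lands in pi.s):

# Emit Intel-syntax assembly instead of an object file
gcc -O2 -fopenmp -S -masm=intel pi.c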


Hyper-threading

Regarding the effect of threads on performance, I am not able to reproduce the problem on my i5-9600KF processor: I see no significant effect of false sharing. More precisely, the value of timepass is about 5.5x~5.6x smaller with 6 threads (on 6 cores), which is very good (see the frequency-scaling section below). This processor has the same architecture as your i7-8750H: both are Intel Coffee Lake processors (though mine is a "Refresh"). Thus, the behaviour of the cores should be exactly the same on this benchmark. The layout of the cores might change, but the two processors have the same number of cores (6) and AFAIK there is no change in the core layout between the two (at least based on the information provided by Intel). The major difference is that i7 processors have Hyper-Threading while i5 processors do not. This is certainly why the results are so different on your processor. In fact, your results are very unstable even when the same number of threads and the same PAD value are used, which means that the execution context plays a huge role in the performance results. I think two threads are sometimes bound to the same core, resulting in a much slower execution time: up to 2 times slower in the worst case (two threads on the same core have to share part of its resources).

To check this hypothesis, you need to force each thread to be bound to a different core. This can be done using the OMP_PROC_BIND and OMP_PLACES environment variables. You can use the hwloc-ls and hwloc-ps tools to check the layout of the logical cores and the binding of the application threads on logical/physical cores. hwloc-calc can be used to script the binding.

In practice, you can use the following Bash script to run your program with a better thread binding:

# Bind each thread to the logical core 0 of each physical core
export OMP_PROC_BIND=TRUE
export OMP_PLACES={$(hwloc-calc --li --po -I PU CORE:all.PU:0 --sep "},{")}
export OMP_DISPLAY_ENV=TRUE
./your_program

Frequency scaling

Note that Intel processors use frequency scaling to adapt the frequency to the number of working threads and to what those threads actually do (e.g. using wide SIMD instructions like AVX ones causes a lower frequency to be used). Intel does that so the overall processor package does not exceed its power budget (reducing power and thermal issues). For example, on my processor, 1 core operates at 4.5 GHz while 6 cores operate at 4.3 GHz in practice on your benchmark. This impacts the scalability of your code a bit, since using more cores makes each of them run a bit slower. AFAIK, this is especially true on energy-efficient processors like yours. Indeed, the H class means "high-performance optimized for mobile", and such processors have more thermal limitations than high-performance desktop processors like mine. Additionally, I have a "Refresh" Coffee Lake architecture, which also impacts the thermal throttling of the processor (they behave better than non-"Refresh" processors like yours). To quote Wikipedia:

On October 8, 2018, Intel announced what it branded its ninth generation of Core processors, the Coffee Lake Refresh family. To avoid running into thermal problems at high clock speeds, Intel soldered the integrated heat spreader (IHS) to the CPU die instead of using thermal paste as on the Coffee Lake processors.

Still, I expect the effect of thermal throttling to be relatively small and not the main issue, though it plays a role in the resulting scalability.


Better benchmarking with performance counters

Since the timing can be affected by effects other than false sharing, it is wise to take a more scientific approach than simply analysing the execution time and guessing the probable cause. More specifically, if false sharing is responsible for the biggest part of the time, the caches should be impacted: a cache-line bouncing effect should be visible. x86-64 processors have hardware performance counters to monitor such an effect. Interpreting them requires a good understanding of a cache coherence protocol like MESI or MOESI. I expect the number of Request For Ownership (RFO) operations between cores to increase sharply if some false sharing is happening. This metric can be read using perf on Linux (or Intel VTune). I think the hardware counter l2_rqsts.all_rfo should be the right one to check the effect on your processor. On my machine, I confirm the metric is >10 times bigger when there are false sharing issues (e.g. when PAD is small and the program scales poorly).
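
For example (assuming the binary is called ./your_program, as in the binding script above), the counter can be read with a plain perf invocation; comparing a build using NUM_THREADS 1 with one using NUM_THREADS 6 should make any cache-line bouncing show up as a sharp increase of this counter:

# Count RFO (Request For Ownership) requests seen by the L2 cache
perf stat -e l2_rqsts.all_rfo ./your_program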

Jérôme Richard
  • This is great! I agree with everything you say. Firstly, where can I learn more about the cache protocols in general? Is there a book you would recommend to explain things like RFO? – GingerGengar123 May 18 '22 at 01:18
  • I took the program and ran 2 tests. The first one is with ```PAD 8``` and ```NUM_THREADS 1```: that program ran in 7.4 seconds with 707 L2 RFOs according to perf. I ran the program again with ```PAD 8``` and ```NUM_THREADS 2```: this time it took 10.3 seconds and had almost a million times more L2 RFOs. By the way, I compiled using the latest gcc, no optimisations enabled. I also used the fix you recommended, with the binding of threads to each core, and I have a large padding of 8. Since L2 RFOs are an indicator of false sharing, I am confused why I still have false sharing. – GingerGengar123 May 18 '22 at 01:28
  • I am not sure, is there anything wrong software-wise? It seems that the poor scalability problem is hardware related because, even with the fixes you mentioned using the bash script, I am still getting unstable performance. Sometimes the code works great with linear scalability, other times not really, even though it is the same exact binary. – GingerGengar123 May 18 '22 at 01:34
  • @GingerGengar123 Download my HPC textbook at theartofhpc.com which has information about caches and other matters. – Victor Eijkhout May 18 '22 at 11:54
  • This book is indeed great for an HPC introduction (and has the benefit of being free and easily accessible). It helped me during the beginning of my PhD thesis. So thank you @VictorEijkhout. – Jérôme Richard May 18 '22 at 21:09
  • The famous [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) also has a section about cache coherence. The article is a bit old but most things are still true today, as described in [this post](https://stackoverflow.com/questions/8126311). The wikipedia articles mentioned in the above answer (and the related ones) give more details on specific cache coherence protocols. AFAIK, Intel uses a cache protocol similar to [MESIF](https://en.wikipedia.org/wiki/MESIF_protocol). – Jérôme Richard May 18 '22 at 21:20
  • As for the assembly code and RFOs, I am a bit surprised by this. Can you check with bigger PAD values like 16 or 32? AFAIK, some parts of Intel processors work at a bigger granularity. For example, Intel prefetchers on relatively recent processors systematically preload 128 bytes (not just a cache line), but I do not think this applies here, which is why I find this surprising. For the assembly code, you can use the `-S` flag of GCC to generate the assembly as far as I remember (it would be great if you could share it btw). – Jérôme Richard May 18 '22 at 21:33
  • By the way, to test the hypothesis of hyperthreading being the main culprit, I tested the algorithm on another, older machine with an Intel Apollo Lake processor. Same results: it was unstable. Sometimes linear speedup was obtained, sometimes not, even with the same exact binary, compiled with gcc and no optimizations applied. – GingerGengar123 May 19 '22 at 15:09