TL;DR: compiler optimizations and hyper-threading play a huge role in the observed effect, and frequency scaling can impact the scalability too. In fact, the provided results are not sufficient evidence to claim that false sharing is the main issue.
Compiler optimizations
First of all, optimizations have a huge impact on this benchmark since they can prevent any false-sharing effect. Indeed, with `-O1`, GCC 12 is able to keep many variables in registers (but not `sum`). With `-O2` and `-O3`, GCC 12 is able to keep the `sum` array entirely in registers, so no false-sharing effect can be observed. This is why optimizations must be disabled so as not to bias this benchmark. Alternatively, one can use the `volatile` keyword to prevent the compiler from optimizing the memory accesses (and thus keep optimizations enabled).
Here is the assembly code of the hot loop at `-O0` with GCC 12.1:
.L8:
mov eax, DWORD PTR [rbp-4]
movsx rdx, eax
mov rax, QWORD PTR num_steps[rip]
cmp rdx, rax
jge .L11
pxor xmm1, xmm1
cvtsi2sd xmm1, DWORD PTR [rbp-4]
movsd xmm0, QWORD PTR .LC6[rip]
addsd xmm1, xmm0
movsd xmm0, QWORD PTR step[rip]
mulsd xmm0, xmm1
movsd QWORD PTR [rbp-24], xmm0
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax]
mov edx, DWORD PTR [rbp-8]
movsx rdx, edx
sal rdx, 6
add rax, rdx
movsd xmm1, QWORD PTR [rax]
movsd xmm0, QWORD PTR [rbp-24]
movapd xmm2, xmm0
mulsd xmm2, xmm0
movsd xmm0, QWORD PTR .LC1[rip]
addsd xmm2, xmm0
movsd xmm0, QWORD PTR .LC7[rip]
divsd xmm0, xmm2
addsd xmm0, xmm1
mov rax, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rax]
mov edx, DWORD PTR [rbp-8]
movsx rdx, edx
sal rdx, 6
add rax, rdx
movsd QWORD PTR [rax], xmm0
mov eax, DWORD PTR [rbp-12]
add DWORD PTR [rbp-4], eax
jmp .L8
Here is the same code compiled by the same compiler at `-O1`:
.L4:
pxor xmm0, xmm0
cvtsi2sd xmm0, edx
addsd xmm0, xmm4
mulsd xmm0, QWORD PTR step[rip]
mulsd xmm0, xmm0
addsd xmm0, xmm3
movapd xmm1, xmm2
divsd xmm1, xmm0
addsd xmm1, QWORD PTR [rcx]
movsd QWORD PTR [rcx], xmm1
add edx, eax
cmp edx, 99999999
jle .L4
Here is the same code compiled by the same compiler at `-O2`:
.L4:
pxor xmm0, xmm0
movapd xmm2, xmm3
cvtsi2sd xmm0, edx
add edx, eax
addsd xmm0, xmm5
mulsd xmm0, xmm6
mulsd xmm0, xmm0
addsd xmm0, xmm4
divsd xmm2, xmm0
addsd xmm1, xmm2
cmp edx, 99999999
jle .L4
One can see that no load/store operations are used in the hot computing loop at `-O2` with GCC 12. This can also be seen on Godbolt. Results may change from one version of GCC to another.
Hyper-threading
Regarding the effect of threads on performance, I am not able to reproduce the problem on my i5-9600KF processor: I see no significant effect of false sharing. More precisely, the value of `timepass` is about 5.5x~5.6x smaller with 6 threads (on 6 cores), which is very good (see later). This processor has the same architecture as your i7-8750H: both are Intel Coffee Lake processors (though mine is a "Refresh"). Thus, the behaviour of the cores should be exactly the same on this benchmark. The layout of the cores might change, but the two processors have the same number of cores (6) and AFAIK there is no change in the core layout between the two (at least based on the information provided by Intel). The major difference is that i7 processors have Hyper-Threading while i5 processors do not. This is certainly why the results are so different on your processor. In fact, your results are very unstable even when the same number of threads is used with the same `PAD` value, which means that the execution context plays a huge role in the performance results. I think two threads are sometimes bound to the same core, resulting in a much slower execution: up to 2 times slower in the worst case (threads of the same core can share only a part of the core's resources).
To check this hypothesis, you need to force each thread to be bound to a different core. This can be done using the `OMP_PROC_BIND` and `OMP_PLACES` environment variables. You can use the `hwloc-ls` and `hwloc-ps` tools to check the actual layout of the logical cores and the binding of the application threads to logical/physical cores. `hwloc-calc` can be used to script the binding.
In practice, you can use the following Bash script to run your program with a better thread binding:
# Bind each thread to the logical core 0 of each physical core
export OMP_PROC_BIND=TRUE
export OMP_PLACES={$(hwloc-calc --li --po -I PU CORE:all.PU:0 --sep "},{")}
export OMP_DISPLAY_ENV=TRUE
./your_program
Frequency scaling
Note that Intel processors use a frequency-scaling method to adapt the frequency to the number of working threads and to what they actually do (e.g. using wide SIMD instructions like AVX ones causes a lower frequency to be used). Intel does this so that the overall processor package does not exceed its power budget (to reduce power and thermal issues). For example, on my processor, 1 core operates at 4.5 GHz while 6 cores operate at 4.3 GHz in practice on your benchmark. This impacts the scalability of your code a bit, since using more cores makes each of them run a bit slower. AFAIK, this is especially true on energy-efficient processors like yours. Indeed, the `H` class means "high-performance optimized for mobile", and such processors have more thermal limitations than high-performance desktop processors like mine. Additionally, I have a "Refresh" Coffee Lake architecture, which also impacts the thermal throttling of the processor (they are better at it than non-"Refresh" processors like yours). To quote Wikipedia:
On October 8, 2018, Intel announced what it branded its ninth generation of Core processors, the Coffee Lake Refresh family. To avoid running into thermal problems at high clock speeds, Intel soldered the integrated heat spreader (IHS) to the CPU die instead of using thermal paste as on the Coffee Lake processors.
Still, I expect the effect of thermal throttling to be relatively small and not the main issue, though it plays a role in the resulting scalability.
Better benchmarking with performance counters
Since the timing can be affected by effects other than false sharing, it is wise to take a more scientific approach than simply analysing the execution time and guessing the probable cause. More specifically, if false sharing is responsible for most of the time, the cache should be impacted: a cache-line bouncing effect should be visible. x86-64 processors have hardware performance counters to monitor such an effect.
This requires a good understanding of a cache-coherence protocol like MESI or MOESI. I expect the number of Request For Ownership (RFO) operations between cores to sharply increase when some false sharing is happening. This metric can be observed using `perf` on Linux (or Intel VTune). I think the hardware counter `l2_rqsts.all_rfo` should be the right one to check this effect on your processor. On my machine, I confirm the metric is >10 times bigger when there are false-sharing issues (e.g. when the pad is small and the program scales poorly).