The key bottleneck is the load-use latency of cache misses while striding through `tex[]`, as part of the dependency chain from old `i` to new `i`. Lookups from the small `rand_array[]` will hit in cache and just add about 5 cycles of L1d load-use latency to the loop-carried dependency chain, making it even more unlikely for anything other than latency through `i` to be relevant to the overall speed of the loop. (Jérôme Richard's answer on your previous question discusses that latency dep chain and the random stride increment.)
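For reference, here is a minimal sketch of the kind of loops being discussed, reconstructed from the asm shown later in this answer; the names, array sizes, stride values, and init details are assumptions, not the OP's exact code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { N = 300000000, SLen = 10, RSIZE = 128 };   // sizes guessed from the asm / numbers quoted

int main(void) {
    char *tex = malloc(N + 64);                   // big array, far larger than any cache level
    int rand_array[RSIZE];                        // small table of strides, stays hot in L1d
    const char *str = "0123456789";               // 10-byte needle that essentially never matches

    srand(12345);
    for (int k = 0; k < RSIZE; k++)
        rand_array[k] = 1 + rand() % 8;           // small positive strides
    for (long k = 0; k < N + 64; k++)
        tex[k] = rand() % RSIZE;                  // small non-negative bytes, safe as indices

    // (timing of each loop omitted for brevity)
    int i = 0;
    while (i < N) {                               // loop No.1: with the "useless" memcmp
        if (memcmp(tex + i, str, SLen) == 0)
            puts("match");
        i += rand_array[tex[i]];                  // loop-carried dep chain through i
    }

    i = 0;
    while (i < N)                                 // loop No.2: plain version, same dep chain
        i += rand_array[tex[i]];
    printf("final i = %d\n", i);                  // keep the second loop from being optimized away

    free(tex);
    return 0;
}
```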
`memcmp(tex+i, str, SLen)` might be acting as prefetching if `tex+i` is near the end of a cache line. If you compiled with `-O0`, you're calling glibc memcmp, so the SSE 16-byte or AVX2 32-byte load done in glibc memcmp crosses a cache-line boundary and pulls in the next line. (IIRC, glibc memcmp checks for and avoids page crossing, but not cache-line splits, because those are relatively cheap on modern CPUs. Single-step into it to see for yourself, into the 2nd call to avoid lazy dynamic linking, or compile with `-fno-plt`.)
Possibly touching some cache lines that would have been skipped by the random stride helps HW prefetch be more aggressive and detect it as a sequential access pattern. A 32-byte AVX load is half a cache line, so it's fairly likely to cross into the next one.
So a more-predictable sequence of cache-line requests seen by the L2 cache is a plausible explanation for the memcmp version being faster. (Intel CPUs put the main prefetchers in the L2.)
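If that's what's happening, an explicit software prefetch of the next line should give a similar speedup in the plain loop. Here's a minimal sketch of that experiment, reusing the names from the reconstruction above and GCC's `__builtin_prefetch` (64-byte cache lines assumed; just illustrating the idea, not something I benchmarked):

```c
// Hypothetical experiment: touch the next cache line explicitly, instead of via memcmp.
int i = 0;
while (i < N) {
    __builtin_prefetch(tex + i + 64, 0, 3);   // read prefetch of the next 64-byte line
    i += rand_array[tex[i]];                  // same loop-carried dep chain as before
}
```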
If you compile with optimization, that 10-byte memcmp inlines and in the inner loop only touches 8 bytes. (If they match, then it jumps to another block where it checks the last 2. But it's exceedingly unlikely that ever happens).
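In C terms, that inlined compare boils down to something like the following sketch, using `memcpy` to express the unaligned 8-byte and 2-byte loads (a conceptual rewrite, not literally what GCC emits at the source level):

```c
#include <stdint.h>
#include <string.h>

// Roughly what the inlined memcmp(tex+i, str, 10) == 0 turns into:
static inline int match10(const char *p, const char *q) {
    uint64_t a8, b8;
    uint16_t a2, b2;
    memcpy(&a8, p, 8);  memcpy(&b8, q, 8);          // first 8 bytes: the only compare in the hot loop
    if (a8 != b8)
        return 0;
    memcpy(&a2, p + 8, 2);  memcpy(&b2, q + 8, 2);  // last 2 bytes: almost never reached
    return a2 == b2;
}
```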
This might be why `-O0` is faster than `-O3` on my system, with GCC 11.1 on an i7-6700k Skylake with DDR4-2666 DRAM, on Arch GNU/Linux. (Your question didn't specify which compiler and options your numbers were from, or which hardware.)
$ gcc -O3 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out
No.1: 0.041710 s
No.2: 0.063072 s
No.1: 0.040457 s
No.2: 0.061166 s
Performance counter stats for './a.out' (2 runs):
1,843.71 msec task-clock # 0.999 CPUs utilized ( +- 0.14% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
73,300 page-faults # 39.757 K/sec ( +- 0.00% )
6,607,078,798 cycles # 3.584 GHz ( +- 0.06% )
18,614,303,386 instructions # 2.82 insn per cycle ( +- 0.00% )
19,407,689,229 uops_issued.any # 10.526 G/sec ( +- 0.02% )
21,970,261,576 uops_executed.thread # 11.916 G/sec ( +- 0.02% )
5,408,090,087 mem_load_retired.l1_hit # 2.933 G/sec ( +- 0.00% )
3,861,687 mem_load_retired.l1_miss # 2.095 M/sec ( +- 2.11% )
1.84537 +- 0.00135 seconds time elapsed ( +- 0.07% )
$ grep . /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
... same on all 8 logical cores
The perf counter data is basically meaningless; it's for the whole run time, not the timed regions (which are about 0.1 sec out of 1.8 sec), so the majority of the time spent here was in glibc `random(3)`, which uses "a nonlinear additive feedback random number generator employing a default table of size 31 long integers", and in page faults on the big malloc region.
It's only interesting in terms of the delta between two builds, but the loops outside the timed region still contribute many extra uops so it's still not as interesting as one would like.
vs. `gcc -O0`: No.1 is faster, and No.2 is, as expected, a bit slower, with `-O0` putting a store/reload into the dependency chain involving `i`.
$ gcc -O0 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out
No.1: 0.028402 s
No.2: 0.076405 s
No.1: 0.028079 s
No.2: 0.069492 s
Performance counter stats for './a.out' (2 runs):
1,979.57 msec task-clock # 0.999 CPUs utilized ( +- 0.04% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
66,656 page-faults # 33.672 K/sec ( +- 0.00% )
7,252,728,414 cycles # 3.664 GHz ( +- 0.02% )
20,507,166,672 instructions # 2.83 insn per cycle ( +- 0.01% )
22,268,130,378 uops_issued.any # 11.249 G/sec ( +- 0.00% )
25,117,638,171 uops_executed.thread # 12.688 G/sec ( +- 0.00% )
6,640,523,801 mem_load_retired.l1_hit # 3.355 G/sec ( +- 0.01% )
3,350,518 mem_load_retired.l1_miss # 1.693 M/sec ( +- 1.39% )
1.9810591 +- 0.0000934 seconds time elapsed ( +- 0.00% )
Remember this profiling is for the total run time, not the timed regions, so the 2.83 IPC is not for the timed region.
Out-of-order exec hiding the cost of independent work
The actual throughput cost of running that macro-fused `cmp`/`je` uop doesn't add to any bottleneck, thanks to out-of-order exec, and neither would a whole `call memcmp@plt` plus setting up the args. The bottleneck is latency, not the front-end or load ports in the back end, and the OoO exec window is deep enough to hide the memcmp work. See also the following to understand modern CPUs. (Yes, it will take a lot of reading to wrap your head around them.)
The loop-carried dependency through `i` goes `i` -> `tex[i]` -> indirectly back into `i` for the next iteration, monotonically increasing it. `tex[]` is too big to fit in cache, and HW prefetch won't keep up with those strides.
So DRAM bandwidth might be the limiting factor, or even DRAM latency if HW prefetch doesn't detect and lock on to the somewhat-sequential access pattern to consecutive or every-other cache line.
The actual branch predicts perfectly, so it can execute (and confirm the prediction) when the load result arrives, in parallel with the stuff waiting for the sign-extending byte load that started at the same place.
> and the execution times are exactly the same. ... it actually becomes a lot faster.

Huh? Your times don't show exactly the same. Yes, it does become measurably faster with that `memcmp`.

> In theory, it should be slower.
Only if your theory is too simplistic to model modern CPUs. Not slowing it down is easily explainable because there is spare out-of-order exec throughput to do that independent work while waiting for the load latency.
Also possibly related for `-O0` performance:

For reference, here are the with-memcmp inner loops from GCC on Godbolt, matching what I benchmarked on my desktop.
# gcc11.1 -O0 -fPIE
.L10: # do{
mov eax, DWORD PTR -16[rbp]
movsx rdx, eax
mov rax, QWORD PTR -32[rbp]
lea rcx, [rdx+rax]
mov rax, QWORD PTR -40[rbp]
mov edx, 10
mov rsi, rax
mov rdi, rcx
call memcmp@PLT
test eax, eax
jne .L9 # normally taken jump over puts
lea rax, .LC0[rip]
mov rdi, rax
call puts@PLT
.L9:
mov eax, DWORD PTR -16[rbp]
movsx rdx, eax
mov rax, QWORD PTR -32[rbp]
add rax, rdx # strange not using an addressing mode, but ok
movzx eax, BYTE PTR [rax]
movsx eax, al # GCC -O0 is dumb,
cdqe # really dumb.
mov eax, DWORD PTR -576[rbp+rax*4] # look up in rand_array[]
add DWORD PTR -16[rbp], eax # i += ...
.L8:
cmp DWORD PTR -16[rbp], 299999999
jle .L10 # }while(i<300000000)
vs. the -O3 code:
# RBP holds char *tex at this point
.L10: # do{
movsx r12, r13d # sign-extend i
mov rax, QWORD PTR [rbx] # str[0..7] gets reloaded because alias analysis and the possible call to puts defeated the optimizer. Avoiding malloc for it may have helped.
add r12, rbp # i+tex to avoid indexed addressing modes later? Probably not optimal
cmp QWORD PTR [r12], rax # first 8 bytes of the memcmp
je .L18 # code at .L18 checks next 2 and maybe does puts, before jumping back
.L5:
movsx rax, BYTE PTR [r12] # sign-extending byte load to pointer width, of tex[i]
add r13d, DWORD PTR [rsp+rax*4] # i += look up in rand_array[]
cmp r13d, 299999999
jle .L10 # }while(i < 300000000)
Since the `je .L18` is never taken, this is 7 uops per iteration, so Skylake can comfortably issue it at under 2 cycles per iteration (7 uops / 4-wide issue = 1.75 cycles per iteration).
Even with L1d cache hits, the loop-carried dependency through `i` (R13D) is roughly:

- `movsx r12, r13d`: 1 cycle
- the sign-extending byte load of `tex[i]` (`movsx rax, BYTE PTR [r12]`): about 5 cycles of L1d load-use latency
- the memory-source `add r13d, DWORD PTR [rsp+rax*4]`: about 6 cycles for the indexed load plus 1 cycle for the add

So a total of about 13 cycles of latency in the best case with L1d hits, leaving tons of free "slots" in the front-end and idle back-end execution units, such that even a call to actual glibc memcmp would be no big deal.
(Of course the `-O0` code is noisier, so some of those free slots in the pipeline are already used up, but the dep chain is even longer because `-O0` code keeps `i` in memory: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?)
CPU frequency in memory-bottlenecked loops, especially on Skylake
The init loops provide plenty of time for the CPU to get to max turbo before the timed region, and to pre-fault the allocated pages by touching them first. (See Idiomatic way of performance evaluation?)
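In a benchmark without such an init loop, the usual trick is to pre-fault the buffer by touching each page right after allocating it, before the timed region. A minimal sketch, assuming 4 KiB pages and the `tex`/`N` names from the reconstruction above:

```c
// Touch one byte per page so the page faults happen before, not during, the timed loops.
// (The loop that fills tex[] already has this effect as a side benefit.)
for (size_t off = 0; off < N + 64; off += 4096)
    ((volatile char *)tex)[off] = 0;
```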
There might also be an effect of keeping CPU frequency higher if there's more work, less stalling, during those waits for incoming data from cache miss loads. See Slowing down CPU Frequency by imposing memory stress.
In the above results, I tested with EPP = `balance_performance`. I also tested with `performance`, and `-O0` (libc memcmp) is still faster than `-O3` (inlined memcmp), but all 4 loops did speed up some (with/without memcmp, and optimized or not).
| compiler / EPP | No.1 if(memcmp) | No.2 plain |
|---|---|---|
| -O0 / balance_performance | 0.028079 s | 0.069492 s |
| -O0 / performance | 0.026898 s | 0.053805 s |
| -O3 / balance_performance | 0.040457 s | 0.061166 s |
| -O3 / performance | 0.035621 s | 0.047475 s |
The faster-at-`-O0` effect is still present and very significant even with EPP=`performance`, so we know it's not just a matter of clock speed. From previous experiments, `performance` keeps it pegged at max turbo even in memory-bound conditions where other settings would clock down from 4.2 to 2.7 GHz. So it's very likely that calling memcmp is helping trigger better prefetching to reduce average cache-miss latency.
I unfortunately didn't use a fixed seed for the PRNG so there may be some variation due to the randomness being good vs. bad for the prefetcher, in terms of patterns of whole cache lines accessed. And I just took one specific run for each (the 2nd one in the pair started by perf stat -r 2, so it should hopefully be even less affected by system fluctuation.)
Looks like `performance` made a bigger difference to the No.2 loop (with less going on in it), and also to the `-O3` version (again with less going on in it), which is consistent with Skylake lowering clock speed when a core is pretty much only bottlenecking on memory, not running many other instructions, in EPP settings other than `performance`.