The key bottleneck is the load-use latency of cache misses while striding through `tex[]`, as part of the dependency chain from old `i` to new `i`. Lookups from the small `rand_array[]` will hit in cache and just add about 5 cycles of L1d load-use latency to the loop-carried dependency chain, making it even more unlikely for anything other than latency through `i` to be relevant to the overall speed of the loop. (Jérôme Richard's answer on your previous question discusses that latency dep chain and the random stride increment.)
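For reference, here is a minimal sketch of the kind of loops being discussed, reconstructed from the asm shown later in this answer; the names, array sizes, stride values, and init details are assumptions, not the OP's exact code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { N = 300000000, SLen = 10, RSIZE = 128 };   // sizes guessed from the asm / numbers quoted

int main(void) {
    char *tex = malloc(N + 64);                   // big array, far larger than any cache level
    int rand_array[RSIZE];                        // small table of strides, stays hot in L1d
    const char *str = "0123456789";               // 10-byte needle that essentially never matches

    srand(12345);
    for (int k = 0; k < RSIZE; k++)
        rand_array[k] = 1 + rand() % 8;           // small positive strides
    for (long k = 0; k < N + 64; k++)
        tex[k] = rand() % RSIZE;                  // small non-negative bytes, safe as indices

    // (timing of each loop omitted for brevity)
    int i = 0;
    while (i < N) {                               // loop No.1: with the "useless" memcmp
        if (memcmp(tex + i, str, SLen) == 0)
            puts("match");
        i += rand_array[tex[i]];                  // loop-carried dep chain through i
    }

    i = 0;
    while (i < N)                                 // loop No.2: plain version, same dep chain
        i += rand_array[tex[i]];
    printf("final i = %d\n", i);                  // keep the second loop from being optimized away

    free(tex);
    return 0;
}
```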
`memcmp(tex+i, str, SLen)` might be acting as prefetching if `tex+i` is near the end of a cache line. If you compiled with `-O0`, you're calling glibc memcmp, so the SSE 16-byte or AVX2 32-byte load done in glibc memcmp crosses a cache-line boundary and pulls in the next line. (IIRC, glibc memcmp checks for and avoids page crossing, but not cache-line splits, because those are relatively cheap on modern CPUs. Single-step into it to see for yourself, into the 2nd call to avoid lazy dynamic linking, or compile with `-fno-plt`.)
Possibly touching some cache lines that would have been skipped by the random stride helps HW prefetch be more aggressive and detect it as a sequential access pattern. A 32-byte AVX load is half a cache line, so it's fairly likely to cross into the next one.
So a more-predictable sequence of cache-line requests seen by the L2 cache is a plausible explanation for the memcmp version being faster. (Intel CPUs put the main prefetchers in the L2.)
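If that's what's happening, an explicit software prefetch of the next line should give a similar speedup in the plain loop. Here's a minimal sketch of that experiment, reusing the names from the reconstruction above and GCC's `__builtin_prefetch` (64-byte cache lines assumed; just illustrating the idea, not something I benchmarked):

```c
// Hypothetical experiment: touch the next cache line explicitly, instead of via memcmp.
int i = 0;
while (i < N) {
    __builtin_prefetch(tex + i + 64, 0, 3);   // read prefetch of the next 64-byte line
    i += rand_array[tex[i]];                  // same loop-carried dep chain as before
}
```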
If you compile with optimization, that 10-byte memcmp inlines and in the inner loop only touches 8 bytes. (If they match, then it jumps to another block where it checks the last 2. But it's exceedingly unlikely that ever happens).
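In C terms, that inlined compare boils down to something like the following sketch, using `memcpy` to express the unaligned 8-byte and 2-byte loads (a conceptual rewrite, not literally what GCC emits at the source level):

```c
#include <stdint.h>
#include <string.h>

// Roughly what the inlined memcmp(tex+i, str, 10) == 0 turns into:
static inline int match10(const char *p, const char *q) {
    uint64_t a8, b8;
    uint16_t a2, b2;
    memcpy(&a8, p, 8);  memcpy(&b8, q, 8);          // first 8 bytes: the only compare in the hot loop
    if (a8 != b8)
        return 0;
    memcpy(&a2, p + 8, 2);  memcpy(&b2, q + 8, 2);  // last 2 bytes: almost never reached
    return a2 == b2;
}
```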
This might be why `-O0` is faster than `-O3` on my system, with GCC 11.1 on an i7-6700k Skylake with DDR4-2666 DRAM, on Arch GNU/Linux. (Your question didn't specify which compiler and options your numbers were from, or which hardware.)
$ gcc -O3 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out
No.1: 0.041710 s
No.2: 0.063072 s
No.1: 0.040457 s
No.2: 0.061166 s
Performance counter stats for './a.out' (2 runs):
1,843.71 msec task-clock # 0.999 CPUs utilized ( +- 0.14% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
73,300 page-faults # 39.757 K/sec ( +- 0.00% )
6,607,078,798 cycles # 3.584 GHz ( +- 0.06% )
18,614,303,386 instructions # 2.82 insn per cycle ( +- 0.00% )
19,407,689,229 uops_issued.any # 10.526 G/sec ( +- 0.02% )
21,970,261,576 uops_executed.thread # 11.916 G/sec ( +- 0.02% )
5,408,090,087 mem_load_retired.l1_hit # 2.933 G/sec ( +- 0.00% )
3,861,687 mem_load_retired.l1_miss # 2.095 M/sec ( +- 2.11% )
1.84537 +- 0.00135 seconds time elapsed ( +- 0.07% )
$ grep . /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
... same on all 8 logical cores
The perf counter data is basically meaningless; it's for the whole run time, not the timed regions (which are about 0.1 sec out of 1.8 sec), so the majority of the time spent here was in glibc `random(3)`, which uses "a nonlinear additive feedback random number generator employing a default table of size 31 long integers", and in page faults on the big malloc region.
It's only interesting in terms of the delta between two builds, but the loops outside the timed region still contribute many extra uops so it's still not as interesting as one would like.
vs. `gcc -O0`: No.1 is faster, and No.2 is, as expected, a bit slower, with `-O0` putting a store/reload into the dependency chain involving `i`.
$ gcc -O0 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out
No.1: 0.028402 s
No.2: 0.076405 s
No.1: 0.028079 s
No.2: 0.069492 s
Performance counter stats for './a.out' (2 runs):
1,979.57 msec task-clock # 0.999 CPUs utilized ( +- 0.04% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
66,656 page-faults # 33.672 K/sec ( +- 0.00% )
7,252,728,414 cycles # 3.664 GHz ( +- 0.02% )
20,507,166,672 instructions # 2.83 insn per cycle ( +- 0.01% )
22,268,130,378 uops_issued.any # 11.249 G/sec ( +- 0.00% )
25,117,638,171 uops_executed.thread # 12.688 G/sec ( +- 0.00% )
6,640,523,801 mem_load_retired.l1_hit # 3.355 G/sec ( +- 0.01% )
3,350,518 mem_load_retired.l1_miss # 1.693 M/sec ( +- 1.39% )
1.9810591 +- 0.0000934 seconds time elapsed ( +- 0.00% )
Remember this profiling is for the total run time, not the timed regions, so the 2.83 IPC is not for the timed region.
Out-of-order exec hiding the cost of independent work
The actual throughput cost of running that macro-fused `cmp`/`je` uop doesn't add to any bottleneck, thanks to out-of-order exec, and neither would a whole `call memcmp@plt` plus setting up the args. The bottleneck is latency, not the front-end or load ports in the back end, and the OoO exec window is deep enough to hide the memcmp work. See also the following to understand modern CPUs. (Yes, it will take a lot of reading to wrap your head around them.)
The loop-carried dependency through `i` goes `i` -> `tex[i]` -> indirectly back into `i` for the next iteration, monotonically increasing it. `tex[]` is too big to fit in cache, and HW prefetch won't keep up with those strides.
So DRAM bandwidth might be the limiting factor, or even DRAM latency if HW prefetch doesn't detect and lock on to the somewhat-sequential access pattern to consecutive or every-other cache line.
The actual branch predicts perfectly, so it can execute (and confirm the prediction) when the load result arrives, in parallel with the stuff waiting for the sign-extending byte load that started at the same place.
> and the execution times are exactly the same. ... it actually becomes a lot faster.

Huh? Your times don't show exactly the same. Yes, it does become measurably faster with that `memcmp`.

> In theory, it should be slower.
Only if your theory is too simplistic to model modern CPUs. Not slowing it down is easily explainable because there is spare out-of-order exec throughput to do that independent work while waiting for the load latency.
Also possibly related for `-O0` performance:

For reference, here are the with-memcmp inner loops from GCC on Godbolt, matching what I benchmarked on my desktop.
# gcc11.1 -O0 -fPIE
.L10: # do{
mov eax, DWORD PTR -16[rbp]
movsx rdx, eax
mov rax, QWORD PTR -32[rbp]
lea rcx, [rdx+rax]
mov rax, QWORD PTR -40[rbp]
mov edx, 10
mov rsi, rax
mov rdi, rcx
call memcmp@PLT
test eax, eax
jne .L9 # normally taken jump over puts
lea rax, .LC0[rip]
mov rdi, rax
call puts@PLT
.L9:
mov eax, DWORD PTR -16[rbp]
movsx rdx, eax
mov rax, QWORD PTR -32[rbp]
add rax, rdx # strange not using an addressing mode, but ok
movzx eax, BYTE PTR [rax]
movsx eax, al # GCC -O0 is dumb,
cdqe # really dumb.
mov eax, DWORD PTR -576[rbp+rax*4] # look up in rand_array[]
add DWORD PTR -16[rbp], eax # i += ...
.L8:
cmp DWORD PTR -16[rbp], 299999999
jle .L10 # }while(i<300000000)
vs. the -O3 code:
# RBP holds char *tex at this point
.L10: # do{
movsx r12, r13d # sign-extend i
mov rax, QWORD PTR [rbx] # str[0..7] gets reloaded because alias analysis and the possible call to puts defeated the optimizer. Avoiding malloc for it may have helped.
add r12, rbp # i+tex to avoid indexed addressing modes later? Probably not optimal
cmp QWORD PTR [r12], rax # first 8 bytes of the memcmp
je .L18 # code at .L18 checks next 2 and maybe does puts, before jumping back
.L5:
movsx rax, BYTE PTR [r12] # sign-extending byte load to pointer width, of tex[i]
add r13d, DWORD PTR [rsp+rax*4] # i += look up in rand_array[]
cmp r13d, 299999999
jle .L10 # }while(i < 300000000)
Since the `je .L18` is never taken, this is 7 uops per iteration, so Skylake can comfortably issue it at under 2 cycles per iteration (7 uops / 4-wide issue = 1.75 cycles per iteration).
Even with L1d cache hits, the loop-carried dependency through `i` (R13D) is roughly:

- `movsx r12, r13d`: 1 cycle
- the sign-extending byte load of `tex[i]` (`movsx rax, BYTE PTR [r12]`): about 5 cycles of L1d load-use latency
- the memory-source `add r13d, DWORD PTR [rsp+rax*4]`: about 6 cycles for the indexed load plus 1 cycle for the add

So a total of about 13 cycles of latency in the best case with L1d hits, leaving tons of free "slots" in the front-end and idle back-end execution units, such that even a call to actual glibc memcmp would be no big deal.
(Of course the `-O0` code is noisier, so some of those free slots in the pipeline are already used up, but the dep chain is even longer because `-O0` code keeps `i` in memory: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?)
CPU frequency in memory-bottlenecked loops, especially on Skylake
The init loops provide plenty of time for the CPU to get to max turbo before the timed region, and to pre-fault the allocated pages by touching them first. (See Idiomatic way of performance evaluation?)
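In a benchmark without such an init loop, the usual trick is to pre-fault the buffer by touching each page right after allocating it, before the timed region. A minimal sketch, assuming 4 KiB pages and the `tex`/`N` names from the reconstruction above:

```c
// Touch one byte per page so the page faults happen before, not during, the timed loops.
// (The loop that fills tex[] already has this effect as a side benefit.)
for (size_t off = 0; off < N + 64; off += 4096)
    ((volatile char *)tex)[off] = 0;
```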
There might also be an effect of keeping CPU frequency higher if there's more work, less stalling, during those waits for incoming data from cache miss loads. See Slowing down CPU Frequency by imposing memory stress.
In the above results, I tested with EPP = `balance_performance`. I also tested with `performance`, and `-O0` (libc memcmp) is still faster than `-O3` (inlined memcmp), but all 4 loops did speed up some (with/without memcmp, and optimized or not).
| compiler / EPP | No.1 if(memcmp) | No.2 plain |
|---|---|---|
| -O0 / balance_performance | 0.028079 s | 0.069492 s |
| -O0 / performance | 0.026898 s | 0.053805 s |
| -O3 / balance_performance | 0.040457 s | 0.061166 s |
| -O3 / performance | 0.035621 s | 0.047475 s |
The faster-at-`-O0` effect is still present and very significant even with EPP=`performance`, so we know it's not just a matter of clock speed. From previous experiments, `performance` keeps it pegged at max turbo even in memory-bound conditions where other settings would clock down from 4.2 to 2.7 GHz. So it's very likely that calling memcmp is helping trigger better prefetching to reduce average cache-miss latency.
I unfortunately didn't use a fixed seed for the PRNG so there may be some variation due to the randomness being good vs. bad for the prefetcher, in terms of patterns of whole cache lines accessed. And I just took one specific run for each (the 2nd one in the pair started by perf stat -r 2, so it should hopefully be even less affected by system fluctuation.)
Looks like `performance` made a bigger difference to the No.2 loop (with less going on in it), and also to the `-O3` version (again with less going on in it), which is consistent with Skylake lowering clock speed when a core is pretty much only bottlenecking on memory, not running many other instructions, in EPP settings other than `performance`.