
Summary

I have two pieces of C++ code which do the same calculation. Code B executes about 33% fewer instructions and makes about 17% fewer memory accesses than code A, yet runs four times as fast (instead of the expected two). What would be the cause? Moreover, how would we be able to confirm the claims provided by your answers?

In both codes,

  • howmany is 20 000 000
  • testees has 20 000 000 elements, randomly generated (mt19937) at startup (before these snippets) for each of code A and code B.
  • multiplication is handled by a single access to memory (as seen in the assembly code later); a sketch of the GF2 type is given right after this list
  • Both codes were compiled with optimization flag -O1
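
For reference, here is a minimal sketch of the GF2 type, pieced together from the description above and the assembly below. The real class lives in the linked repository; everything here other than rep and multTable is an assumption:

    #include <cstdint>

    // 256 * 256 table of precomputed GF(2^8) products; 64 KB in total.
    extern std::uint8_t multTable[256][256];

    struct GF2 {
        std::uint8_t rep;
        explicit GF2(std::uint8_t r = 0) : rep{r} {}
        GF2& operator*=(GF2 other) {
            rep = multTable[rep][other.rep]; // one table lookup per multiply
            return *this;
        }
        friend GF2 operator*(GF2 a, GF2 b) { a *= b; return a; }
    };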

Some code

code A - runs in approx. 95 to 110 ms

    GF2 sum {GF2(1)};
    auto a = system_clock::now();
    for(size_t i=0;i<howmany;i++){
        sum *= testees[i]; 
    }
    auto b = system_clock::now();

code B - runs in approx. 25 to 30 ms

    GF2 sum1 {GF2(1)};
    GF2 sum2 {GF2(1)};
    GF2 sum3 {GF2(1)};
    GF2 sum4 {GF2(1)};
    auto aa = system_clock::now();
    for(size_t i=0;i<howmany;i+=4){
        sum1 *= testees[i];   // runs sum1.rep = multTable[sum1.rep][testees[i].rep]
        sum2 *= testees[i+1]; // GF2::rep is of type std::uint8_t
        sum3 *= testees[i+2]; // multTable is a std::uint8_t array of size 256 * 256; 64KB in total.
        sum4 *= testees[i+3];
    }
    sum1 = sum1*sum2*sum3*sum4;
    auto bb = system_clock::now();

testees is a std::vector filled with 20M random GF2 instances. 5_3_betterbenchmark.cpp generates them at runtime, so code A and code B run with identical testees.
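
A sketch of how testees might be filled, consistent with the above (the actual generation is in 5_3_betterbenchmark.cpp; the seeding here is illustrative, and the lower bound of 1 reflects the fact, mentioned later, that GF2(0) is excluded):

    #include <cstddef>
    #include <random>
    #include <vector>

    std::vector<GF2> testees;                         // GF2 as sketched above
    testees.reserve(howmany);                         // howmany == 20'000'000
    std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<int> dist(1, 255);  // 0 deliberately excluded
    for (std::size_t i = 0; i < howmany; ++i)
        testees.emplace_back(static_cast<std::uint8_t>(dist(gen)));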

code A, main loop (this runs 20M times, for a total of 20M * 9 = 180M instructions)

    .L170:
        movzbl  15(%rsp), %eax     # 15(%rsp) holds sum.rep -> mem access #1
        salq    $8, %rax           # shift sum.rep (in %rax) left by 8 bits
        addq    %rsi, %rax         # %rsi holds the address of multTable
        movzbl  (%rdx), %ecx       # (%rdx) is testees[i].rep -> mem access #2
        movzbl  (%rax,%rcx), %eax  # (%rax,%rcx) is multTable[sum.rep][testees[i].rep] -> mem access #3
        movb    %al, 15(%rsp)      # store back to sum.rep -> mem access #4
        addq    $1, %rdx           # i += 1, but in pointers
        cmpq    %rdi, %rdx         # loop condition check
        jne     .L170

code B, main loop (this runs 5M times, for a total of 5M * 24 = 120M instructions)

    .L171:
        movzbl  14(%rsp), %r8d     # sum1 -> mem access #1
        salq    $8, %r8
        addq    %rdi, %r8
        movzbl  (%rsi), %r10d      # -> mem access #2
        movzbl  (%r8,%r10), %r8d   # -> mem access #3
        movb    %r8b, 14(%rsp)     # -> mem access #4
        movzbl  %cl, %ecx          # sum2 is already in a register for some reason
        salq    $8, %rcx
        addq    %rdi, %rcx
        movzbl  1(%rsi), %r10d     # -> mem access #5
        movzbl  (%rcx,%r10), %ecx  # -> mem access #6
        movzbl  %dl, %edx          # sum3 is also already in a register
        salq    $8, %rdx
        addq    %rdi, %rdx
        movzbl  2(%rsi), %r10d     # -> mem access #7
        movzbl  (%rdx,%r10), %edx  # -> mem access #8
        movzbl  %al, %eax          # sum4 is also already in a register
        salq    $8, %rax
        addq    %rdi, %rax
        movzbl  3(%rsi), %r10d     # -> mem access #9
        movzbl  (%rax,%r10), %eax  # -> mem access #10
        addq    $4, %rsi           # i += 4 (in pointers)
        cmpq    %r9, %rsi          # decide if we should terminate the loop
        jne     .L171

Hypotheses

Here are some hypotheses my colleagues and I came up with after Googling and thinking hard; however, I am not convinced by any of them.

  • Claim: calls to std::chrono::system_clock::now() were misplaced by compiler optimizations
    • After a look at the whole assembly code we confirmed this was not the case
  • Claim: code A is more prone to TLB misses when process switches happen and the TLB is wiped (according to how Linux kernels are implemented)
    • However, the number of context switches during both runs was virtually the same, around only 6 in most cases (checked with sudo perf stat for both), and the number of context switches had little to no effect on the measured time.
  • Claim: code B results in fewer cache misses
    • However, testees is accessed in exactly the same order in both codes, and multTable is accessed 20M times in random order in both - which means the number of cache misses in both codes should be about the same. (One way to check this directly is sketched right after this list.)
    • Moreover, because sum.rep/sum1.rep are accessed very frequently, any sane cache policy would keep them in L1 cache - resulting in very few cache misses from accessing sum.rep.
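
For the cache-miss claim in particular, hardware counters can settle the question directly. Something along these lines should work (codeA and codeB are placeholders for the two binaries; the exact event names available vary by kernel and CPU):

    perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,context-switches ./codeA
    perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,context-switches ./codeB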

Edited after question was closed

This is, largely, a report after checking out the things suggested by the comments.

some clarification

User Aki Suihkonen, in the comments, pointed out that this test is inherently biased, as I might not have prevented GF2(0) from being mixed in. As a matter of fact, I have prevented this from the start (for more details please see my comment below). I believe my question was a perfectly valid one, perhaps aside from the fact that it had answers elsewhere on the site.

Aftermath

My questions were actually two questions in one.

  • Why is code B (4x) faster than code A?
  • How do I check your claim is true?

The only comment that addressed the second question is user mattlangford's. mattlangford suggested that I check out llvm-mca, a cool tool that also lets me visualize the execution (from dispatch to retire) in a timeline fashion. However, only after being presented with some unsatisfactory results did I notice these sentences in the guide:

  • The LSUnit does not know when store-to-load forwarding may occur.
  • The LSUnit does not know anything about cache hierarchy and memory types.
  • The LSUnit does not know how to identify serializing operations and memory fences.

Also, the results of llvm-mca for code A look something like this (the command used was llvm-mca (input) -iterations=200 -timeline -o 630_llvmmca.txt):

Timeline view:
                    0123456789          012345
Index     0123456789          0123456789

[0,0]     DeeeeeER  .    .    .    .    .    .   movzbl 14(%rsp), %eax
[0,1]     D=====eER .    .    .    .    .    .   shlq   $8, %rax
[0,2]     D======eER.    .    .    .    .    .   addq   %rsi, %rax
[0,3]     DeeeeeE--R.    .    .    .    .    .   movzbl (%rdx), %ecx
[0,4]     .D======eeeeeER.    .    .    .    .   movzbl (%rax,%rcx), %eax
[0,5]     .D===========eER    .    .    .    .   movb   %al, 14(%rsp)
[0,6]     .DeE-----------R    .    .    .    .   addq   $1, %rdx
[0,7]     .D=eE----------R    .    .    .    .   cmpq   %rdi, %rdx
[0,8]     . D=eE---------R    .    .    .    .   jne    .L171
[1,0]     . DeeeeeE------R    .    .    .    .   movzbl 14(%rsp), %eax
[1,1]     . D=====eE-----R    .    .    .    .   shlq   $8, %rax
[1,2]     . D======eE----R    .    .    .    .   addq   %rsi, %rax
[1,3]     .  DeeeeeE-----R    .    .    .    .   movzbl (%rdx), %ecx
[1,4]     .  D======eeeeeER   .    .    .    .   movzbl (%rax,%rcx), %eax
[1,5]     .  D===========eER  .    .    .    .   movb   %al, 14(%rsp)
[1,6]     .  DeE-----------R  .    .    .    .   addq   $1, %rdx
[1,7]     .   DeE----------R  .    .    .    .   cmpq   %rdi, %rdx
[1,8]     .   D=eE---------R  .    .    .    .   jne    .L171
[2,0]     .   DeeeeeE------R  .    .    .    .   movzbl 14(%rsp), %eax
[2,1]     .   D=====eE-----R  .    .    .    .   shlq   $8, %rax
[2,2]     .    D=====eE----R  .    .    .    .   addq   %rsi, %rax
[2,3]     .    DeeeeeE-----R  .    .    .    .   movzbl (%rdx), %ecx

and so on. At least according to this diagram, four instructions are already being dispatched in parallel. I checked the llvm-mca output for code B as well:

[0,0]     DeeeeeER  .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     14(%rsp), %r8d
[0,1]     D=====eER .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   shlq       $8, %r8
[0,2]     D======eER.    .    .    .    .    .    .    .    .    .    .    .    .    .   .   addq       %rdi, %r8
[0,3]     DeeeeeE--R.    .    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     (%rsi), %r10d
[0,4]     .D======eeeeeER.    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     (%r8,%r10), %r8d
[0,5]     .D===========eER    .    .    .    .    .    .    .    .    .    .    .    .   .   movb       %r8b, 14(%rsp)
[0,6]     .DeE-----------R    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     %cl, %ecx
[0,7]     .D=eE----------R    .    .    .    .    .    .    .    .    .    .    .    .   .   shlq       $8, %rcx
[0,8]     . D=eE---------R    .    .    .    .    .    .    .    .    .    .    .    .   .   addq       %rdi, %rcx
[0,9]     . DeeeeeE------R    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     1(%rsi), %r10d
[0,10]    . D=====eeeeeE-R    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     (%rcx,%r10), %ecx
[0,11]    . DeE----------R    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     %dl, %edx
[0,12]    .  DeE---------R    .    .    .    .    .    .    .    .    .    .    .    .   .   shlq       $8, %rdx
[0,13]    .  D=eE--------R    .    .    .    .    .    .    .    .    .    .    .    .   .   addq       %rdi, %rdx
[0,14]    .  DeeeeeE-----R    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     2(%rsi), %r10d
[0,15]    .  D=====eeeeeER    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     (%rdx,%r10), %edx
[0,16]    .   DeE--------R    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     %al, %eax
[0,17]    .   D=eE-------R    .    .    .    .    .    .    .    .    .    .    .    .   .   shlq       $8, %rax
[0,18]    .   D==eE------R    .    .    .    .    .    .    .    .    .    .    .    .   .   addq       %rdi, %rax
[0,19]    .   DeeeeeE----R    .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     3(%rsi), %r10d
[0,20]    .    D====eeeeeER   .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     (%rax,%r10), %eax
[0,21]    .    DeE--------R   .    .    .    .    .    .    .    .    .    .    .    .   .   addq       $4, %rsi
[0,22]    .    D=eE-------R   .    .    .    .    .    .    .    .    .    .    .    .   .   cmpq       %r9, %rsi
[0,23]    .    D==eE------R   .    .    .    .    .    .    .    .    .    .    .    .   .   jne        .L171
[1,0]     .    .DeeeeeE---R   .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     14(%rsp), %r8d
[1,1]     .    .D=====eE--R   .    .    .    .    .    .    .    .    .    .    .    .   .   shlq       $8, %r8
[1,2]     .    .D======eE-R   .    .    .    .    .    .    .    .    .    .    .    .   .   addq       %rdi, %r8
[1,3]     .    .DeeeeeE---R   .    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     (%rsi), %r10d
[1,4]     .    . D======eeeeeER    .    .    .    .    .    .    .    .    .    .    .   .   movzbl     (%r8,%r10), %r8d
[1,5]     .    . D===========eER   .    .    .    .    .    .    .    .    .    .    .   .   movb       %r8b, 14(%rsp)
[1,6]     .    . D=====eE------R   .    .    .    .    .    .    .    .    .    .    .   .   movzbl     %cl, %ecx

Frankly, this is not any different from code A, except that there are now 4x fewer writes to memory - which should not really matter, because any sane cache policy should keep whatever is at 14(%rsp) in L1 cache. (From what I know, all Intel-compatible CPUs of the past few decades have employed write-back caches.)

More tests

User Margaret Bloom claimed that

"If you unroll more and more you'll see less and less increment in performance."

This claim proved to be true. I unrolled code B by a further 2x and saw a further 2x increase in speed; beyond that, I failed to gain any further speedup. (A generalized sketch of this N-way unrolling is given after the code C listings below.)

Code C

    GF2 sum1 {GF2(1)};
    GF2 sum2 {GF2(1)};
    GF2 sum3 {GF2(1)};
    GF2 sum4 {GF2(1)};
    GF2 sum5 {GF2(1)};
    GF2 sum6 {GF2(1)};
    GF2 sum7 {GF2(1)};
    GF2 sum8 {GF2(1)};


    auto aa = system_clock::now();
    for(size_t i=0;i<howmany;i+=8){
        sum1 *= testees[i];
        sum2 *= testees[i+1];
        sum3 *= testees[i+2];
        sum4 *= testees[i+3];
        sum5 *= testees[i+4];
        sum6 *= testees[i+5];
        sum7 *= testees[i+6];
        sum8 *= testees[i+7];

    }

Code C - assembly for the main loop

    movzbl  14(%rsp), %ebp
    salq    $8, %rbp
    addq    %r11, %rbp
    movzbl  (%rax), %r13d
    movzbl  0(%rbp,%r13), %ebp
    movb    %bpl, 14(%rsp)
    movzbl  %r10b, %r10d
    salq    $8, %r10
    addq    %r11, %r10
    movzbl  1(%rax), %r13d
    movzbl  (%r10,%r13), %r10d
    movzbl  %r9b, %r9d
    salq    $8, %r9
    addq    %r11, %r9
    movzbl  2(%rax), %r13d
    movzbl  (%r9,%r13), %r9d
    movzbl  %r8b, %r8d
    salq    $8, %r8
    addq    %r11, %r8
    movzbl  3(%rax), %r13d
    movzbl  (%r8,%r13), %r8d
    movzbl  %dil, %edi
    salq    $8, %rdi
    addq    %r11, %rdi
    movzbl  4(%rax), %r13d
    movzbl  (%rdi,%r13), %edi
    movzbl  %sil, %esi
    salq    $8, %rsi
    addq    %r11, %rsi
    movzbl  5(%rax), %r13d
    movzbl  (%rsi,%r13), %esi
    movzbl  %cl, %ecx
    salq    $8, %rcx
    addq    %r11, %rcx
    movzbl  6(%rax), %r13d
    movzbl  (%rcx,%r13), %ecx
    movzbl  %dl, %edx
    salq    $8, %rdx
    addq    %r11, %rdx
    movzbl  7(%rax), %r13d
    movzbl  (%rdx,%r13), %edx
    addq    $8, %rax
    cmpq    %r12, %rax
    jne     .L171
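
As an aside, the unrolling pattern generalizes. Here is a sketch of an N-way unrolled product (product_unrolled is a made-up name; it assumes v.size() is a multiple of N, and whether the N accumulators actually stay in registers is up to the optimizer):

    #include <cstddef>
    #include <vector>

    template <std::size_t N>
    GF2 product_unrolled(const std::vector<GF2>& v) {
        GF2 sums[N];
        for (auto& s : sums) s = GF2(1);    // N independent dependency chains
        for (std::size_t i = 0; i < v.size(); i += N)
            for (std::size_t j = 0; j < N; ++j)
                sums[j] *= v[i + j];        // chain j depends only on chain j
        GF2 total {GF2(1)};
        for (auto& s : sums) total *= s;    // combine the partial products
        return total;
    }

With this, code C is roughly product_unrolled<8>(testees).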

Tentative conclusion & some issues I have with it

Gathering all the information in one place, this is the conclusion I could make.

  • According to llvm-mca, it is possible to dispatch four instructions on each cycle, even for code A - given that memory latency is actually as short as presented in the diagrams above.
  • However, any memory latency information provided by llvm-mca may be gravely incorrect, as it "does not know anything about cache hierarchy and memory types" - as quoted above.
  • Now, when actually considering the memory hierarchy, the latency predictions made by llvm-mca would be gravely incorrect for the movzbls that fetch from multTable[sum.rep][other.rep] - everything else should be in cache, if my cache is sane. These movzbls will result in a cache miss in many cases - and might take hundreds of cycles.
  • code B removes some dependencies - a.k.a. hazards - by writing the assembly loop of code A four times and renaming some registers/memory locations. (A related experiment suggested in the comments is sketched right after this list.)
  • If I try to draw a diagram similar to the ones generated by llvm-mca above, but this time taking cache misses into consideration, I should get a rough estimate of the total running time...?
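
On that last point: the comments suggest that the latency being hidden is largely the store/reload of sum (store forwarding, ~5 cycles). A sketch of a variant of code A that tries to keep the accumulator in a register instead of at 15(%rsp) - assuming rep and multTable are accessible as in the sketch near the top, and with no guarantee the compiler cooperates at -O1:

    GF2 sum {GF2(1)};
    std::uint8_t acc = sum.rep;                // hold the accumulator in a local
    auto a = system_clock::now();
    for (size_t i = 0; i < howmany; i++) {
        acc = multTable[acc][testees[i].rep];  // no store/reload of sum.rep per iteration
    }
    sum.rep = acc;
    auto b = system_clock::now();

For scale: at an assumed ~3 GHz, code A's ~100 ms over 20M iterations comes to roughly 15 cycles per iteration - about what one loop-carried chain of a store-forward plus two dependent loads would cost - and four such chains overlapping would land near the observed ~25 ms.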

I am still puzzled

  • If my CPU can dispatch only four instructions per cycle, what could explain the 8x speedup of code C compared to code A?

Future plans

  • Now this is clearly becoming a question unfit for Stack Overflow - I will ask another question elsewhere, or look it up more.

Thank you very much for providing valuable insights!

Jayeon Yi

Comments

  • This is likely due to out of order execution. With your B variant, the four additions in the loop can be evaluated by four execution units in parallel (alternatively, vectorisation can be used). With variant A, this is not possible. So the total latency is reduced by a factor of 4 as you observed. – fuz May 13 '21 at 10:36
  • Are you compiling with or without optimizations? – Sam Varshavchik May 13 '21 at 10:41
  • @SamVarshavchik: As stated in the question, OP is using `-O1` in both cases. – Andreas Wenzel May 13 '21 at 10:41
  • @fuz To my knowledge, the `-O3` option supports unrolling loops, like you have suggested - but even with -O3 there is no marked decrease in execution speed for code A. – Jayeon Yi May 13 '21 at 10:48
  • By unrolling the loop, you have nine (if I eye-balled correctly) dependency chains and the CPU can use more execution units through OoO dispatch. It still can only do two loads and one (or two) stores per cycle but there are plenty of dependency chains to absorb any bubble. There is no reason to expect the code to be only twice as fast (note that you can't do 33%+17% as these sets are not disjoint). If you unroll more and more you'll see less and less increment in performance. – Margaret Bloom May 13 '21 at 10:54
  • Do the numbers change, when you first run A and then B, or first B and then A? Are these separate executables or both run in one? – Olaf Dietsche May 13 '21 at 11:28
  • @Olaf Dietsche I've tried all three of what you said - separate executables, A then B (same numbers), B then A (same numbers) – Jayeon Yi May 13 '21 at 11:37
  • @JayeonYi I'm not sure how unrolling is supposed to be relevant. – fuz May 13 '21 at 11:45
  • This test is biased -- it doesn't measure generic GF2 multiplication speed, since 0 * A = 0. As soon as a random value of 0 is encountered all the memory access for that particular dependency chain concentrates to an area of 256 bytes. (And there are better ways to calculate the product of 20'000'000 elements in GF2 -- starting with a histogram...) – Aki Suihkonen May 13 '21 at 12:20
  • @AkiSuihkonen You are right: however, I would like to mention that I took the utmost care not to include a 0 in the 20 000 000 numbers. You are right about the histogram, but I simply wanted a benchmark on how fast operator*= would work. The full code is available here: https://github.com/stet-stet/many-gf2n-cpp/blob/main/5_3_betterbenchmark.cpp – Jayeon Yi May 13 '21 at 12:52
  • @fuz My mistake - I confused register renaming with code B. – Jayeon Yi May 13 '21 at 12:57
  • LLVM has a cool tool that lets you see the execution in a timeline view that might be worth checking out: https://llvm.org/docs/CommandGuide/llvm-mca.html The difference is almost certainly due to the fact you're able to do 4 multiplies in "parallel" and do 1/4th the number of jumps. Additionally here is a microbenchmark that confirms your findings: https://quick-bench.com/q/1aevPJXBLrZ3JJ77aMQaEOJ1jlE – mattlangford May 13 '21 at 13:37
  • @JayeonYi The important part is that independent calculations can execute independently. In (A), you have one big sum where each step depends on all previous steps. (B) has four separate sums that can be calculated in parallel, giving a 4 times speedup. – fuz May 13 '21 at 14:05
  • @JayeonYi: Basically the same effect as [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) - unrolling FP loops with multiple accumulators to hide latency and only bottleneck on throughput. In your case, it's store/reload (store forwarding) ~5 cycle latency because something (perhaps aliasing) is stopping the compiler from keeping `sum` in a register. Or possibly because you only used `-O1` – Peter Cordes May 13 '21 at 16:13
  • Re: store-forwarding latency: [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/a/49190421) – Peter Cordes May 13 '21 at 16:17
  • Also: [Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell](https://stackoverflow.com/q/21090873) is still about FP latency, but the same concepts apply. – Peter Cordes May 13 '21 at 16:27
  • in general fewer instructions does not mean faster, nor do fewer memory accesses; one would hope that if you have less of both it is faster, but that is not something safe to assume. I could have 100 fewer instructions and one additional memory access and have performance be a wash if it has to go out to DRAM for that access. – old_timer May 16 '21 at 02:49
  • The actual critical problem with using LLVM-MCA for this is *does not know when store-to-load forwarding may occur.* - when store/reload is part of a loop-carried dependency chain (like in your case), LLVM-MCA thinks the load is actually independent! So it simulates way more instruction-level parallelism than is really possible. As [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) (and one of the linked duplicates) points out, analyzing dependency chains (by hand or with tools like LLVM-MCA) is the starting point to static perf analysis. – Peter Cordes May 18 '21 at 16:16
  • Your sum1..8 variables will stay hot in L1d cache; modeling the cache hierarchy isn't a problem. It's modeling store-forwarding (and detecting the dependency) that matters. Some loops do `a[i] += 1;` or something so load/store instructions in one iteration are using a different address than the previous iteration. If you could convince the compiler not to spill/reload your variables, that would shorten the latency you have to hide, so you'd need less unrolling. – Peter Cordes May 18 '21 at 16:19

0 Answers