Summary
I have two pieces of C++ code that do the same calculation. Code B executes about 33% fewer instructions and makes about 17% fewer memory accesses than code A, yet runs four times as fast, not just twice. What would be the cause? Moreover, how would we be able to confirm the claims made in your answers?
In both codes, `howmany` is 20,000,000 and `testees` has 20,000,000 elements, randomly generated (`mt19937`) at startup (before these snippets) for each of code A and code B.
- Multiplication is handled by a single access to memory (as seen in the assembly code later).
- Both codes were compiled with the optimization flag `-O1`.
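For reference, here is a minimal sketch of what the GF2 type might look like, reconstructed from the description and the comments in code B below. The class body and the table contents are assumptions; only `rep`, `multTable`, and the single-lookup multiply are stated in the question.

```cpp
#include <cstdint>

// Hypothetical reconstruction of the GF2 type described above: the element
// is a single byte `rep`, and multiplication is one lookup into a
// 256 * 256 byte table (64 KB in total). The table is filled at startup.
struct GF2 {
    std::uint8_t rep;

    explicit GF2(std::uint8_t r = 0) : rep(r) {}

    // 64 KB multiplication table; contents assumed to be filled elsewhere.
    inline static std::uint8_t multTable[256][256] = {};

    GF2& operator*=(GF2 other) {
        rep = multTable[rep][other.rep]; // the single memory access per multiply
        return *this;
    }
};

inline GF2 operator*(GF2 a, GF2 b) { a *= b; return a; }
```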
Some code
code A - runs in approx. 95 to 110 ms
GF2 sum {GF2(1)};
auto a = system_clock::now();
for (size_t i = 0; i < howmany; i++) {
    sum *= testees[i];
}
auto b = system_clock::now();
code B - runs in approx. 25 to 30 ms
GF2 sum1 {GF2(1)};
GF2 sum2 {GF2(1)};
GF2 sum3 {GF2(1)};
GF2 sum4 {GF2(1)};
auto aa = system_clock::now();
for (size_t i = 0; i < howmany; i += 4) {
    sum1 *= testees[i];     // runs sum1.rep = multTable[sum1.rep][testees[i].rep]
    sum2 *= testees[i + 1]; // GF2::rep is of type std::uint8_t
    sum3 *= testees[i + 2]; // multTable is a std::uint8_t array of size 256 * 256; 64 KB in total
    sum4 *= testees[i + 3];
}
sum1 = sum1 * sum2 * sum3 * sum4;
auto bb = system_clock::now();
`testees` is a std::vector filled with 20M random GF2 instances. 5_3_betterbenchmark.cpp randomly generates them at runtime, so code A and code B run with identical `testees`.
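A sketch of how 5_3_betterbenchmark.cpp presumably fills `testees` (the file itself is not shown, so the seed and distribution bounds are assumptions; zero is excluded, since the author states below that GF2(0) was prevented from the start):

```cpp
#include <cstdint>
#include <random>
#include <vector>

struct GF2Stub { std::uint8_t rep; }; // stand-in for the question's GF2 type

// Hypothetical setup: `howmany` random elements from an mt19937, avoiding
// GF2(0) so the running product never collapses to zero.
std::vector<GF2Stub> makeTestees(std::size_t howmany, std::uint32_t seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(1, 255); // skip 0 (assumption)
    std::vector<GF2Stub> testees(howmany);
    for (auto& t : testees) t.rep = static_cast<std::uint8_t>(dist(gen));
    return testees;
}
```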
code A, main loop (runs 20M times, making 20M * 9 = 180M instructions total)
.L170:
movzbl 15(%rsp), %eax    # 15(%rsp) holds sum.rep -> mem access #1
salq $8, %rax            # shift sum.rep (in %rax) left by 8 bits
addq %rsi, %rax          # %rsi holds the address of multTable
movzbl (%rdx), %ecx      # (%rdx) is testees[i].rep -> mem access #2
movzbl (%rax,%rcx), %eax # (%rax,%rcx) is multTable[sum.rep][testees[i].rep] -> mem access #3
movb %al, 15(%rsp)       # store back to sum.rep -> mem access #4
addq $1, %rdx            # i += 1, but in pointers
cmpq %rdi, %rdx          # loop condition check
jne .L170
code B, main loop (runs 5M times, making 5M * 24 = 120M instructions total)
.L171:
movzbl 14(%rsp), %r8d    # sum1 -> mem access #1
salq $8, %r8
addq %rdi, %r8
movzbl (%rsi), %r10d     # -> mem access #2
movzbl (%r8,%r10), %r8d  # -> mem access #3
movb %r8b, 14(%rsp)      # -> mem access #4
movzbl %cl, %ecx         # sum2 is already in a register for some reason
salq $8, %rcx
addq %rdi, %rcx
movzbl 1(%rsi), %r10d    # -> mem access #5
movzbl (%rcx,%r10), %ecx # -> mem access #6
movzbl %dl, %edx         # sum3 is also already in a register for some reason
salq $8, %rdx
addq %rdi, %rdx
movzbl 2(%rsi), %r10d    # -> mem access #7
movzbl (%rdx,%r10), %edx # -> mem access #8
movzbl %al, %eax         # sum4 is also already in a register for some reason
salq $8, %rax
addq %rdi, %rax
movzbl 3(%rsi), %r10d    # -> mem access #9
movzbl (%rax,%r10), %eax # -> mem access #10
addq $4, %rsi            # i += 4 (in pointers)
cmpq %r9, %rsi           # decide whether to terminate the loop
jne .L171
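Note that regrouping the product into four interleaved partial products (code B) yields the same result as the single chain (code A) only because the multiplication is associative and commutative. A quick self-check of that equivalence, using modular multiplication as a stand-in for the GF2 table lookup (the actual multTable contents are not shown in the question):

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the GF2 table multiply: multiplication mod the prime 251,
// which is associative and commutative like the field multiply in question.
static std::uint8_t mulStub(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>((a * b) % 251u);
}

// code A style: one serial dependency chain.
std::uint8_t productSerial(const std::vector<std::uint8_t>& v) {
    std::uint8_t sum = 1;
    for (std::size_t i = 0; i < v.size(); ++i) sum = mulStub(sum, v[i]);
    return sum;
}

// code B style: four interleaved chains, combined at the end.
// Assumes v.size() is a multiple of 4, as in the question (20M elements).
std::uint8_t productUnrolled4(const std::vector<std::uint8_t>& v) {
    std::uint8_t s1 = 1, s2 = 1, s3 = 1, s4 = 1;
    for (std::size_t i = 0; i < v.size(); i += 4) {
        s1 = mulStub(s1, v[i]);
        s2 = mulStub(s2, v[i + 1]);
        s3 = mulStub(s3, v[i + 2]);
        s4 = mulStub(s4, v[i + 3]);
    }
    return mulStub(mulStub(mulStub(s1, s2), s3), s4);
}
```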
Hypotheses
Here are some hypotheses my colleagues and I came up with after Googling and thinking hard; however, I am hardly convinced by any of them.
- Claim: calls to `std::chrono::system_clock::now()` were misplaced by compiler optimizations.
  - After a look at the whole assembly code, we confirmed this was not the case.
- Claim: code A is more prone to TLB misses when process switches happen and the TLB is wiped (according to how Linux kernels are implemented).
  - However, the number of context switches during both runs was virtually the same, around only 6 in most cases (checked with `sudo perf stat` for both), and the number of context switches had little to no effect on the measured time.
- Claim: code B results in fewer cache misses.
  - However, `testees` is accessed in exactly the same order in both codes, and `multTable` is accessed 20M times in random order in both, so the number of cache misses should be about the same in both codes.
  - Moreover, because `sum.rep`/`sum1.rep` are accessed very frequently, any sane cache policy would keep them in L1 cache, resulting in very few cache misses from accessing `sum.rep`.
Edited after question was closed
This is, largely, a report on checking out the things suggested in the comments.
Some clarification
User Aki Suihkonen pointed out in the comments that this test is inherently biased, as I might not have prevented GF2(0) from being mixed in. As a matter of fact, I had prevented this from the start (for more details, please see my comment below). I believe my question was a perfectly valid one, perhaps aside from the fact that it had answers elsewhere on the site.
Aftermath
My question was actually two questions in one.
- Why is code B (4x) faster than code A
- How do I check your claim is true?
The only comment that addressed the second question was user mattlangford's. He suggested that I check out llvm-mca, a cool tool that also lets me visualize the execution (from dispatch to retirement) on a timeline. However, only after being presented with some unsatisfactory results did I notice these sentences in the guide:
- The LSUnit does not know when store-to-load forwarding may occur.
- The LSUnit does not know anything about cache hierarchy and memory types.
- The LSUnit does not know how to identify serializing operations and memory fences.
Also, the results of llvm-mca for code A look something like this (the command used is `llvm-mca (input) -iterations=200 -timeline -o 630_llvmmca.txt`):
Timeline view:
0123456789 012345
Index 0123456789 0123456789
[0,0] DeeeeeER . . . . . . movzbl 14(%rsp), %eax
[0,1] D=====eER . . . . . . shlq $8, %rax
[0,2] D======eER. . . . . . addq %rsi, %rax
[0,3] DeeeeeE--R. . . . . . movzbl (%rdx), %ecx
[0,4] .D======eeeeeER. . . . . movzbl (%rax,%rcx), %eax
[0,5] .D===========eER . . . . movb %al, 14(%rsp)
[0,6] .DeE-----------R . . . . addq $1, %rdx
[0,7] .D=eE----------R . . . . cmpq %rdi, %rdx
[0,8] . D=eE---------R . . . . jne .L171
[1,0] . DeeeeeE------R . . . . movzbl 14(%rsp), %eax
[1,1] . D=====eE-----R . . . . shlq $8, %rax
[1,2] . D======eE----R . . . . addq %rsi, %rax
[1,3] . DeeeeeE-----R . . . . movzbl (%rdx), %ecx
[1,4] . D======eeeeeER . . . . movzbl (%rax,%rcx), %eax
[1,5] . D===========eER . . . . movb %al, 14(%rsp)
[1,6] . DeE-----------R . . . . addq $1, %rdx
[1,7] . DeE----------R . . . . cmpq %rdi, %rdx
[1,8] . D=eE---------R . . . . jne .L171
[2,0] . DeeeeeE------R . . . . movzbl 14(%rsp), %eax
[2,1] . D=====eE-----R . . . . shlq $8, %rax
[2,2] . D=====eE----R . . . . addq %rsi, %rax
[2,3] . DeeeeeE-----R . . . . movzbl (%rdx), %ecx
and so on. At least according to this diagram, four instructions are already being dispatched in parallel. I checked the llvm-mca output for code B as well:
[0,0] DeeeeeER . . . . . . . . . . . . . . . movzbl 14(%rsp), %r8d
[0,1] D=====eER . . . . . . . . . . . . . . . shlq $8, %r8
[0,2] D======eER. . . . . . . . . . . . . . . addq %rdi, %r8
[0,3] DeeeeeE--R. . . . . . . . . . . . . . . movzbl (%rsi), %r10d
[0,4] .D======eeeeeER. . . . . . . . . . . . . . movzbl (%r8,%r10), %r8d
[0,5] .D===========eER . . . . . . . . . . . . . movb %r8b, 14(%rsp)
[0,6] .DeE-----------R . . . . . . . . . . . . . movzbl %cl, %ecx
[0,7] .D=eE----------R . . . . . . . . . . . . . shlq $8, %rcx
[0,8] . D=eE---------R . . . . . . . . . . . . . addq %rdi, %rcx
[0,9] . DeeeeeE------R . . . . . . . . . . . . . movzbl 1(%rsi), %r10d
[0,10] . D=====eeeeeE-R . . . . . . . . . . . . . movzbl (%rcx,%r10), %ecx
[0,11] . DeE----------R . . . . . . . . . . . . . movzbl %dl, %edx
[0,12] . DeE---------R . . . . . . . . . . . . . shlq $8, %rdx
[0,13] . D=eE--------R . . . . . . . . . . . . . addq %rdi, %rdx
[0,14] . DeeeeeE-----R . . . . . . . . . . . . . movzbl 2(%rsi), %r10d
[0,15] . D=====eeeeeER . . . . . . . . . . . . . movzbl (%rdx,%r10), %edx
[0,16] . DeE--------R . . . . . . . . . . . . . movzbl %al, %eax
[0,17] . D=eE-------R . . . . . . . . . . . . . shlq $8, %rax
[0,18] . D==eE------R . . . . . . . . . . . . . addq %rdi, %rax
[0,19] . DeeeeeE----R . . . . . . . . . . . . . movzbl 3(%rsi), %r10d
[0,20] . D====eeeeeER . . . . . . . . . . . . . movzbl (%rax,%r10), %eax
[0,21] . DeE--------R . . . . . . . . . . . . . addq $4, %rsi
[0,22] . D=eE-------R . . . . . . . . . . . . . cmpq %r9, %rsi
[0,23] . D==eE------R . . . . . . . . . . . . . jne .L171
[1,0] . .DeeeeeE---R . . . . . . . . . . . . . movzbl 14(%rsp), %r8d
[1,1] . .D=====eE--R . . . . . . . . . . . . . shlq $8, %r8
[1,2] . .D======eE-R . . . . . . . . . . . . . addq %rdi, %r8
[1,3] . .DeeeeeE---R . . . . . . . . . . . . . movzbl (%rsi), %r10d
[1,4] . . D======eeeeeER . . . . . . . . . . . . movzbl (%r8,%r10), %r8d
[1,5] . . D===========eER . . . . . . . . . . . . movb %r8b, 14(%rsp)
[1,6] . . D=====eE------R . . . . . . . . . . . . movzbl %cl, %ecx
Frankly, this is not any different from code A, except that there are now 4x fewer writes to memory - which should not really matter, because any sane cache policy should keep whatever is at 14(%rsp) in L1 cache. (From what I know, all Intel-compatible CPUs of the past few decades have employed write-back caches.)
More tests
User Margaret Bloom claimed that
If you unroll more and more you'll see less and less increment in performance.
This claim proved to be true. I unrolled code B by a further 2x and saw a further 2x increase in speed. After that, I failed to gain any further speedup.
Code C
GF2 sum1 {GF2(1)};
GF2 sum2 {GF2(1)};
GF2 sum3 {GF2(1)};
GF2 sum4 {GF2(1)};
GF2 sum5 {GF2(1)};
GF2 sum6 {GF2(1)};
GF2 sum7 {GF2(1)};
GF2 sum8 {GF2(1)};
auto aa = system_clock::now();
for(size_t i=0;i<howmany;i+=8){
sum1 *= testees[i];
sum2 *= testees[i+1];
sum3 *= testees[i+2];
sum4 *= testees[i+3];
sum5 *= testees[i+4];
sum6 *= testees[i+5];
sum7 *= testees[i+6];
sum8 *= testees[i+7];
}
Code C - assembly for the main loop
movzbl 14(%rsp), %ebp
salq $8, %rbp
addq %r11, %rbp
movzbl (%rax), %r13d
movzbl 0(%rbp,%r13), %ebp
movb %bpl, 14(%rsp)
movzbl %r10b, %r10d
salq $8, %r10
addq %r11, %r10
movzbl 1(%rax), %r13d
movzbl (%r10,%r13), %r10d
movzbl %r9b, %r9d
salq $8, %r9
addq %r11, %r9
movzbl 2(%rax), %r13d
movzbl (%r9,%r13), %r9d
movzbl %r8b, %r8d
salq $8, %r8
addq %r11, %r8
movzbl 3(%rax), %r13d
movzbl (%r8,%r13), %r8d
movzbl %dil, %edi
salq $8, %rdi
addq %r11, %rdi
movzbl 4(%rax), %r13d
movzbl (%rdi,%r13), %edi
movzbl %sil, %esi
salq $8, %rsi
addq %r11, %rsi
movzbl 5(%rax), %r13d
movzbl (%rsi,%r13), %esi
movzbl %cl, %ecx
salq $8, %rcx
addq %r11, %rcx
movzbl 6(%rax), %r13d
movzbl (%rcx,%r13), %ecx
movzbl %dl, %edx
salq $8, %rdx
addq %r11, %rdx
movzbl 7(%rax), %r13d
movzbl (%rdx,%r13), %edx
addq $8, %rax
cmpq %r12, %rax
jne .L171
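The unroll-by-N pattern of codes B and C can be written generically; a sketch (using `operator*=` on a generic type, so regrouping requires an associative, commutative multiply, which holds for the GF2 field multiply in the question):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Generic version of the unroll-by-N pattern used in codes B and C:
// N accumulators form N independent dependency chains, combined at the end.
template <std::size_t N, class T>
T reduceUnrolled(const std::vector<T>& v, T identity) {
    std::array<T, N> acc;
    acc.fill(identity);
    std::size_t i = 0;
    for (; i + N <= v.size(); i += N)
        for (std::size_t k = 0; k < N; ++k) // N independent chains per pass
            acc[k] *= v[i + k];
    for (; i < v.size(); ++i)               // leftover tail
        acc[0] *= v[i];
    T result = identity;
    for (const T& a : acc)                  // combine the partial products
        result *= a;
    return result;
}
```

With `N = 1` this degenerates to code A's single serial chain; `N = 4` and `N = 8` correspond to codes B and C.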
Tentative conclusion & some issues I have with it
Gathering all the information in one place, this is the conclusion I could draw.
- According to llvm-mca, it is possible to dispatch four instructions on each cycle, even for code A - given that memory latency is actually as short as presented in the diagrams above.
- However, any memory latency information provided by llvm-mca may be gravely incorrect, as it "does not know anything about cache hierarchy and memory types" - as quoted above.
- Now, when the memory hierarchy is actually considered, the latency predictions made by llvm-mca would be gravely incorrect for the `movzbl`s that fetch from multTable[sum.rep][other.rep] - all the other accesses should hit cache, if my cache is sane. These `movzbl`s will result in a cache miss in many cases - and might take hundreds of cycles.
- Code B removes some dependencies - a.k.a. hazards - by effectively writing the assembly loop of code A four times and renaming some registers/memory locations.
- If I try to draw a diagram similar to the ones generated by llvm-mca above, but this time taking cache misses into consideration, I should get a rough estimate of the total running time...?
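Such a rough estimate can be sketched as a toy latency/throughput model. All the numbers plugged in below are assumptions for illustration, not measurements:

```cpp
#include <algorithm>

// Toy back-of-the-envelope model: n multiplies split over k independent
// chains take roughly (n / k) * chainLatency cycles, until the front end
// becomes the bottleneck at n * uopsPerMultiply / dispatchWidth cycles.
double cyclesEstimate(double n, double k, double chainLatency,
                      double uopsPerMultiply, double dispatchWidth) {
    double latencyBound    = (n / k) * chainLatency;
    double throughputBound = n * uopsPerMultiply / dispatchWidth;
    return std::max(latencyBound, throughputBound);
}
```

With, say, 10 cycles of latency per chain step, ~6 uops per multiply, and a dispatch width of 4, this model predicts near-linear speedup going from 1 chain to 4 chains and diminishing returns beyond that - qualitatively matching what codes A, B, and C showed.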
I am still puzzled
- If my CPU can dispatch only four instructions per cycle, what could explain the 8x speedup of code C compared to code A?
Future plans
- Now this is clearly becoming a question unfit for Stack Overflow - I will ask another question elsewhere, or look into it further.