(Note: there is an update at the bottom of the question)

Look at this stripped down small_vector benchmark:

#include <cstdlib>

#define NOINLINE __attribute__((noinline))

class Array {
public:
    Array() : m_data(m_buffer), m_size(0) {}
    ~Array() {
        // if (m_data != m_buffer) free(m_data);
    }

    void append(int v) {
        if (m_size >= 3) expand();
        m_data[m_size++] = v;
    }

private:
    int *m_data;
    int m_size;
    int m_buffer[3];

    NOINLINE void expand() {}
};

NOINLINE
void add(Array &array) {
    array.append(11);
    array.append(22);
    array.append(33);
}

int main() {
    for (int i = 0; i < 1000000000; i++) {
        Array array;
        add(array);
    }
}

(NOINLINE is used to reproduce the inlining decisions the compiler made for the original small_vector code)

If this code is compiled with clang 11, it becomes faster when I uncomment the commented line in ~Array (note that the free call is never executed). On my machine (i7 8750), the difference is 18%. On quick-bench.com, the difference is smaller, 5.3%.
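
For clarity, the only source difference in the "fast" variant is that the destructor body is compiled in. A sketch (the free call is still never reached, because the stub expand() never reallocates, so m_data always equals m_buffer):

    ~Array() {
        if (m_data != m_buffer) free(m_data);
    }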

I understand that this is a microbenchmark and wild things can happen, but: add is 37 instructions / 120 bytes of code, so it is not that small, and it is identical in both versions. The only difference is main, which is not that different either; it's just that the loop is compiled a little bit differently. Yet there is a large performance difference, and the version with more instructions/branches runs faster. If I run perf stat -d -d -d, I don't see anything suspicious (no significant difference in branch/cache misses), but the insn/cycle difference is huge: 2.40 vs 3.12:

Slow

       3939.23 msec task-clock                #    1.000 CPUs utilized
            10      context-switches          #    0.003 K/sec
             0      cpu-migrations            #    0.000 K/sec
           107      page-faults               #    0.027 K/sec
   13345669446      cycles                    #    3.388 GHz                      (38.26%)
   32029499558      instructions              #    2.40  insn per cycle           (45.98%)
    6005787151      branches                  # 1524.610 M/sec                    (45.99%)
         71062      branch-misses             #    0.00% of all branches          (46.09%)
    6000238616      L1-dcache-loads           # 1523.202 M/sec                    (46.20%)
        180237      L1-dcache-load-misses     #    0.00% of all L1-dcache accesses  (46.30%)
         35516      LLC-loads                 #    0.009 M/sec                    (30.87%)
         13655      LLC-load-misses           #   38.45% of all LL-cache accesses  (30.87%)
not supported       L1-icache-loads
        545548      L1-icache-load-misses                                         (30.87%)
    6003584439      dTLB-loads                # 1524.051 M/sec                    (30.86%)
          5290      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (30.76%)
          4583      iTLB-loads                #    0.001 M/sec                    (30.65%)
          4222      iTLB-load-misses          #   92.12% of all iTLB cache accesses  (30.55%)
not supported       L1-dcache-prefetches
not supported       L1-dcache-prefetch-misses

   3.939756460 seconds time elapsed

   3.939678000 seconds user
   0.000000000 seconds sys

Fast

       3316.00 msec task-clock                #    1.000 CPUs utilized
             5      context-switches          #    0.002 K/sec
             0      cpu-migrations            #    0.000 K/sec
           110      page-faults               #    0.033 K/sec
   11235910328      cycles                    #    3.388 GHz                      (38.24%)
   35013565821      instructions              #    3.12  insn per cycle           (45.96%)
    7002622651      branches                  # 2111.770 M/sec                    (45.96%)
         59596      branch-misses             #    0.00% of all branches          (46.02%)
    7001546754      L1-dcache-loads           # 2111.446 M/sec                    (46.14%)
        143554      L1-dcache-load-misses     #    0.00% of all L1-dcache accesses  (46.26%)
         20608      LLC-loads                 #    0.006 M/sec                    (30.88%)
          3562      LLC-load-misses           #   17.28% of all LL-cache accesses  (30.88%)
not supported       L1-icache-loads
        431694      L1-icache-load-misses                                         (30.88%)
    7003243717      dTLB-loads                # 2111.958 M/sec                    (30.88%)
          3296      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (30.82%)
          2836      iTLB-loads                #    0.855 K/sec                    (30.70%)
          3436      iTLB-load-misses          #  121.16% of all iTLB cache accesses  (30.58%)
not supported       L1-dcache-prefetches
not supported       L1-dcache-prefetch-misses

   3.316414943 seconds time elapsed

   3.312479000 seconds user
   0.003995000 seconds sys

Note: if I unroll the loop manually in main to this:

    for (int i = 0; i < 250000000; i++) {
        {Array array; add(array);}
        {Array array; add(array);}
        {Array array; add(array);}
        {Array array; add(array);}
    }

the performance difference stays the same.

Do you have any clue what the cause of the difference is? Any tips on which perf events to check?

For reference, here are the asm listings:

Slow main

  401190:   55                      push   rbp
  401191:   41 56                   push   r14
  401193:   53                      push   rbx
  401194:   48 83 ec 20             sub    rsp,0x20
  401198:   bd 00 ca 9a 3b          mov    ebp,0x3b9aca00
  40119d:   4c 8d 74 24 14          lea    r14,[rsp+0x14]
  4011a2:   48 8d 5c 24 08          lea    rbx,[rsp+0x8]
  4011a7:   66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
  4011ae:   00 00 
  4011b0:   4c 89 74 24 08          mov    QWORD PTR [rsp+0x8],r14
  4011b5:   c7 44 24 10 00 00 00    mov    DWORD PTR [rsp+0x10],0x0
  4011bc:   00 
  4011bd:   48 89 df                mov    rdi,rbx
  4011c0:   e8 4b ff ff ff          call   401110 <add(Array&)>
  4011c5:   83 c5 ff                add    ebp,0xffffffff
  4011c8:   75 e6                   jne    4011b0 <main+0x20>
  4011ca:   31 c0                   xor    eax,eax
  4011cc:   48 83 c4 20             add    rsp,0x20
  4011d0:   5b                      pop    rbx
  4011d1:   41 5e                   pop    r14
  4011d3:   5d                      pop    rbp
  4011d4:   c3                      ret    

Fast main

  4011b0:   55                      push   rbp
  4011b1:   41 56                   push   r14
  4011b3:   53                      push   rbx
  4011b4:   48 83 ec 20             sub    rsp,0x20
  4011b8:   bd 00 ca 9a 3b          mov    ebp,0x3b9aca00
  4011bd:   48 8d 5c 24 14          lea    rbx,[rsp+0x14]
  4011c2:   4c 8d 74 24 08          lea    r14,[rsp+0x8]
  4011c7:   eb 0c                   jmp    4011d5 <main+0x25>
  4011c9:   0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
  4011d0:   83 c5 ff                add    ebp,0xffffffff
  4011d3:   74 26                   je     4011fb <main+0x4b>
  4011d5:   48 89 5c 24 08          mov    QWORD PTR [rsp+0x8],rbx
  4011da:   c7 44 24 10 00 00 00    mov    DWORD PTR [rsp+0x10],0x0
  4011e1:   00 
  4011e2:   4c 89 f7                mov    rdi,r14
  4011e5:   e8 46 ff ff ff          call   401130 <add(Array&)>
  4011ea:   48 8b 7c 24 08          mov    rdi,QWORD PTR [rsp+0x8]
  4011ef:   48 39 df                cmp    rdi,rbx
  4011f2:   74 dc                   je     4011d0 <main+0x20>
  4011f4:   e8 37 fe ff ff          call   401030 <free@plt>
  4011f9:   eb d5                   jmp    4011d0 <main+0x20>
  4011fb:   31 c0                   xor    eax,eax
  4011fd:   48 83 c4 20             add    rsp,0x20
  401201:   5b                      pop    rbx
  401202:   41 5e                   pop    r14
  401204:   5d                      pop    rbp
  401205:   c3                      ret    

Add

(same for slow and fast)

  401130:   53                      push   rbx
  401131:   48 89 fb                mov    rbx,rdi
  401134:   8b 4f 08                mov    ecx,DWORD PTR [rdi+0x8]
  401137:   83 f9 03                cmp    ecx,0x3
  40113a:   7c 0b                   jl     401147 <add(Array&)+0x17>
  40113c:   48 89 df                mov    rdi,rbx
  40113f:   e8 cc 00 00 00          call   401210 <Array::expand()>
  401144:   8b 4b 08                mov    ecx,DWORD PTR [rbx+0x8]
  401147:   48 8b 03                mov    rax,QWORD PTR [rbx]
  40114a:   8d 51 01                lea    edx,[rcx+0x1]
  40114d:   89 53 08                mov    DWORD PTR [rbx+0x8],edx
  401150:   48 63 c9                movsxd rcx,ecx
  401153:   c7 04 88 0b 00 00 00    mov    DWORD PTR [rax+rcx*4],0xb
  40115a:   8b 4b 08                mov    ecx,DWORD PTR [rbx+0x8]
  40115d:   83 f9 03                cmp    ecx,0x3
  401160:   7c 0e                   jl     401170 <add(Array&)+0x40>
  401162:   48 89 df                mov    rdi,rbx
  401165:   e8 a6 00 00 00          call   401210 <Array::expand()>
  40116a:   8b 4b 08                mov    ecx,DWORD PTR [rbx+0x8]
  40116d:   48 8b 03                mov    rax,QWORD PTR [rbx]
  401170:   8d 51 01                lea    edx,[rcx+0x1]
  401173:   89 53 08                mov    DWORD PTR [rbx+0x8],edx
  401176:   48 63 c9                movsxd rcx,ecx
  401179:   c7 04 88 16 00 00 00    mov    DWORD PTR [rax+rcx*4],0x16
  401180:   8b 4b 08                mov    ecx,DWORD PTR [rbx+0x8]
  401183:   83 f9 03                cmp    ecx,0x3
  401186:   7c 0e                   jl     401196 <add(Array&)+0x66>
  401188:   48 89 df                mov    rdi,rbx
  40118b:   e8 80 00 00 00          call   401210 <Array::expand()>
  401190:   8b 4b 08                mov    ecx,DWORD PTR [rbx+0x8]
  401193:   48 8b 03                mov    rax,QWORD PTR [rbx]
  401196:   8d 51 01                lea    edx,[rcx+0x1]
  401199:   89 53 08                mov    DWORD PTR [rbx+0x8],edx
  40119c:   48 63 c9                movsxd rcx,ecx
  40119f:   c7 04 88 21 00 00 00    mov    DWORD PTR [rax+rcx*4],0x21
  4011a6:   5b                      pop    rbx
  4011a7:   c3                      ret    

UPDATE

I managed to make the difference even larger by changing main to this (it's just an added malloc and a free call):

void *d;
int main() {
    d = malloc(1);
    for (int i = 0; i < 1000000000; i++) {
        Array array;
        add(array);
    }
    free(d);
}

With this main, the slow version becomes even slower; the difference between slow and fast is now ~30%!

geza
  • Your CPU is Skylake-family so the JCC erratum is a possible cause of lower front-end throughput. Worth checking the `idq.mite_uops` event to see if a significant fraction of your `uops_issued.any` came from legacy decode instead of the uop cache. If so, look for any branch crossing or touching the bottom of a 32-byte boundary. [32-byte aligned routine does not fit the uops cache](https://stackoverflow.com/a/61016915). (And/or try compiler / assembler options to have it avoid the problem: [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646)) – Peter Cordes Jan 07 '22 at 12:11
  • 1) You can post more readable assembly by telling your compiler to emit it directly (`-S`), instead of disassembling. 2) Did you enable compiler optimizations? 3) Did you use a benchmarking framework like [coz](https://github.com/plasma-umass/coz) to ensure your speedup is actually causally related to the change? – EOF Jan 07 '22 at 12:13
  • @PeterCordes: Thanks! I'm not familiar with these things (yet :) ); here are the event counters (`uops_issued.any`/`idq.mite_uops`; a sketch of the collection command is below this thread): slow: 32,276,403,912/864,863,938, fast: 32,701,518,131/154,423,750. So the `idq.mite_uops` is ~5.5x higher for the slow case, but in both cases, it's insignificant compared to `uops_issued.any` (I'm not sure whether these have the same measurement unit, so I don't know whether it's a significant fraction or not). – geza Jan 07 '22 at 12:38
  • @EOF: 1) I like disassembly better because it has intel syntax, and also code alignment is visible. 2) Yes (`-O3`). 3) I'll check coz, thanks, but I think that the speedup is related to the change (I'm not sure what causally means in this context, but I see that coz also uses this term, so I think I'll learn the meaning if I read the docs). – geza Jan 07 '22 at 12:44
  • `gcc -masm=intel -S` uses the same GAS `.intel_syntax noprefix` that `objdump -d -Mintel` uses. But yeah, disassembly is actually good in this case because some microarchitectural effects are related to code alignment. Re: "causally": that's just an English / science word, not a benchmark term. "causally related" means "one thing is caused by another", as opposed to "these things happen together, caused by some third thing". (you've probably heard "correlation is not causation"; EOF is suggesting ruling out benchmark methodology errors to actually check for causation). – Peter Cordes Jan 07 '22 at 12:48
  • Peter, thanks again! @EOF I'm not sure that we can say that the speedup is strictly causally related to the change. But nevertheless, we have two programs, almost identical, and they should run at a similar speed. I'm curious about the reason for the speed difference. – geza Jan 07 '22 at 12:56
  • clang 10+ supports `-mbranches-within-32B-boundaries`. So you can try it to rule out JCC erratum quickly. – rustyx Jan 07 '22 at 15:19
  • @rustyx: Thanks for the suggestion, I tried it. It made both versions a little bit slower, and the speed difference is still there. So it seems that this issue is not caused by this JCC erratum. (but the code has been rearranged a little bit by the compiler, so the comparison is not entirely correct. But as the speed difference with `-mbranches-within-32B-boundaries` is almost the same as without it, I suppose that the root cause of the speed difference is still the same) – geza Jan 07 '22 at 15:57
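
For reference, counters like the ones quoted above can be collected with a perf invocation along these lines (the event names are the Skylake ones discussed in this thread; ./bench is a placeholder for the compiled benchmark):

    perf stat -e cycles,instructions,uops_issued.any,idq.mite_uops,idq.dsb_uops ./bench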

1 Answer


The problem comes from two combined issues:

  • the JCC erratum;
  • the alignment of the local variables in the stack.

Indeed, I can reproduce the problem on my i5-9600KF Skylake processor, and the event idq.mite_uops was much larger for the slow version than for the fast one, as pointed out by @PeterCordes in the comments. Using the Clang flag -mbranches-within-32B-boundaries appears to solve this problem, since idq.mite_uops is then much smaller. However, this is not enough to remove the difference between the two versions.

Moreover, instructions like mov DWORD PTR [rbx+0x8],edx or mov ecx,DWORD PTR [rdi+0x8] are significantly slower in the slow version. The three mov DWORD PTR [rax+rcx*4],xxx stores take a lot of time in both versions, but especially in the slow one. Note that rax+rcx*4 points into the stack of the main function. It turns out that rbx and rbx+0x8 point to two different cache lines in the slow version, while they point to the same cache line in the fast version. Additionally, rax+rcx*4 points to the same cache line as rbx+0x8 in both versions. Using the Clang flag -mstackrealign removes the difference between the two versions on my machine (the resulting time is slower than the fast version but faster than the slow version). However, the alignment is quite bad for both versions with this flag. A better alternative is simply to replace int *m_data with alignas(64) int *m_data (see the comments below for a more portable spelling using the interference-size constants from <new>).
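
A minimal sketch of the class with that change applied, assuming everything else stays exactly as in the question:

    #include <cstdlib>

    #define NOINLINE __attribute__((noinline))

    class Array {
    public:
        Array() : m_data(m_buffer), m_size(0) {}
        ~Array() {
            // if (m_data != m_buffer) free(m_data);
        }

        void append(int v) {
            if (m_size >= 3) expand();
            m_data[m_size++] = v;
        }

    private:
        // alignas on the first member raises the alignment of the whole object
        // to 64 bytes, so m_data, m_size and m_buffer (24 bytes of data in total)
        // always land in a single cache line.
        alignas(64) int *m_data;
        int m_size;
        int m_buffer[3];

        NOINLINE void expand() {}
    };

As Peter Cordes notes in the comments below, alignas(32) would already be enough here, since the object holds less than 32 bytes of data.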

            flags / config                       |  "slow"  |  "fast"
nothing                                          |   2,950  |   2,418
-mbranches-within-32B-boundaries                 |   2,868  |   2,472
-mstackrealign                                   |   2,669  |   2,833
-mstackrealign -mbranches-within-32B-boundaries  |   2,701  |   2,725
alignas(64)                                      |   2,585  |   2,583
-mbranches-within-32B-boundaries alignas(64)     |   2,497  |   2,499
Jérôme Richard
  • Giving the class 64-byte alignment with `alignas(64)` on the first member seems overkill for a 6-int struct (assuming pointers are the width of 2 ints); it's less than 32 bytes so `alignas(32)` would ensure that a single such struct is never split across a cache-line boundary. – Peter Cordes Jan 10 '22 at 02:51
  • (I forget the details of this problem, but is there multi-threading here? If no, I wouldn't expect `hardware_destructive_interference_size` to be relevant. That would be 128 on SnB-family to account for stuff like adjacent-line L2 prefetching, if compilers chose to define it; they don't because it would become part of the ABI if used as intended, but that would defeat the purpose of having it be relevant to current HW.) – Peter Cordes Jan 10 '22 at 02:55
  • Thanks for the answer! I tried aligning, and it does speed up the slow version. However, I'm not entirely sure about this. If I intentionally misalign `m_data` (for example, add an `alignas(64) char pad[8];` member before `int *m_data`), the program still runs fast. I tried all the paddings N*8 (N=0..7), and it always runs fast. Plus, the necessary data should be in the cache all the time. Is there any known penalty if the cache is continuously written to different cache lines as opposed to the same cache line? – geza Jan 10 '22 at 19:37
  • @PeterCordes Thank you for the information. Indeed, a 32-byte alignment should be enough. As for `hardware_destructive_interference_size`, it is just the only way I know to get a multiple of the cache-line size in a portable way in C++. I thought it was 64 on all x64 architectures. – Jérôme Richard Jan 11 '22 at 00:43
  • @geza `alignas(64) char pad[8];` should not cause a slowdown, because the fetched members will still all be stored in the same cache line. I would expect a slowdown with a padding of >=40 bytes (since the data structure should be 24 bytes wide). – Jérôme Richard Jan 11 '22 at 00:50
  • It's very likely that this is not an alignment issue, but some kind of data dependency issue. If I modify the slow version so that `main` uses `r15` instead of `rbx`, it becomes fast. As `add` pops `rbx` at the end, that may cause some kind of stall later (for example, when `main` moves it to `rdi`, or even later, when `add` uses `edi`). – geza Jan 11 '22 at 00:50
  • @geza Complex things like store forwarding might play a role. I did not succeed in finding evidence of such a hardware optimization using `perf` hardware events, though. I was initially thinking about a dependency issue too. The thing is, I did not find any hardware counter showing *precisely* and *clearly* the source of the problem on Skylake... The best I found is that a lot of instructions are stalling (in the backend, AFAIR) and that the source of the slowdown comes from memory mov instructions. – Jérôme Richard Jan 11 '22 at 01:10
  • You can check it out; I put the asm here (you can compile it with clang just as if it were a cpp file): https://pastebin.com/EKHG3q98. I marked two lines with "####". If you change the register from `r15` to `rbx` there, the program will slow down considerably. Another weird thing is that the performance of the slow version varies a lot, between 1.25-1.45 sec, while the fast version runs in 1.0 sec very consistently. – geza Jan 11 '22 at 01:13
  • @JérômeRichard: The cache line size *is* 64 bytes on all x86-64 architectures. That's why an aligned pair of cache lines is 128 bytes. If you want the line size, use `hardware_constructive_interference_size` - how close things have to be to definitely get pulled in together, rather than *maybe* pulled in together when you don't want it. That's why `...destructive...` size (should) take into account prefetcher behaviour that affects multiple cache lines. (Separate sizes can also account for systems where different levels of cache have different line sizes.) – Peter Cordes Jan 11 '22 at 01:17
  • Yeah, me neither. I'm not too familiar with perf events, so I just brute-forced it: I did a stat on all events, one by one, and hoped I'd find something. But I didn't find any obvious event that shows the cause of the difference. – geza Jan 11 '22 at 01:18
  • @geza and JérômeRichard: With store forwarding possibly involved, that raises the issue of Sandybridge-family's variable-latency store-forwarding, which can be less than 5 cycles, but *only if you don't try to reload until the data's actually ready* - [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685). IIRC the best case is something like 3 cycles. Sometimes adding NOPs or other front-end slowdowns can keep the reload out of the back-end for long enough and actually be an overall speedup. – Peter Cordes Jan 11 '22 at 01:24
  • (Of course usually the best solution is to keep the data out of memory; the adding NOPs thing is mostly something you'd do in an experiment to investigate the effect, not as a final optimization.) – Peter Cordes Jan 11 '22 at 01:26
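
For completeness, a sketch of the portable spelling touched on in these comments. This is an illustration rather than code from the answer: it assumes a C++17 standard library that actually defines these optional constants, and it uses the constructive interference size as "one cache line", as suggested above:

    #include <new>  // std::hardware_constructive_interference_size (C++17, optional)

    class Array {
        // Same members as in the question, but aligned to one cache line via the
        // standard constant instead of a hard-coded 64 (on typical x86-64 the
        // value is 64, so the effect is the same).
        alignas(std::hardware_constructive_interference_size) int *m_data;
        int m_size;
        int m_buffer[3];
        // ... constructor, append() and expand() as in the question ...
    };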