
Look at this code:

one.cpp:

bool test(int a, int b, int c, int d);

int main() {
        volatile int va = 1;    // volatile: forces real loads, so 1..4 can't be constant-propagated
        volatile int vb = 2;
        volatile int vc = 3;
        volatile int vd = 4;

        int a = va;
        int b = vb;
        int c = vc;
        int d = vd;

        int s = 0;
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        for (int i=0; i<2000000000; i++) {
                s += test(a, b, c, d);
        }

        return s;
}

two.cpp:

bool test(int a, int b, int c, int d) {
        // return a == d || b == d || c == d;
        return false;
}

There are 16 nops in one.cpp. You can comment/uncomment them to toggle the alignment of the loop's entry point between 16 and 32 bytes. I've compiled with g++ one.cpp two.cpp -O3 -mtune=native.

Here are my questions:

  1. The 32-aligned version is faster than the 16-aligned version: on Sandy Bridge the difference is 20%, on Haswell 8%. Why the difference?
  2. With the 32-aligned version, the code runs at the same speed on Sandy Bridge no matter which return statement is in two.cpp. I thought the return false version should be at least a little faster, but no, exactly the same speed!
  3. If I remove the volatiles from one.cpp, the code becomes slower (Haswell: before ~2.17 sec, after ~2.38 sec). Why is that? (This only happens when the loop is aligned to 32.)

The fact that the 32-aligned version is faster is strange to me, because the Intel® 64 and IA-32 Architectures Optimization Reference Manual says (page 3-9):

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.

Another little question: are there any tricks to make only this loop 32-aligned (so the rest of the code can keep using 16-byte alignment)?

Note: I've tried gcc 6, gcc 7 and clang 3.9; same results.


Here's the code with volatile (the code is the same for 16/32 alignment, just the addresses differ):

0000000000000560 <main>:
 560:   41 57                   push   r15
 562:   41 56                   push   r14
 564:   41 55                   push   r13
 566:   41 54                   push   r12
 568:   55                      push   rbp
 569:   31 ed                   xor    ebp,ebp
 56b:   53                      push   rbx
 56c:   bb 00 94 35 77          mov    ebx,0x77359400
 571:   48 83 ec 18             sub    rsp,0x18
 575:   c7 04 24 01 00 00 00    mov    DWORD PTR [rsp],0x1
 57c:   c7 44 24 04 02 00 00    mov    DWORD PTR [rsp+0x4],0x2
 583:   00 
 584:   c7 44 24 08 03 00 00    mov    DWORD PTR [rsp+0x8],0x3
 58b:   00 
 58c:   c7 44 24 0c 04 00 00    mov    DWORD PTR [rsp+0xc],0x4
 593:   00 
 594:   44 8b 3c 24             mov    r15d,DWORD PTR [rsp]
 598:   44 8b 74 24 04          mov    r14d,DWORD PTR [rsp+0x4]
 59d:   44 8b 6c 24 08          mov    r13d,DWORD PTR [rsp+0x8]
 5a2:   44 8b 64 24 0c          mov    r12d,DWORD PTR [rsp+0xc]
 5a7:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
 5ac:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
 5b3:   00 00 00 
 5b6:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
 5bd:   00 00 00 
 5c0:   44 89 e1                mov    ecx,r12d
 5c3:   44 89 ea                mov    edx,r13d
 5c6:   44 89 f6                mov    esi,r14d
 5c9:   44 89 ff                mov    edi,r15d
 5cc:   e8 4f 01 00 00          call   720 <test(int, int, int, int)>
 5d1:   0f b6 c0                movzx  eax,al
 5d4:   01 c5                   add    ebp,eax
 5d6:   83 eb 01                sub    ebx,0x1
 5d9:   75 e5                   jne    5c0 <main+0x60>
 5db:   48 83 c4 18             add    rsp,0x18
 5df:   89 e8                   mov    eax,ebp
 5e1:   5b                      pop    rbx
 5e2:   5d                      pop    rbp
 5e3:   41 5c                   pop    r12
 5e5:   41 5d                   pop    r13
 5e7:   41 5e                   pop    r14
 5e9:   41 5f                   pop    r15
 5eb:   c3                      ret    
 5ec:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

Without volatile:

0000000000000560 <main>:
 560:   55                      push   rbp
 561:   31 ed                   xor    ebp,ebp
 563:   53                      push   rbx
 564:   bb 00 94 35 77          mov    ebx,0x77359400
 569:   48 83 ec 08             sub    rsp,0x8
 56d:   66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
 574:   00 00 
 576:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
 57d:   00 00 00 
 580:   b9 04 00 00 00          mov    ecx,0x4
 585:   ba 03 00 00 00          mov    edx,0x3
 58a:   be 02 00 00 00          mov    esi,0x2
 58f:   bf 01 00 00 00          mov    edi,0x1
 594:   e8 47 01 00 00          call   6e0 <test(int, int, int, int)>
 599:   0f b6 c0                movzx  eax,al
 59c:   01 c5                   add    ebp,eax
 59e:   83 eb 01                sub    ebx,0x1
 5a1:   75 dd                   jne    580 <main+0x20>
 5a3:   48 83 c4 08             add    rsp,0x8
 5a7:   89 e8                   mov    eax,ebp
 5a9:   5b                      pop    rbx
 5aa:   5d                      pop    rbp
 5ab:   c3                      ret    
 5ac:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]
geza
  • Did you look at the generated instructions? If so, is it surprising that they're faster/slower? Optimizers aren't infallible. Sometimes they get lucky, sometimes they don't. Changing small things can change how the optimizer chooses to handle a particular case. If you think you have a significant performance issue for the optimizer, file a performance bug with gcc. – xaxxon Jul 25 '17 at 09:33
  • things that are 32-aligned are also 16-aligned, just (potentially) half as dense – Caleth Jul 25 '17 at 09:35
  • @Caleth yet he claims to be getting better performance with the less dense code. – xaxxon Jul 25 '17 at 09:36
  • @geza please post the instructions generated by the compiler for each of your cases. That way we can make sure we're all talking about the same thing. gcc.godbolt.org links would be nice, too. You're not really asking any c++ questions, so posting the C++ code isn't that useful. – xaxxon Jul 25 '17 at 09:37
  • @xaxxon: disassembly added – geza Jul 25 '17 at 09:54
  • Could it be something related to the instruction cache and the way the CPU fetches data from it? – dirac3000 Jul 25 '17 at 09:58
  • If you happen to be writing assembly anyway: The gas `.balign` instruction can emit `nop` bytes, so you can use `.balign` before the loop label. – ndim Jul 25 '17 at 12:41
  • @ndim: no, I'd like to stay in C++. But I'd like to use the knowledge that aligning this loop to 32 makes the code faster. Thanks for the suggestion, anyway! – geza Jul 25 '17 at 12:58
  • @geza: Other than secondary effects from code-size, `mov r,r` isn't faster than `mov r,imm` on Sandybridge. Probably removing `volatile` broke your silly hand-padded alignment. What if you use `-falign-loops=32` or use `asm(".p2align 5");`? mov-elimination wasn't introduced until Ivybridge. So if you were only testing SnB and HSW, they both have the same throughput (3 per clock or 4 per clock). `mov r,imm` does use an ALU port on HSW, though. – Peter Cordes Jul 29 '17 at 08:51
  • @PeterCordes: you can see the addresses: 0x5b0 and 0x570; they are padded to 16 (the compiler already does 16-byte alignment). With the "nops", I can just toggle the alignment between 16/32. – geza Jul 29 '17 at 09:02
  • Ah I see. I'd highly recommend not using single-byte NOPs. They will use up a lot of entries in the uop cache, and might be stopping the first 16B of your loop from fitting in the uop cache. (Which will stop it from executing from the loop buffer). I know they're not part of the loop, but I'm not sure how the CPU handles that case, where a bunch of uops that did execute before the loop are part of the same 32B block of code. – Peter Cordes Jul 29 '17 at 09:04
  • @PeterCordes, yes, thanks for the tip. But this still happens when the loop is aligned to 32 (I've edited the question to mention this). And these nops are outside the loop. Is the uop cache involved even in this case? – geza Jul 29 '17 at 09:08
  • @PeterCordes: These nops don't matter. If I remove them and use `-falign-loops=32`, it still happens. 32-alignment is **needed** to see the difference (if it is only aligned to 16, there's no difference; tested on Haswell). I've edited my post – geza Jul 29 '17 at 09:22

1 Answer


This doesn't answer point 2 (return a == d || b == d || c == d; being the same speed as return false). That's still a maybe-interesting question, since that must compile to multiple uop-cache lines of instructions.


The fact that 32-aligned version is faster, is strange to me, because [Intel's manual says to align to 16]

That optimization-guide advice is a very general guideline, and definitely doesn't mean that larger never helps. Usually it doesn't, and padding to 32 would be more likely to hurt than help. (I-cache misses, ITLB misses, and more code bytes to load from disk).

In fact, 16B alignment is rarely necessary, especially on CPUs with a uop cache. For a small loop that can run from the loop buffer, alignment is usually totally irrelevant.

(Skylake microcode updates disabled the loop buffer to work around a partial-register AH-merging bug, SKL150. This creates problems for tiny loops that span a 32-byte boundary, running only one iteration per 2 clocks instead of the one iteration per 1.5 clocks you might get from a 6-uop loop on Haswell, or on SKL with older microcode. The LSD is not re-enabled until Ice Lake; it stays disabled in Kaby/Coffee/Comet Lake, which are the same microarchitecture as SKL/SKX.)

Another SKL erratum workaround created another worse code-alignment pothole: How can I mitigate the impact of the Intel jcc erratum on gcc?


16B is still not bad as a broad recommendation, but it doesn't tell you everything you need to know to understand one specific case on a couple of specific CPUs.

Compilers usually default to aligning loop branches and function entry-points, but usually don't align other branch targets. The cost of executing a NOP (and code bloat) is often larger than the likely cost of an unaligned non-loop branch target.
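
To address the side question about aligning only this one loop: -falign-loops=32 applies to every loop in the translation unit, but an assembler directive emitted through inline asm (as suggested in the comments above) pads only the spot where it appears. Here is a minimal sketch, not code from the original post, assuming GCC/Clang emitting GNU-assembler syntax for x86; check the disassembly afterwards, because the compiler is free to schedule instructions between the directive and the loop entry:

bool test(int a, int b, int c, int d);

int run_loop(int a, int b, int c, int d) {   // hypothetical wrapper, only for illustration
        int s = 0;
        // .p2align 5 pads with nops up to the next 2^5 = 32-byte boundary,
        // so the loop entry below should start 32-byte aligned while the rest
        // of the file keeps the compiler's default alignment.
        __asm__ volatile (".p2align 5");
        for (int i = 0; i < 2000000000; i++) {
                s += test(a, b, c, d);
        }
        return s;
}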


Code alignment has some direct and some indirect effects. The direct effects include the uop cache on Intel SnB-family. For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.

Another section of Intel's optimization manual goes into some detail about how the uop cache works:

2.3.2.2 Decoded ICache:

  • All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block).
  • A multi micro-op instruction cannot be split across Ways.
  • An instruction which turns on the MSROM consumes an entire Way.
  • Up to two branches are allowed per Way.
  • A pair of macro-fused instructions is kept as one micro-op.

See also Agner Fog's microarch guide. He adds:

  • An unconditional jump or call always ends a μop cache line
  • lots of other stuff that probably isn't relevant here.

Also, if your code doesn't fit in the uop cache, it can't run from the loop buffer.


The indirect effects of alignment include:

  • larger/smaller code-size (L1I cache misses, TLB). Not relevant for your test
  • which branches alias each other in the BTB (Branch Target Buffer).

If I remove volatiles from one.cpp, code becomes slower. Why is that?

Without volatile, the constants are materialized with mov r32, imm32 (5 bytes each) instead of the 3-byte mov r32, r32 register copies of the volatile version, so the loop body grows by 8 bytes and its final sub/jne gets pushed across a 32B boundary:

 59e:   83 eb 01                sub    ebx,0x1
 5a1:   75 dd                   jne    580 <main+0x20>

So if you aren't running from the loop buffer (LSD), then without volatile one of the uop-cache fetch cycles gets only 1 uop.

If sub/jne macro-fuses, this might not apply. And I think only crossing a 64B boundary would break macro-fusion.

Also, those aren't real addresses. Have you checked what the addresses are after linking? There could be a 64B boundary there after linking, if the text section has less than 64B alignment.

Also related to 32-byte boundaries, the JCC erratum disables the uop cache for blocks where a branch (including macro-fused ALU+JCC) includes the last byte of the line, on Skylake CPUs. How can I mitigate the impact of the Intel jcc erratum on gcc?


Sorry, I haven't actually tested this to say more about this specific case. The point is, when you bottleneck on the front-end from stuff like having a call/ret inside a tight loop, alignment becomes important and can get extremely complex. Whether or not all later instructions cross a boundary is affected. Do not expect it to be simple. If you've read my other answers, you'll know I'm not usually the kind of person to say "it's too complicated to fully explain", but alignment can be that way.

See also Code alignment in one object file is affecting the performance of a function in another object file

In your case, make sure tiny functions can inline. Use link-time optimization if your code-base has any important tiny functions in separate .cpp files instead of in a .h where they can inline, or change your code to put them in a .h (see the sketch below).
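
A minimal sketch of the header approach (test.h is a hypothetical file name; this is an illustration, not code from the original post):

// test.h -- hypothetical header-only version of test().
// Defining it inline in a header makes the body visible in every translation
// unit, so the call/ret disappears from the hot loop and the measurement no
// longer depends on how the out-of-line function happens to be aligned.
inline bool test(int a, int b, int c, int d) {
        return a == d || b == d || c == d;
}

Alternatively, keep the separate .cpp files and let the linker inline across them with something like g++ -O3 -flto one.cpp two.cpp.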

Peter Cordes
  • Thanks Peter, it is a very useful answer with very useful links. Just some comments: these addresses are printed by objdump on the executable, so they are after linking. I think they are relative to the program segment, which is aligned to 4K (AFAIK). Thanks for the suggestion about inlining tiny functions. This time, I've intentionally created it this way, as I wanted to benchmark the `test()` function in isolation. But as this question and answer showed, this cannot be done at all; there are too many other factors which completely invalidate any result. – geza Jul 30 '17 at 00:06
  • @geza: It's too small for that to be meaningful. Its performance will depend on the surrounding code (which might bottleneck on latency of a dep chain, front-end throughput, or a specific execution port), and whether you're using it in an `if()` or other flow control vs. storing the boolean result to memory. – Peter Cordes Jul 30 '17 at 00:39
  • Also, whether all 4 inputs become ready simultaneously, or if some of the 4 variables are ready early, so out-of-order execution can overlap a bunch of work with other stuff. – Peter Cordes Jul 30 '17 at 00:42
  • Yepp :) Actually (sorry, I didn't make this clear in my question), my main concern was about 1.; the other two are not that important. I asked this question because this was not the first time I got a performance boost from aligning loops to 32, and now that I'm a member of SO, I was curious enough to actually ask it :) – geza Jul 30 '17 at 00:45
  • Interestingly, there seems [to be a reasonable chance](https://news.ycombinator.com/item?id=16198171) that the uop cache has been expanded to 64-byte blocks in Skylake (client). If true, it makes loop alignment less feasible: the amount of nops needed is very large. – BeeOnRope Jan 22 '18 at 08:28
  • @xiver77: Thanks, fixed, and updated with other code-alignment problems that errata workarounds have created. – Peter Cordes Jun 23 '22 at 18:10