
I found that x86-64 programs (at least those compiled with GCC) have functions start by default at addresses aligned to multiples of 16 bytes, and that the padding is done with NOP instructions carrying as many prefixes as will fit, to fill the space optimally. For example,

  (...)
  447454:   c3                              retq   
  447455:   90                              nop
  447456:   66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)

0000000000447460 <__libc_csu_fini>:
  447460:   f3 c3                           repz retq 

What's the advantage of this over filling the space with regular single-byte NOPs, as observed here or here?

The Vee
  • The answers to those linked questions seem to apply here as well – Hong Ooi Sep 22 '17 at 06:41
  • @HongOoi They say why aligning is good, but not why a `nopw %cs:0x0(%rax,%rax,1)` is better than 10× `nop`. Actually that possibility is only mentioned once; all the examples are just repeated `nop`s and one of the quotes even says "Aligning a subroutine entry is as simple as putting **as many NOP 's** as needed..." – The Vee Sep 22 '17 at 06:45
  • Aligning branch targets to a multiple of 16 is a standard optimization rule. It helps the instruction decoder deal with a branch misprediction, it doesn't have to plow through instructions that are not used. The REP prefix is a wonky one that helps AMD processors, it avoids a misprediction. Dated btw, they now recommend RET 0. Processors did not get easier to program ;) – Hans Passant Sep 22 '17 at 09:33
  • @HansPassant: `rep ret` is faster than `ret 0` on AMD Bulldozer-family and Ryzen. And also on Intel CPUs. They're the same speed on K8/K10. So it's a Good Thing that `gcc` is using `rep ret` instead of `ret 0` in its "generic" tuning that still cares about Athlon64 / PhenomII. There's no sense in switching now, since `rep ret` exists in the wild in binaries everywhere (from `gcc`'s default output), so no CPU vendor will repurpose that byte sequence for decades. https://stackoverflow.com/questions/20526361/what-does-rep-ret-mean/32347393#32347393. `ret 0` would have been "safer" initially. – Peter Cordes Sep 22 '17 at 11:42
  • _Processors did not get easier to program_ Ain't that the truth. –  Sep 23 '17 at 14:27

1 Answer


There's no downside, so why not? It makes the disassembly easier to read for humans, because you don't have a huge amount of lines separating functions.

GCC (the actual compiler part that transforms C to assembly) uses the same `.p2align` directive to ask the assembler to insert padding, whether it's inside a function to align branch targets, or between functions to align function entry points.

GCC could emit `.p2align 4,,0x90` to ask the assembler to fill with single-byte NOPs in cases where the NOPs won't be executed, but like I said, there's no reason to bother doing that instead of `.p2align 4` (pad out to the next 2^4 boundary with the default choice of filler).
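For illustration, here's a rough sketch (in Python, not the assembler's actual code) of the filling strategy: the recommended multi-byte NOP encodings from Intel's optimization manual cover 1–9 bytes, and an assembler can greedily emit the longest one that fits. The exact sequences GAS picks can differ slightly, e.g. the 10-byte CS-prefixed `nopw` from the question.

```python
# Sketch only: cover an alignment gap with as few NOP instructions as
# possible, using the recommended multi-byte NOP encodings (1..9 bytes)
# from Intel's optimization manual.
NOPS = {
    1: bytes.fromhex("90"),                  # nop
    2: bytes.fromhex("6690"),                # 66 nop
    3: bytes.fromhex("0f1f00"),              # nopl (%rax)
    4: bytes.fromhex("0f1f4000"),            # nopl 0x0(%rax)
    5: bytes.fromhex("0f1f440000"),          # nopl 0x0(%rax,%rax,1)
    6: bytes.fromhex("660f1f440000"),        # nopw 0x0(%rax,%rax,1)
    7: bytes.fromhex("0f1f8000000000"),      # nopl 0x0(%rax)
    8: bytes.fromhex("0f1f840000000000"),    # nopl 0x0(%rax,%rax,1)
    9: bytes.fromhex("660f1f840000000000"),  # nopw 0x0(%rax,%rax,1)
}

def pad_with_long_nops(gap):
    """Greedily cover `gap` bytes of padding with the fewest NOPs."""
    insns = []
    while gap > 0:
        size = min(gap, 9)
        insns.append(NOPS[size])
        gap -= size
    return insns
```

An 11-byte gap then becomes two instructions (9 + 2 bytes) instead of eleven single-byte `nop`s.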


If the end of the function is an indirect branch (a tail-call with `jmp [rax]` or something), speculative execution could run into these NOP instructions. Decoding many short NOPs could overflow the uop cache on Intel SnB-family, which caps each 32-byte block of machine code at 3 uop-cache lines of up to 6 uops each (see the microarch PDF at http://agner.org/optimize/). Long NOPs are potentially better for that.
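As a back-of-envelope check of those numbers (3 lines × 6 uops, from Agner Fog's microarch guide; this is arithmetic on the quoted limits, not a model of the real uop cache):

```python
import math

# SnB-family uop-cache limit quoted above: a 32-byte block of machine
# code can map to at most 3 uop-cache lines of up to 6 uops each.
# A block that decodes to more uops than that can't be cached at all.
MAX_LINES_PER_32B_BLOCK = 3
UOPS_PER_LINE = 6

def uop_cache_lines(uops_in_block):
    """uop-cache lines needed for one 32-byte block of code."""
    return math.ceil(uops_in_block / UOPS_PER_LINE)

def cacheable(uops_in_block):
    return uop_cache_lines(uops_in_block) <= MAX_LINES_PER_32B_BLOCK

short_nop_uops = 32  # 32 bytes of single-byte nop: one uop each
long_nop_uops = 4    # the same 32 bytes as four ~8-byte long NOPs
```

So 32 short `nop`s in one block would need 6 uop-cache lines and couldn't be cached, while a handful of long NOPs fits in a single line.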

IDK how Pentium4's trace cache builder behaved; maybe it was useful for that, too? Again, fewer longer NOP instructions are less likely to trigger anything weird in the front-end of a CPU before it figures out that the NOPs aren't executed.

MSVC pads with int3 between functions, IIRC, which will stop speculative execution. That's not a bad idea.

This is guesswork; it's probably not a real factor in performance. If it still mattered on modern CPUs, all compilers would probably avoid short NOPs between functions, but as one of your links showed, not all do.

Some CPUs, like AMD K8/K10 and Bulldozer-family, mark instruction-lengths in L1I cache. Agner Fog says that bandwidth from L2 to L1I is low on K8/K10, and guesses that it may be from adding extra pre-decode information. IDK if this takes longer when there are lots of small instructions? It would have to know where to start decoding, because the middle of an instruction can span a cache-line boundary. IDK how that works.


BTW, these instructions might be decoded as part of a group containing a normal ret, but I don't think there's anything to worry about either way in that case.

Decoding happens in 2 stages in some CPUs: first, instruction-length decoding, which finds blocks of up to 16 bytes containing up to 4 instructions (e.g. on Intel P6-family / Sandybridge-family). Then it feeds those blocks to the decoders.

With correct branch prediction for the ret, even nasty stuff like LCP stalls after the ret don't seem to hurt.

Anyway, I don't think this difference is significant. Decoded NOP instructions after a RET should be cancelled before they go anywhere, because the RET is an unconditional branch. It probably makes no difference whether the instruction-length decoder finds many single-byte instructions, vs. some prefixes but not the end of an instruction, before the end of a 16-byte window.

Peter Cordes
  • I've noted that both simple and long `nop` decode to a single uop dispatched to the IDQ (is *dispatched* the correct term here?) but, of course, never issued (again, is *issued* correct?) and just retired. I've not measured if a long `nop` can be handled by the simple decoder, but suppose it can. Is it correct to assume that in this case filling a 16-byte block with simple `nop`s would bottleneck the front-end (4–6 uops per cycle) before reaching the end of the block, while using a few long `nop`s wouldn't? – Margaret Bloom Sep 22 '17 at 09:09
  • @MargaretBloom: In Intel terminology, there's no special word for adding uops to the IDQ. *dispatched*: scheduler -> execution unit. *issued*: added to out-of-order core from the front-end. (Some computer-architecture people use "issued" to describe uops / instructions being sent to execution units; what Intel calls "dispatch".) – Peter Cordes Sep 22 '17 at 09:13
  • @MargaretBloom: Yes, filling a 16B block with short `nop`s would bottleneck the front-end. So you never want to do that for blocks that will ever be executed. If the front-end is decoding `nop`s between functions, it's not doing anything useful, so it doesn't really matter what it does, as long as it doesn't evict valuable uop cache lines in cases like a function ending with a `jmp [rax]`. (Unless it's useful to have it fetch the beginning of the next function... But more likely you don't want to pollute the I-cache with the next function.) – Peter Cordes Sep 22 '17 at 09:17
  • @MargaretBloom: I think a `nop` after a `ret` will be cancelled before it's even added to the IDQ, because at that point the decoders know there's an unconditional branch (`ret` is an indirect jmp, but I doubt the default prediction is next-instruction). And if branch-prediction is working well, the instructions linearly after the `ret` aren't sent to the decoder at all, because it predicted that there's a taken-branch there. The queue of instruction-bytes that's sent to the decoders might have the next instruction in the caller right after the `ret`, decoded in the same cycle. (IDK). – Peter Cordes Sep 22 '17 at 09:21
  • 1
    @MargaretBloom: BTW, long `nop` is a single uop, so yes it can be handled by any decoder. As long as you avoid using too many prefixes (which AMD decoders choke on, and Atom / Silvermont / early P6), AMD and Intel CPUs can blow through them with no bottlenecks other than the usual instruction-length / 16B block stuff. They have to issue/retire, but use no execution unit. Exactly as cheap as xor-zeroing on Sandybridge-family, or an eliminated `mov`. (If NOPs were common, CPUs could maybe avoid spending a ROB entry on each one, but usually just don't put `nop` inside hot loops :P) – Peter Cordes Sep 22 '17 at 09:34
  • I really like the first part: that the padding doesn't necessarily happen only at function boundaries or in other places where it's 100% sure control will never reach. When potentially executed, it's certainly better to pack in as few instructions as possible. – The Vee Sep 22 '17 at 11:24
  • @TheVee: If it mattered what was put between functions, gcc could easily put something different. It knows whether it's inside a function or not when emitting assembler directives and instructions. It's not a hard problem, it's just that there's no reason *not* to use a different NOP strategy outside functions. (Maybe you didn't realize that gcc did sometimes pad with NOP inside functions?) – Peter Cordes Sep 22 '17 at 11:44
  • Indeed, I did not realize the latter. I meant that with this information the long instructions are clearly advantageous, and it makes a lot of sense to use the same mechanism outside the functions as well where the choice does not matter. – The Vee Sep 22 '17 at 11:56