I have a question about jumping in assembly. Is it necessary for a function label to be aligned at an 8-BYTE boundary when we want to jump to it?
For example:

func:
        jmp     .ret1
 .ret1: mov     eax, 1
        ret
 .ret2: mov     eax, 2
        ret

Here I have a function with 2 labels and I just jumped to the first one. Is it necessary for each label to be aligned at an 8-BYTE boundary?
Is the following necessary:

func:
        jmp     .ret1
   ALIGN 8
 .ret1: mov     eax, 1
        ret
   ALIGN 8
 .ret2: mov     eax, 2
        ret

What about functions? Must each function be aligned at an 8-BYTE boundary, or is that not important?

Sep Roland
HelloGUI
  • 3
    Does this answer your question? [How much does function alignment actually matter on modern processors?](https://stackoverflow.com/questions/22235236/how-much-does-function-alignment-actually-matter-on-modern-processors) – teapot418 May 20 '23 at 19:50
  • 3
    Necessary, no. Advisable, ... eh, somewhat. But this is a micro-optimization: unless you are dealing with a really hot code path, you're probably not going to notice a difference. – teapot418 May 20 '23 at 19:52
  • Also related: [Why does loop alignment on 32 byte make code faster?](https://stackoverflow.com/q/45298870) – Peter Cordes May 22 '23 at 01:20
  • It's not required on x86 at least. Other architectures have their own alignment rules. – puppydrum64 Jun 29 '23 at 17:42

1 Answer

There is no such requirement. However, aligning function entry points and other jump targets to a 16-byte boundary may improve performance when the code is not in the µop cache. The effect is fairly minor, though. When in doubt, benchmark.
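
To make the arithmetic behind alignment concrete, here is a minimal C sketch. It is not from the answer, and the helper names `padding_to_align` and `crosses_boundary` are made up for illustration. It assumes that `func` starts at an aligned offset 0, that the assembler picks the 2-byte short form of `jmp`, and that each `mov eax, imm32` / `ret` block is 5 + 1 = 6 bytes.

```c
#include <stddef.h>

/* Bytes of padding needed to move `offset` up to the next multiple of
   `align`. `align` must be a power of two (8, 16, 32, ...). */
size_t padding_to_align(size_t offset, size_t align) {
    return (align - (offset & (align - 1))) & (align - 1);
}

/* Returns 1 if a block of `size` bytes starting at `offset` spans an
   aligned `align`-byte boundary, 0 otherwise. */
int crosses_boundary(size_t offset, size_t size, size_t align) {
    if (size == 0) {
        return 0;
    }
    /* Compare the aligned chunk holding the first byte with the one
       holding the last byte. */
    return (offset / align) != ((offset + size - 1) / align);
}
```

Under those assumptions, `.ret1` sits at offset 2 in the unaligned version, so `padding_to_align(2, 8)` gives 6 bytes of padding for `ALIGN 8`. The unaligned 6-byte block occupies offsets 2 to 7, and `crosses_boundary(2, 6, 16)` gives 0, so it does not span a 16-byte boundary in the first place.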

fuz
  • But if I have over 50 labels (for jumping) and I align each of them to 16 bytes, the code size will be huge (L1 cache), right? – HelloGUI May 20 '23 at 20:36
  • 1
    @HelloGUI: Right, so don't do that. Use a lookup table of data if you can, instead of jumping to `mov eax, imm32` / `ret` blocks. If that's not possible, if your code block is short and ends before the end of an aligned 16-byte block, there's probably no benefit at all to having it start at the beginning of a 16-byte block. And even if some of your blocks do span across aligned 16-byte boundaries, it's probably better than more L1i misses from a larger footprint, especially if all entries are frequently jumped to. If not, only the ones actually used will stay hot in L1i. – Peter Cordes May 20 '23 at 21:13
  • @HelloGUI If it's code that is related or jumped to in quick succession, aligning the jump targets is likely not to make a difference due to the µop cache. – fuz May 20 '23 at 21:45
  • @PeterCordes I need to make an example with a real function ... should I delete this question and make a new one or just put source-code here with an update? – HelloGUI May 21 '23 at 05:31
  • 1
    @HelloGUI: You actually can't delete this one since it has an upvoted answer. If you want advice about the tradeoffs in micro-optimizing a specific function (for some specific CPUs, like Zen 3 and later, and/or Ice Lake / Alder Lake P and E cores), then yeah, ask that. Make sure to include some details about how often it's called and how much I-cache and uop-cache pressure there is between calls, and how frequent the different cases are. (And presumably you care about throughput on average, although real-time code might care more about making the worst case less bad, not the common case.) – Peter Cordes May 21 '23 at 05:48
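
The lookup-table suggestion and the footprint concern from the comments above can be sketched in C. This is only an illustration under stated assumptions: the names `result_table`, `lookup_result` and `aligned_footprint` are hypothetical, the table holds just the two values from the question, and the -1 returned for an out-of-range index is an arbitrary choice for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result table: entry i holds the value that the i-th
   `mov eax, imm32` / `ret` block would have returned. */
static const int32_t result_table[] = { 1, 2 };

#define RESULT_COUNT (sizeof result_table / sizeof result_table[0])

/* Replaces "jump to a block that loads a constant" with a data load.
   Returns -1 for an out-of-range index (an assumption of this sketch). */
int32_t lookup_result(size_t index) {
    return index < RESULT_COUNT ? result_table[index] : -1;
}

/* Code bytes occupied by `count` blocks of `block_size` bytes when every
   block starts on an `align`-byte boundary. `align` must be a power of
   two. Padding after the last block is counted too, so this is an upper
   estimate. */
size_t aligned_footprint(size_t count, size_t block_size, size_t align) {
    size_t slot = (block_size + align - 1) & ~(align - 1);
    return count * slot;
}
```

For 50 six-byte blocks, `aligned_footprint(50, 6, 1)` gives 300 bytes when packed and `aligned_footprint(50, 6, 16)` gives 800 bytes when each block is aligned to 16, while a table of 50 `int32_t` entries needs 200 bytes of data. Whether that difference matters in practice depends on the surrounding code, so it is worth benchmarking.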