Why and where is align 16 used for SSE instruction alignment?

Question

I am reading the Modern x86 Assembly Language book from Apress. In the 64-bit SSE programming examples, the author places align 16 at a particular point in the code. For example:

    .code
    ImageUint8ToFloat_ proc frame
    _CreateFrame U2F_,0,64               ; helper macros to create prolog
    _SaveXmmRegs xmm10,xmm11,xmm12,xmm13 ; helper macros to create prolog

    _EndProlog  ; helper macros to create prolog

    ...

    shrd r8d,
    pxor xmm5,xmm5

    align 16  ; Why is this here?
    @@:
    movdqa xmm0,xmmword ptr [rdx]
    movdqa xmm10,xmmword ptr [rdx+16]

    movdqa xmm2,xmm0
    punpcklbw xmm0,xmm5
    punpckhbw xmm2,xmm5
    movdqa xmm1,xmm0
    movdqa xmm3,xmm2

    ...

The author explains that align 16 is necessary because we are using SSE, so that the instructions themselves are aligned. That's fine. My question is why the author chose to put align 16 at that particular location. As a programmer, how should I decide on the correct location for align 16? Why not earlier or later?

It doesn't actually make much sense - SSE *data* typically needs to be 16 byte aligned, but instructions don't. — Paul R, Sep 21 '16 at 10:46
Perhaps you are right. I might have misread. I reread the section and it says: "Align branch targets in performance-critical loops to 16-byte boundaries." I guess that's the real reason, not SSE. It's a jump target. — Onur Gumus, Sep 21 '16 at 10:54

score 5 · Accepted Answer · answered Sep 21 '16 at 10:50

It isn't necessary. It is occasionally beneficial.

Modern processors fetch code in aligned blocks of 16 bytes (or maybe 32, sort of; AMD does weird things). If you jump to an address near the end of such a block, you waste most of that fetch, and in that cycle you decode only 1 or maybe 0 instructions. That's a giant waste, so it's better to jump to the start of a block.
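To make that cost concrete, here is a minimal sketch in C (not taken from the book or from this answer) that computes how many bytes of an aligned fetch block are still usable after jumping to a given address. The 16-byte block size, the name `usable_fetch_bytes`, and the sample addresses are illustrative assumptions; actual fetch widths differ between microarchitectures.

```c
#include <stdint.h>

/* Assumed fetch-block size in bytes. 16 matches the description above;
   it is an illustrative constant, not a value queried from the CPU. */
#define FETCH_BLOCK 16u

/* Hypothetical helper: number of bytes from the branch target to the
   end of the aligned fetch block that contains it. */
unsigned usable_fetch_bytes(uint64_t target)
{
    unsigned offset = (unsigned)(target & (FETCH_BLOCK - 1u));
    return FETCH_BLOCK - offset;
}
```

Under these assumptions, a target at 0x100E leaves only 2 usable bytes in that fetch, while a target at 0x1000 leaves all 16.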

That doesn't always matter, for example if the code is in the loop buffer or the µop cache (if it exists). Typically just about any loop fits in the µop cache, but on processors older than Sandy Bridge it was fairly easy to make a loop that didn't fit in the loop buffer, making fetch throughput important. Even when a loop could fit in the loop buffer, alignment still helped on Core2, because misalignment effectively makes the loop buffer smaller there (it is based on the 16-byte blocks of code, cached after predecoding). There are some more weird details, but it's all about ancient µarchs so I'll skip it. The point is, on µarchs like Nehalem and older, you should often align loops.

Though it's not super clear from the fragment, it looks like they've aligned a label to which it will loop back. So it's aligning the loop. It's not important on modern µarchs.
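What `align 16` does in front of the loop label can be sketched as simple address arithmetic: the assembler inserts padding (typically NOP instructions) until the next address is a multiple of 16, so the label lands at the start of a block. The helper names and sample addresses below are hypothetical and only illustrate the rounding; they are not an assembler API.

```c
#include <stdint.h>

/* Round addr up to the next multiple of alignment.
   Assumes alignment is a power of two, as with align 16. */
uint64_t align_up(uint64_t addr, uint64_t alignment)
{
    return (addr + alignment - 1u) & ~(alignment - 1u);
}

/* Number of padding bytes needed in front of a label at addr
   so that the label ends up on an aligned boundary. */
unsigned padding_for_align(uint64_t addr, uint64_t alignment)
{
    return (unsigned)(align_up(addr, alignment) - addr);
}
```

For example, if the instruction before the label ends at 0x1009, 7 bytes of padding are needed and the label lands at 0x1010; if it already ends at 0x1010, no padding is needed.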

Yeah, I think I misread it. It is exactly for what you stated: to improve jump performance. The author says: "Align branch targets in performance-critical loops to 16-byte boundaries." — Onur Gumus, Sep 21 '16 at 10:58
SnB-family uop-cache works on 32B chunks of machine code. A 32B x86 code boundary always ends the uop cache line. (And a 32B block of x86 code can have up to three uop cache lines, each of which can hold up to 6 uops in ideal circumstances.) — Peter Cordes, Sep 21 '16 at 11:11
