Using the operand-size override prefix 0x66 for instruction alignment

Question

Recently I came across the legacy 0x66 operand-size override prefix.
Could it be used to align instructions without explicitly writing a single or multi-byte NOP instruction?

For example, adding the align 16 directive:

int   3
mov   rax,1        
align 16
add   rcx,rax

generates this disassembly:

...1000 cc               int     3
...1001 48c7c001000000   mov     rax,1
...1008 0f1f840000000000 nop     dword ptr [rax+rax]  ; <--- multi-byte NOP instruction
...1010 4803c8           add     rcx,rax              ; <--- 16-byte aligned

Removing align 16 and instead prepending eight 0x66 prefix bytes (which I expected to be ignored) to mov rax, 1:

int   3
db    8 DUP (66h)
mov   rax,1  
add   rcx,rax

generates this disassembly:

...1000 cc                             int     3
...1001 666666666666666648c7c001000000 mov     rax,1
...1010 4803c8                         add     rcx,rax  ; <--- 16-byte aligned

Is the 0x66 alignment technique valid and faster than using align 16?

UPDATE

As suggested, it works using the 0x2E CS segment override prefix. Tested with NASM:

nop
CSbuf: times 8 db 2Eh
mov rax,strict dword 1
add rcx,rax

and the add rcx,rax was 16-byte aligned:

00007ff7`f78d1010 4801c1          add     rcx,rax

Built using these commands:

nasm -fwin64 test.asm
link.exe /subsystem:console /machine:x64 /defaultlib:kernel32.lib /defaultlib:user32.lib /defaultlib:libcmt.lib /entry:main test.obj

IIRC Combining `66` (selecting 16 bit operation) with `REX.W` (selecting 64 bit operation) results in undefined behaviour. Try using a segment override prefix instead? — fuz, Dec 29 '22 at 18:38
[Multi-byte nops](https://stackoverflow.com/questions/25545470/long-multi-byte-nops-commonly-understood-macros-or-other-notation) can be used. I don't know the effect on performance. — rcgldr, Dec 29 '22 at 19:17
`0x66` is not a good example because it would change the behavior of most instructions. For instance, if you had instead coded `mov eax, 1` (`c7 c0 01 00 00 00`), which you normally would, since it's equivalent to and shorter than `mov rax, 1`, then prepending `0x66` to get `66 c7 c0 01 00 00 00` decodes to `mov ax, 1` followed by `add [rax], al`, which is not what you want at all. — Nate Eldredge, Dec 29 '22 at 20:23
Even if you can arrange to end up with equivalent behavior, `0x66` is a [length-changing prefix](https://stackoverflow.com/questions/65530097/does-a-length-changing-prefix-lcp-incur-a-stall-on-a-simple-x86-64-instruction) on many instructions, such as those which contain immediates, and therefore incurs a performance penalty in decoding. — Nate Eldredge, Dec 29 '22 at 20:27
Near duplicate: [What methods can be used to efficiently extend instruction length on modern x86?](https://stackoverflow.com/q/48046814) is the same basic idea, but not using prefixes like `66h` that could cause LCP pre-decode stalls and even change meaning of some instructions. Yes, it's faster to avoid any actual NOPs if you do it right, that's why Intel recommends lengthening other instructions in inner loops [when working around the performance pothole](https://stackoverflow.com/q/61256646) introduced by their microcode update for the JCC erratum. — Peter Cordes, Dec 30 '22 at 00:41
@NateEldredge: Fun fact: Sandybridge-family doesn't LCP-stall on `mov`. I just tried this, with NASM `times 6 db 0x66` / `mov rax, strict dword 1` inside a `%rep 8000` block (to defeat the uop cache), and it ran at 1.2IPC on Skylake, with no counts for `ild_stall.lcp` in `perf stat`. Even with `add rax, strict qword 1` (to force the `add rax, sign_extended_imm32` no-modrm encoding), Skylake doesn't LCP stall, running at 1.0 IPC (bottleneck on latency). I wouldn't recommend a `66h` prefix, but it happens not to break correctness (with a REX.W) or performance on Skylake. — Peter Cordes, Dec 30 '22 at 00:52

score 3 · Accepted Answer · answered Dec 30 '22 at 07:06

The basic idea is a good one: padding earlier instructions using prefixes is a cheaper way to align than using even one multi-byte NOP. (A NOP takes a slot in the decoders, in the uop cache, and in the issue/rename stage, which is the narrowest part of the pipeline. It also needs a ROB entry to track it until retirement.)

That's why Intel recommends lengthening other instructions in inner loops when working around the performance pothole introduced by their microcode update for the JCC erratum.

Assemblers always should have been doing this for align directives (or with some other mechanism to specify which instructions to pad), but nobody's been motivated enough until Intel's JCC erratum with an official recommendation to do it this way. (Because unlike aligning tops of loops, the padding may have to be inside an inner loop, where it would cost front-end bandwidth every iteration if it was a NOP instead of part of other instructions.) Unfortunately this JCC-erratum mitigation has mostly been added as a separate feature by assemblers (e.g. GAS's -mbranches-within-32B-boundaries) or left to compilers, not providing a general way to avoid wasting instructions on NOPs for alignment of other points.

Your specific choices aren't ideal

See What methods can be used to efficiently extend instruction length on modern x86? (https://stackoverflow.com/q/48046814): more than 3 prefixes on one instruction can create big slowdowns on some CPUs, including Silvermont-family cores such as the E-cores in Alder Lake. So spread the padding over multiple instructions in the basic block before the position you want aligned. Prefer padding later instructions so the front-end can get more instructions decoded earlier, unless you have a lot of very short instructions (like 1 or 2 bytes each), short enough that a 16-byte fetch block could still include more than 5 or 6. (This is in anticipation of future CPUs with wide legacy decode, if there aren't any already.)
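The spreading rule above can be sketched as a small planning helper. This is only an illustration in Python, not something an assembler provides: the function names, the four-instruction block, and the choice to count "prefixes already present" as a caller-supplied number are all assumptions made for the example. It computes how many bytes are needed to reach the next 16-byte boundary, then hands them out from the last instruction backwards, never exceeding 3 prefixes on any one instruction.

```python
def padding_needed(address, alignment=16):
    """Bytes required to move `address` up to the next multiple of `alignment`."""
    return -address % alignment


def spread_padding(existing_prefixes, pad_bytes, max_prefixes=3):
    """Decide how many padding prefixes each instruction in a block receives.

    existing_prefixes -- prefixes already on each instruction, in program order
                         (hypothetical input; what counts is up to the caller)
    pad_bytes         -- total padding needed before the aligned position
    max_prefixes      -- per-instruction cap (3, following the advice above)
    """
    extra = [0] * len(existing_prefixes)
    remaining = pad_bytes
    # Walk backwards so the later instructions are padded first.
    for i in reversed(range(len(existing_prefixes))):
        if remaining == 0:
            break
        room = max(0, max_prefixes - existing_prefixes[i])
        extra[i] = min(room, remaining)
        remaining -= extra[i]
    if remaining:
        raise ValueError(
            f"{remaining} byte(s) left over; pad more instructions or use a NOP"
        )
    return extra


# The question's layout: the instruction after `mov rax, 1` would start at ...1008.
pad = padding_needed(0x1008)
# Assumed block of four instructions, none of which has a prefix yet.
print(pad, spread_padding([0, 0, 0, 0], pad))
```

For the question's addresses this prints `8 [0, 2, 3, 3]`: eight bytes are needed, and they land on the last three instructions of the assumed block rather than all on one.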

Using 7-byte mov rax, 1 in an assembler that doesn't optimize it to 5-byte mov eax, 1 is already one way to fill some bytes; 10-byte mov rax, strict qword 1 (NASM syntax; IDK if MASM can force an imm64) is another way to use more bytes. On Sandybridge-family, a 64-bit immediate fits efficiently in the uop cache (1 entry without needing an extra cycle to read it) when the 64-bit value isn't huge, i.e. is just the sign-extension of a 32-bit value. (https://agner.org/optimize/microarchitecture.pdf - Sandybridge chapter)
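To make the three lengths concrete, here is a rough Python sketch that builds the raw bytes of each form for RAX/EAX only. The helper names are made up for this example; the opcodes are the standard `B8+rd`, `REX.W C7 /0` and `REX.W B8+rd` encodings of mov. Note that the 5-byte form is only equivalent when the value fits in an unsigned 32 bits, because writing EAX zero-extends into RAX.

```python
import struct


def mov_eax_imm32(imm):
    # B8+rd id: mov r32, imm32 (rd = 0 selects EAX); zero-extends into RAX
    return bytes([0xB8]) + struct.pack("<I", imm)


def mov_rax_simm32(imm):
    # REX.W C7 /0 id: mov r/m64, sign-extended imm32 (ModRM C0 = register RAX)
    return bytes([0x48, 0xC7, 0xC0]) + struct.pack("<i", imm)


def mov_rax_imm64(imm):
    # REX.W B8+rd io: mov r64, imm64
    return bytes([0x48, 0xB8]) + struct.pack("<Q", imm)


for name, enc in [
    ("mov eax, 1", mov_eax_imm32(1)),
    ("mov rax, strict dword 1", mov_rax_simm32(1)),
    ("mov rax, strict qword 1", mov_rax_imm64(1)),
]:
    print(f"{name:24} {len(enc):2d} bytes  {enc.hex()}")
```

The middle form comes out as `48c7c001000000`, the same bytes shown in the question's disassembly.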

ds or cs prefixes are a good choice, as they have meaning for most opcodes, so it's unlikely that cs mov eax, 1 would be repurposed as the encoding for some different instruction. (e.g. the way rep bsr is the encoding for lzcnt, which does something different.) Repurposing is not impossible, though, especially for the no-modrm mov-to-register opcodes (mov r32,imm32 or mov r64,imm64, unlike the mov r/m64, sign_extended_imm32 you're using. https://www.felixcloutier.com/x86/mov)

I wouldn't recommend using prefixes like 66h that could potentially cause LCP pre-decode stalls on Intel CPUs, and even change meaning of many instructions. (Without REX.W setting the operand-size to 64-bit, it would change the meaning for mov eax, 0x00000001 to mov ax, 0x0001 with a 00 00 left over which decodes as add [rax], al.)
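The length change described above can be modelled with a toy length decoder. This is a minimal sketch that understands only the register-direct `C7 /0` mov, leading 66h prefixes, and an optional REX byte; the function name is invented for the example. It assumes that REX.W overrides 66h, which is the behaviour reported for Skylake in this answer rather than a guarantee for every CPU or emulator.

```python
def mov_c7_length(code):
    """Length in bytes of a register-direct `C7 /0` mov at the start of `code`.

    Handles only 66h prefixes, an optional REX prefix, and ModRM with mod=11.
    Anything else raises ValueError.
    """
    i = 0
    has_66 = False
    while code[i] == 0x66:          # operand-size override prefix(es)
        has_66 = True
        i += 1
    rex_w = False
    if 0x40 <= code[i] <= 0x4F:     # REX prefix in 64-bit mode
        rex_w = bool(code[i] & 0x08)
        i += 1
    if code[i] != 0xC7:
        raise ValueError("only opcode C7 is modelled")
    modrm = code[i + 1]
    if modrm >> 6 != 0b11 or (modrm >> 3) & 7 != 0:
        raise ValueError("only register-direct C7 /0 is modelled")
    i += 2
    # Assumption: REX.W wins over 66h; otherwise 66h shrinks imm32 to imm16.
    imm_size = 2 if (has_66 and not rex_w) else 4
    return i + imm_size


plain = bytes.fromhex("c7c001000000")            # mov eax, 1
with_66 = bytes.fromhex("66c7c001000000")        # 66h changes the immediate size
padded = bytes.fromhex("66" * 8 + "48c7c001000000")  # the question's encoding

n = mov_c7_length(with_66)
print(mov_c7_length(plain), n, with_66[n:].hex(), mov_c7_length(padded))
```

With a single 66h and no REX.W, the instruction ends two bytes early and `00 00` is left over to decode as add [rax], al. With REX.W present, the model treats all 15 bytes of the question's sequence as one instruction, which is why the following add lands at ...1010.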

I'm not 100% sure it's well defined on paper what's supposed to happen with both a 66h and a REX.W prefix. (@fuz questioned this in comments.) A 67h address-size prefix is generally fine in 64-bit mode, and so is a segment override prefix.

In practice on Skylake, the REX.W prefix wins and the 66h is ignored, and doesn't even cause false LCP stalls. But I wouldn't count on that on P6-family, and if it's not documented on paper what should happen with both 66h and REX.W, I'd worry about other vendors, or especially emulators and dynamic-translation software for that corner case.

Fun fact: Sandybridge-family doesn't LCP-stall on mov in general, but does on other instructions when a 66h prefix changes an opcode from having an imm32 to an imm16.

I just tried this with NASM: times 6 db 0x66 / mov rax, strict dword 1 (to match your encoding; NASM normally optimizes it to the architecturally equivalent mov eax,1). I put that inside a %rep 8000 block (to defeat the uop cache). It ran at 1.2 IPC on Skylake, with no counts for ild_stall.lcp in perf stat.

Even with add rax, strict qword 1 (to force the add rax, sign_extended_imm32 no-modrm encoding), Skylake doesn't LCP stall, running at 1.0 IPC (bottleneck on latency). Same for add rcx, strict qword 1 for the imm32 encoding with a ModRM.

I wouldn't recommend a 66h prefix, but it happens not to break correctness (with a REX.W) or performance on Skylake. I didn't test on any other CPUs or emulators, and I'm not claiming this use of 66h is safe anywhere else. (Although it probably is on earlier Intel CPUs, at least for correctness if not performance.)
