What do 'instruction prefixes' mean in modern x86

Question

To understand why Bulldozer was subpar, I've been reading Agner Fog's excellent microarchitecture book. On page 178, in the Bulldozer section, it has this paragraph:

Instructions with up to three prefixes can be decoded in one clock cycle. There is a very large penalty for instructions with more than three prefixes. Instructions with 4-7 prefixes take 14-15 clock cycles extra to decode. Instructions with 8-11 prefixes take 20-22 clock cycles extra, and instructions with 12-14 prefixes take 27 - 28 clock cycles extra. It is therefore not recommended to make NOP instructions longer with more than three prefixes. The prefix count for this rule includes operand size, address size, segment, repeat, lock, REX and XOP prefixes. A three-bytes VEX prefix counts as one, while a two-bytes VEX prefix does not count. Escape codes (0F, 0F38, 0F3A) do not count.

When I searched for prefixes, I was hit with very technical definitions far beyond my abilities, or with suggestions that they are limited to 4 per instruction, which conflicts with the extract above.

So, in simple terms, can someone explain what they are and what they do, and why you might want to tack 14 or more of them onto an instruction instead of breaking it up?

Very good question. I'd like to read the answers of Peter Cordes and the other experts here. — zx485, Feb 13 '16 at 12:38
@zx485: like everyone is saying, usually you'd only see a large number of prefixes when making a long `NOP`. One `NOP` takes the same time to execute regardless of length, other than code-size side-effects and frontend issues. (As Agner Fog's guide explains). You definitely don't want 14 NOPs wasting space in the uop cache on a CPU that uses a uop-cache. Other than that, well, the x32 ABI often uses address-size prefixes (so base+index*scale addressing modes don't accidentally go outside the 32bit address range). so `lock inc word [edi + r10d*4]` would need 4: lock op-sz addr-sz REX. — Peter Cordes, Feb 13 '16 at 17:58
IIRC, Atom and Silvermont have a similar decoder limitation, but the prefix and escape bytes that are part of the encoding of SIMD instructions count. So they can bottleneck horribly on SSSE3 and later instructions with REX prefixes. — Peter Cordes, Feb 13 '16 at 18:41

score 17 · Accepted Answer · answered Feb 13 '16 at 12:50

Normally you use as many as needed, and the intended instruction and its operands determine that. The assembler emits some of the prefixes automatically, while others you add manually.

The case they mention is the multi-byte NOP, which is traditionally used for alignment padding, where the idea is to use a single but appropriately long instruction to conserve resources. It turns out that adding more prefixes just to keep it a single instruction may perform worse than using two instructions with fewer prefixes.
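To put rough numbers on that, here is a minimal Python sketch that uses only the figures from the Agner Fog paragraph quoted in the question. The 9-byte padding scenario and the names `extra_decode_cycles` and `total_penalty` are illustrative assumptions; the model ignores everything except the quoted prefix penalty.

```python
# Extra decode cycles on Bulldozer by prefix count, as (min, max),
# copied from the Agner Fog paragraph quoted in the question.
PENALTY_RANGES = [
    (range(0, 4), (0, 0)),
    (range(4, 8), (14, 15)),
    (range(8, 12), (20, 22)),
    (range(12, 15), (27, 28)),
]


def extra_decode_cycles(prefix_count):
    """Return (min, max) extra decode cycles for one instruction."""
    for counts, penalty in PENALTY_RANGES:
        if prefix_count in counts:
            return penalty
    # The quoted text gives no figure outside 0-14 prefixes.
    raise ValueError(f"no figure quoted for {prefix_count} prefixes")


def total_penalty(prefix_counts):
    """Sum the (min, max) penalties over a sequence of instructions."""
    lows, highs = zip(*(extra_decode_cycles(n) for n in prefix_counts))
    return sum(lows), sum(highs)


# Hypothetical 9 bytes of padding, built two ways:
one_long_nop = [8]            # 66 x8, then 90: one NOP with 8 prefixes
three_short_nops = [2, 2, 2]  # 66 66 90, three times

print(total_penalty(one_long_nop))
print(total_penalty(three_short_nops))
```

Under the quoted figures, the single long NOP costs 20-22 extra decode cycles and the three short ones cost none. Whether the split is faster overall also depends on effects this sketch does not model.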

The prefix count for this rule includes operand size, address size, segment, repeat, lock, REX and XOP prefixes. A three-bytes VEX prefix counts as one, while a two-bytes VEX prefix does not count.

Examples:

operand size: can switch between 32 and 16 bit registers, e.g. `mov ax, [foo]` is encoded the same as `mov eax, [foo]` but with the prefix 66h
address size: can switch between 32/16 or 64/32 bit address sizes, e.g. `mov [eax], foo` is encoded the same as `mov [rax], foo` but with the prefix 67h (in 64 bit mode)
segment: can override the segment used, e.g. `mov [fs:eax], foo` is encoded the same as `mov [eax], foo` but with the prefix 64h
repeat: used with string instructions for repeating, e.g. `rep cmpsb` is encoded the same as `cmpsb` but with the prefix F3h
lock: used with certain instructions to make them atomic, e.g. `lock add [foo], 1` is encoded the same as `add [foo], 1` but with the prefix F0h
REX.W: used to switch to 64 bit operand size, e.g. `add rax, 1` is encoded the same as `add eax, 1` but with the prefix 48h
REX.R, REX.X, REX.B: used as extensions of the ModR/M (and SIB) register fields to access extra registers, e.g. `add r8d, 1` is encoded the same as `add eax, 1` but with the prefix 41h
XOP, VEX: used with vector instruction subsets
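The "same encoding plus one prefix byte" pattern can be checked with a small Python sketch. The byte strings below are 64-bit mode encodings I am confident of, but the operands (registers instead of `foo`) and the helper name `with_prefix` are my own choices for illustration, and an assembler may pick a different, equally valid encoding.

```python
# (prefixed form, base form, prefix byte, base encoding in hex)
# 64-bit mode; operands are illustrative stand-ins for the answer's `foo`.
EXAMPLES = [
    ("add ax, 1",               "add eax, 1",         0x66, "83 C0 01"),  # operand size
    ("mov [eax], ecx",          "mov [rax], ecx",     0x67, "89 08"),     # address size
    ("mov [fs:rax], ecx",       "mov [rax], ecx",     0x64, "89 08"),     # segment
    ("rep cmpsb",               "cmpsb",              0xF3, "A6"),        # repeat
    ("lock add dword [rax], 1", "add dword [rax], 1", 0xF0, "83 00 01"),  # lock
    ("add rax, 1",              "add eax, 1",         0x48, "83 C0 01"),  # REX.W
    ("add r8d, 1",              "add eax, 1",         0x41, "83 C0 01"),  # REX.B
]


def with_prefix(prefix, base_hex):
    """Prepend one prefix byte to an existing encoding."""
    return bytes([prefix]) + bytes.fromhex(base_hex)


for prefixed, base, prefix, base_hex in EXAMPLES:
    encoded = with_prefix(prefix, base_hex)
    print(f"{base:<20} {base_hex:<10} {prefixed:<25} {encoded.hex(' ').upper()}")
```

Each row prints the base instruction with its bytes, then the prefixed instruction with the same bytes preceded by the single prefix byte.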

score 8 · Answer 2 · answered Feb 13 '16 at 13:03

The "four prefixes" deal comes from the "prefix groups":

1. lock/rep/repne
2. segment override
3. operand size override
4. address size override

You can repeat a prefix, but you should not use several different prefixes from the same group: it can be encoded, but the behaviour is undefined. That restriction only matters for groups 1 and 2, since the other groups contain only one prefix each.

Something like 66 66 66 66 66 66 66 66 90 is valid (but potentially slow to decode). 2E 3E 00 00 (mixing segment overrides) is not.
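As a rough illustration of the group rule, here is a Python sketch that scans the leading legacy prefixes of a byte sequence and reports any group from which more than one different prefix is used. The function names are hypothetical, and the sketch only looks at legacy prefixes (it does not handle REX, VEX or XOP, and it is not a decoder).

```python
# Legacy prefix groups, keyed by group number.
PREFIX_GROUPS = {
    1: {0xF0, 0xF2, 0xF3},                    # lock, repne, rep
    2: {0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65},  # cs, ss, ds, es, fs, gs overrides
    3: {0x66},                                # operand size override
    4: {0x67},                                # address size override
}


def leading_prefixes(code):
    """Return the legacy prefix bytes at the start of `code`."""
    all_prefixes = set().union(*PREFIX_GROUPS.values())
    prefixes = []
    for byte in code:
        if byte not in all_prefixes:
            break  # first non-prefix byte: the opcode starts here
        prefixes.append(byte)
    return prefixes


def conflicting_groups(code):
    """Return the groups from which several different prefixes are used."""
    used = set(leading_prefixes(code))  # repeated prefixes collapse here
    return [g for g, members in PREFIX_GROUPS.items() if len(used & members) > 1]


repeated = bytes.fromhex("66" * 8 + "90")  # eight operand-size prefixes + nop
mixed = bytes.fromhex("2E3E0000")          # cs and ds overrides together

print(len(leading_prefixes(repeated)), conflicting_groups(repeated))
print(len(leading_prefixes(mixed)), conflicting_groups(mixed))
```

The first sequence has eight prefixes but no conflict, because repeats of one prefix are allowed. The second has only two prefixes, but both come from group 2.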

Stacking prefixes can be useful for code alignment when the padding bytes have to be executed: unlike padding with separate `nop` instructions, it doesn't cost execution time. Using too many at once may cost decoding time.

harold


The question asserts: `instructions with 12-14 prefixes take 27 - 28 clock cycles extra`. How would you encode an OpCode with 12-14 prefixes which are not redundant? I'm just curious for an example. – zx485 Feb 13 '16 at 13:07
@zx485 you can't. The "4 prefixes" rule implies that only up to 4 prefixes can be non-redundant, any more than that have to include redundant prefixes. And in any case there are only 11 different legacy prefixes so even with invalid combinations you can't have 12 non-redundant prefixes (unless you add a newfangled REX or VEX but they have different rules). – harold Feb 13 '16 at 13:14
Thanks for your answer. Of course, this would imply that any (sane/non-redundant) instruction/OpCode with up to three prefixes will have no penalty above 1 cycle when being decoded. That's great news :-) Applying all four prefixes should be exceptionally rare, so that is acceptable. – zx485 Feb 13 '16 at 13:22
@zx485 of course, and in 32bit and 64bit code having legacy prefixes at all is fairly rare already, making it even less of a concern – harold Feb 13 '16 at 13:25
It's easy to imagine some Linux x32 code using `lock inc word [edi + r10d*4]`. gcc often uses address-size prefixes for x32, when it isn't sure there's no garbage in high bits of registers. Or when address-math inside an effective-address could produce a different result when not truncated to 32b. The frontend bubble on Bulldozer from doing this might not be a factor because of the `lock`. A similar instruction with a `fs` segment override (for thread-local storage) might be more of an issue, since it could use 4 prefixes without a `lock`, and thus not be super-slow. – Peter Cordes Feb 13 '16 at 18:38
