repz ret: why all the hassle?

Question

The issue of the repz ret has been covered here [1] as well as in other sources [2, 3] quite satisfactorily. However, in none of these sources did I find answers to the following:

What is the actual penalty in a quantitative comparison with ret or nop; ret? Especially in the latter case – is decoding one extra instruction (and an empty one at that!) really relevant, when most functions either have 100+ of those or get inlined?
Why did this never get fixed in AMD K8, and even made its way into K10? Since when is documenting an ugly workaround based on a behaviour that is and stays undocumented preferred to actually fixing the issue, when every detail of the cause is known?

Thanks for the anonymous downvote, it really helps clarify this issue. — The Vee, Oct 04 '16 at 23:38
It apparently helps prevent branch mispredictions, which is a pretty significant penalty as these things go, but the actual penalty will vary depending on the circumstances. I'm not sure why you would call the workaround a hassle or ugly; as workarounds go, it couldn't be simpler to implement and it's not hard to understand. On the other hand, fixing the problem in hardware would mean completely redesigning the branch predictor. That wouldn't necessarily be an overall improvement, not without increasing the amount of valuable die space used to implement it. — Ross Ridge, Oct 05 '16 at 01:13
@RossRidge It's ugly because it doesn't reflect the description or purpose of the `rep` prefix. As I read in the other question and its sources, that only allows string instructions, leaving the usage with `ret` a UB. The definition was never updated to reflect (and thus officially justify) what has become a common practice. **A UB which has a known behaviour with major vendors is still a UB.** Also, because it does not take `ecx` into account in any way, although one might expect it to behave differently at least for = 0 vs. ≠ 0. `nop` would undeniably be cleaner in all of these respects. — The Vee, Oct 05 '16 at 10:22
Well, no, we're not talking about conformance with some official standard here. All x86 compatible CPUs ignore 0xF3 (REP) prefixes on non-string instructions because that's what the original 8086 did. Any CPU that doesn't do this isn't x86 compatible. This is something Intel took advantage of when they created the PAUSE instruction, which is actually REP NOP, and later when they created XACQUIRE and XRELEASE prefixes, which are actually the REP and REPNE prefixes respectively. These are all documented as backwards compatible because they're just hints and older CPUs simply ignore the "hint". — Ross Ridge, Oct 05 '16 at 16:28
@RossRidge Most interesting! So I assume it would clear up a lot of confusion if the manufacturers just declared this opcode combination an instruction in its own right – like the link between `rep nop`, which makes some sense on its own, and `pause`, which makes more and allows for further optimizations, causing no harm where it's interpreted as the former – right? It seems to me a rather close parallel. — The Vee, Oct 05 '16 at 17:07
Unfortunately Intel and AMD don't have a lot of interest in clarifying anything in this area. All the undocumented behaviour that x86 compatible CPUs have to implement creates a burden on any other potential competitors. Windows probably won't boot if the CPU doesn't ignore a REP prefix in front of a RET instruction, because of its use in `__security_check_cookie`, so this is an example of a detail a competitor would have to get right. — Ross Ridge, Oct 05 '16 at 17:46

Johan · Accepted Answer · 2016-10-05T12:14:14.590

Branch misprediction
The reason for all the hoopla is the cost of branch mispredictions.
When the CPU encounters a branch, it predicts which way the branch will go and preloads the predicted instructions into the pipeline.
If the prediction is wrong, the pipeline needs to be cleared and new instructions loaded.
This can take up to number_of_stages_in_pipeline cycles plus any cycles needed to load the data from the cache. 14 to 25 cycles per misprediction is typical.
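
To put the 14 to 25 cycle figure in proportion to the rest of a function, here is a minimal C sketch of the arithmetic. The function names and the idea of expressing the cost as a share of the call's total cycles are assumptions made for illustration only; real miss rates and body lengths have to be measured.

```c
/* Average cycles lost per call when the return is mispredicted with
   probability miss_rate and each miss costs penalty_cycles.
   (Hypothetical helper, not taken from any vendor manual.) */
double expected_penalty(double miss_rate, double penalty_cycles)
{
    return miss_rate * penalty_cycles;
}

/* Fraction of a call's total cycles that is spent on the penalty,
   given the cycles the function body needs when nothing goes wrong. */
double penalty_share(double body_cycles, double miss_rate, double penalty_cycles)
{
    double lost = expected_penalty(miss_rate, penalty_cycles);
    return lost / (body_cycles + lost);
}
```

With purely hypothetical inputs, a function whose body takes 14 cycles and whose return is always mispredicted at the low end of 14 cycles gives `penalty_share(14.0, 1.0, 14.0)` = 0.5, so half the time of each call would go to the misprediction. The longer the function body, the smaller that share becomes.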

Reason: processor design
K8 and K10 suffer from this because of a nifty optimization by AMD.
AMD K8 and K10 pre-decode instructions and keep track of their lengths in the CPU's L1 instruction cache.
To do this, the cache carries extra bits.

For every 128 bits (16 bytes) of instructions there are 76 bits of additional data stored.

The following table details this:

Data             Size       Notes
-------------------------------------------------------------------------
Instructions     128 bits   The data as read from memory
Parity bits      8 bits     One parity bit for every 16 bits
Pre-decode       52 bits    3 bits per byte (start, end, function)
                            + 4 bits per 16-byte line
Branch selectors 16 bits    2 bits for each 2 bytes of instruction code

Total            204 bits   128 instruction bits + 76 metadata bits
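
The totals can be checked from the per-byte figures in the Notes column. The following C sketch does that arithmetic; the constant and function names are made up for illustration. Working from the notes, pre-decode comes out at 3 × 16 + 4 = 52 bits, which is the value that makes the metadata add up to the stated 76 bits.

```c
/* Per-line figures copied from the Notes column of the table.
   All identifiers here are hypothetical names for this sketch. */
enum {
    LINE_BYTES              = 16, /* one 16-byte line of instructions    */
    PREDECODE_BITS_PER_BYTE = 3,  /* start, end, function                */
    PREDECODE_BITS_PER_LINE = 4,  /* extra pre-decode bits per line      */
    SELECTOR_BITS_PER_SLOT  = 2,  /* one slot covers 2 instruction bytes */
    BITS_PER_PARITY_BIT     = 16  /* one parity bit for every 16 bits    */
};

int instruction_bits(void) { return LINE_BYTES * 8; }
int parity_bits(void)      { return instruction_bits() / BITS_PER_PARITY_BIT; }
int predecode_bits(void)   { return LINE_BYTES * PREDECODE_BITS_PER_BYTE
                                    + PREDECODE_BITS_PER_LINE; }
int selector_bits(void)    { return (LINE_BYTES / 2) * SELECTOR_BITS_PER_SLOT; }

/* Everything stored next to the raw instruction bytes. */
int metadata_bits(void)
{
    return parity_bits() + predecode_bits() + selector_bits();
}
```

Here `metadata_bits()` evaluates to 8 + 52 + 16 = 76, and `instruction_bits() + metadata_bits()` to 204, matching the Total row.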

Because all this data is stored in the L1 instruction cache, the K8/K10 CPU has to do a lot less work on decode and branch prediction. This saves on silicon.
And because AMD does not have as big a transistor budget as Intel, it needs to work smarter.

However, if the code is especially tight, a jump and a RET might occupy the same two-byte slot, meaning that the RET gets predicted as NOT taken (because the jump following it is).
By making the RET occupy two bytes (REP RET), this can never occur, and the RET will always be predicted correctly.
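
As a rough illustration of the slot argument, here is a toy model in C. It is not a description of the real K8 predictor. It assumes that a branch is attributed to the two-byte slot holding its last byte, that a slot is simply the byte offset divided by two, and that the return directly follows a conditional jump (the JNE / RET case). The names `selector_slot` and `shares_slot` are hypothetical. The encodings are the standard ones: `ret` is the single byte C3, and `rep ret` is F3 C3.

```c
#include <stdbool.h>

enum {
    RET_LEN     = 1, /* ret     = C3    */
    REP_RET_LEN = 2  /* rep ret = F3 C3 */
};

/* Toy assumption: one branch-selector slot covers two consecutive bytes,
   so the byte at offset off within the line belongs to slot off / 2. */
int selector_slot(unsigned off)
{
    return (int)(off / 2);
}

/* True if a jump whose last byte sits at offset jcc_last and a return of
   ret_len bytes that starts immediately after it end in the same slot. */
bool shares_slot(unsigned jcc_last, unsigned ret_len)
{
    unsigned ret_last = jcc_last + ret_len; /* offset of the return's last byte */
    return selector_slot(jcc_last) == selector_slot(ret_last);
}
```

In this model a one-byte `ret` lands in the jump's slot whenever the jump ends on an even offset, for example `shares_slot(2, RET_LEN)` is true. A two-byte `rep ret` always ends exactly two bytes after the jump does, so it always falls in the next slot and `shares_slot(n, REP_RET_LEN)` is false for every offset.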

Intel does not have this problem, but (used to) suffer(s) from a limited number of prediction slots, which AMD does not.

nop ret
There is never a reason to do nop ret. That is two instructions, wasting an extra cycle to execute the nop, and the ret might still 'pair' with a jump.
If you want to align, use a REP MOV instead, or use a multibyte nop.

Closing remarks
Only the local branch prediction is stored with instructions in the cache.
There is a separate Global branch prediction table as well.

I think gcc uses `rep ret` if (and only if) RET can run as the next instruction after a branch. (This includes the case of JNE / RET or something, but also cases where there's no jump next to the RET, and it's just a branch *target*.) — Peter Cordes, Oct 05 '16 at 02:04
That's exactly what I wanted to see, numbers. Thanks! Just a question: how would a `nop ret` pair with a jump? I mean, a *subsequent* jump would not be a problem, as per the logic of GCC, right? I would expect that to perform more or less as well as `rep ret`, provided the decoder knows that there is not much to "execute" in a `nop`. I fail to see why that would decode to anything more than exactly zero micro-operations. — The Vee, Oct 05 '16 at 10:31
@TheVee, the nop still takes up resources that a dummy prefix does not. It cannot decode to zero uops, because it still has to move the instruction pointer and it still has to be retired. A prefix does not have these problems. If you jump into the RET, then it can still 'pair' with a jump. If you jump into the preceding nop, then it cannot, but then you're wasting a cycle. — Johan, Oct 05 '16 at 12:18
