I am trying to understand why gcc emits strange-looking instructions such as "rep ret" in its generated assembly. I came across a Stack Overflow question raising the same issue: text. One of the answers linked a detailed article explaining the origin of "rep ret": text. I read the article and tried to apply its explanation to my own question, but it left me even more confused. For example, the article states:
The way the K8's branch predictor works, there are 3 branch predictor entries (branch “selectors”) for every 16 bytes code block, shared by 9 (one every odd byte, plus one for byte 0) branch “indicators”. The branch predictor is linked to the cache, and the 16 bytes code blocks are grouped by 4, which is the size of a cache line (the granularity of the lowest-level cache in a CPU). Branch indicators are two bits, encoding whether the branch is never taken (0), or which selector to use (1, 2, or 3). The branch selector knows whether the branch is a return or a call, but more importantly the branch prediction: never or always jump, or if neither, more advanced information. Obviously, what looks like the best thing to do is to put at most 3 branch instructions per 16 bytes of code, or even better, only one. And this is what advice 6.1 of the K8 optimization guide tells us. However, to return to our single-byte ret, the problem is that it is the only single-byte branch instruction. If a ret is at an odd offset and follows another branch, they will share a branch selector and will therefore be mispredicted (only when the branch was taken at least once, else it would not take up any branch indicator + selector). Otherwise, if it is the target of a branch, and if it is at an even offset but not 16-byte aligned, as all branch indicators are at odd offsets except at byte 0, it will have no branch indicator, thus no branch selector, and will be mispredicted.
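To check my own understanding of this passage, I wrote a tiny toy model of the indicator layout it describes. This is purely my own sketch based on the quoted text, not AMD's actual hardware: it only models the claim that, within each 16-byte block, an indicator exists at byte 0 and at every odd byte, so a branch is only tracked if at least one of its bytes lands on such an offset (I ignore block-crossing instructions and selector sharing for simplicity).

```python
# Toy model (my assumption from the article): within a 16-byte code block,
# branch indicators exist at byte offset 0 and at every odd offset.
INDICATOR_OFFSETS = {0} | set(range(1, 16, 2))  # {0, 1, 3, 5, ..., 15}

def has_branch_indicator(start_addr: int, length: int) -> bool:
    """True if any byte of the instruction falls on an offset that
    owns a branch indicator (toy model, ignores block crossings)."""
    return any((start_addr + i) % 16 in INDICATOR_OFFSETS
               for i in range(length))

# A single-byte ret (C3) at an even, non-16-byte-aligned offset covers
# no indicator offset, so in this model it cannot be predicted:
print(has_branch_indicator(0x10, 1))  # offset 0, 1-byte ret  -> True
print(has_branch_indicator(0x12, 1))  # offset 2, 1-byte ret  -> False
# Any instruction of 2+ bytes spans consecutive offsets, so it always
# touches byte 0 or an odd byte -- e.g. the 2-byte "rep ret" (F3 C3):
print(has_branch_indicator(0x12, 2))  # offsets 2-3, rep ret  -> True
```

If this model matches the article's intent, it would explain why padding ret to two bytes with a rep prefix sidesteps the problem: every 2-byte instruction necessarily covers an indicator-carrying offset. I would appreciate confirmation that this is actually what the article means.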
The article's description of how the K8's branch predictor works, and the reasoning built on top of it, is hard for me to follow. Can anyone give a simpler, easier-to-understand explanation?
Specifically, even after reading the article (text), I still don't understand how the K8 performs branch prediction, or why its branch mispredictions are connected to the single-byte "ret" instruction in particular.