I am trying to understand why gcc emits strange-looking instructions such as "rep ret" in its generated assembly. I came across a Stack Overflow question raising the same issue: text. One of the answers linked a detailed article explaining the origin of "rep ret": text. I read the article and tried to apply its explanation to my own question, but it left me even more confused. For example, the article states:
The way the K8's branch predictor works, there are 3 branch predictor entries (branch “selectors”) for every 16 bytes code block, shared by 9 (one every odd byte, plus one for byte 0) branch “indicators”. The branch predictor is linked to the cache, and the 16 bytes code blocks are grouped by 4, which is the size of a cache line (the granularity of the lowest-level cache in a CPU). Branch indicators are two bits, encoding whether the branch is never taken (0), or which selector to use (1, 2, or 3). The branch selector knows whether the branch is a return or a call, but more importantly the branch prediction: never or always jump, or if neither, more advanced information. Obviously, what looks like the best thing to do is to put at most 3 branch instructions per 16 bytes of code, or even better, only one. And this is what advice 6.1 of the K8 optimization guide tells us. However, to return to our single-byte ret, the problem is that it is the only single-byte branch instruction. If a ret is at an odd offset and follows another branch, they will share a branch selector and will therefore be mispredicted (only when the branch was taken at least once, else it would not take up any branch indicator + selector). Otherwise, if it is the target of a branch, and if it is at an even offset but not 16-byte aligned, as all branch indicators are at odd offsets except at byte 0, it will have no branch indicator, thus no branch selector, and will be mispredicted.
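To check my own understanding of this passage, I wrote a tiny toy model of the indicator layout it describes. This is purely my own sketch based on the quoted text, not AMD's actual hardware: it only models the claim that, within each 16-byte block, an indicator exists at byte 0 and at every odd byte, so a branch is only tracked if at least one of its bytes lands on such an offset (I ignore block-crossing instructions and selector sharing for simplicity).

```python
# Toy model (my assumption from the article): within a 16-byte code block,
# branch indicators exist at byte offset 0 and at every odd offset.
INDICATOR_OFFSETS = {0} | set(range(1, 16, 2))  # {0, 1, 3, 5, ..., 15}

def has_branch_indicator(start_addr: int, length: int) -> bool:
    """True if any byte of the instruction falls on an offset that
    owns a branch indicator (toy model, ignores block crossings)."""
    return any((start_addr + i) % 16 in INDICATOR_OFFSETS
               for i in range(length))

# A single-byte ret (C3) at an even, non-16-byte-aligned offset covers
# no indicator offset, so in this model it cannot be predicted:
print(has_branch_indicator(0x10, 1))  # offset 0, 1-byte ret  -> True
print(has_branch_indicator(0x12, 1))  # offset 2, 1-byte ret  -> False
# Any instruction of 2+ bytes spans consecutive offsets, so it always
# touches byte 0 or an odd byte -- e.g. the 2-byte "rep ret" (F3 C3):
print(has_branch_indicator(0x12, 2))  # offsets 2-3, rep ret  -> True
```

If this model matches the article's intent, it would explain why padding ret to two bytes with a rep prefix sidesteps the problem: every 2-byte instruction necessarily covers an indicator-carrying offset. I would appreciate confirmation that this is actually what the article means.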
The article's description of how the K8's branch predictor works, and the reasoning built on top of it, is hard for me to follow. Can anyone give a simpler, easier-to-understand explanation?
Specifically, even after reading the article (text), I still don't understand how the K8 performs branch prediction, or why its branch mispredictions are connected to the single-byte "ret" instruction in particular.