
This is the second time I'm asking this question; the first time someone did reply, but I took too long to reply back to them and therefore didn't get a full understanding.

What I'm trying to do is learn more about the instruction fetch part of modern architectures; I assume all branches are predicted by the branch predictor, and the instruction fetch unit fetches as per those predictions.

The person who tried to help me mentioned something about a "branch instruction" also being sent along with the predicted instructions. This "branch instruction" tests whether the branch predictor's prediction was correct or not. I also assume that these branch instructions go to the branch execution unit, and DON'T require any loads from memory.

What I don't understand is:

  • How does the branch execution unit know whether the guess was correct or not with this instruction?
  • What happens once it knows it is correct?
  • Does a branch instruction get issued EVERY prediction (basically meaning... EVERY time any prediction is made?)
  • Must a branch prediction go before or after the predicted instruction?
  • Does a branch instruction require any data loaded from memory? If so, what is it?

Thanks!

Peter Cordes
elemein
  • Why don't you start by reading the hardware manufacturer's manuals? For example, Intel provides both an excellent hardware manual and a separate optimization manual that explains their hardware's behaviour in great detail. – Kerrek SB Jul 20 '14 at 17:27
  • I have looked around Intel's (and other) documentation for the answer to this, but haven't been able to find it... :/ If you know a better place to look, please do point me in the right direction though! Thanks! – elemein Jul 20 '14 at 17:29
  • Well... what's unclear? The guess is confirmed once the value is available. If the guess was correct, the executed instructions are committed (their results are only stored locally until then). 3 is the other way round, a branch instruction *causes* a prediction to be made. 4 before, obviously, and there isn't a "predicted instruction", but rather a "predicted branch". 5 No, the BP has its own dedicated on-chip memory. The whole point of branch prediction is to allow execution *before* expensive memory loads complete. – Kerrek SB Jul 20 '14 at 17:32
  • Oh, so from what I am understanding, is that anytime a branch is encountered (if statements, for statements, etc. etc.), the branch predictor kicks in and predicts ahead of time. However, if a branch is not present in a chunk of code, it'll just run through the code in series, completely bypassing the predictor? 5,4 are both answered. Thanks! What is unclear is "How does the branch execution unit know if a branch is correct or not?" How does it know the conditional value? And you're saying that this happens BEFORE a branch is predicted? – elemein Jul 20 '14 at 17:45
  • On x86, branch prediction affects *conditional jump instructions*. Those are conditional on one of the flags. So the prediction is verified once the flag value is available. – Kerrek SB Jul 20 '14 at 17:50
  • Would it be alright to ask for a pseudo example so that I can get a bit of a better grip on what you mean? – elemein Jul 20 '14 at 17:55
  • E.g. `cmp eax, [esi]; je 0x20;` The branch at `je` is predicted and the predicted branch's code starts executing, and the prediction is verified once the fetch from `esi` and the comparison complete. – Kerrek SB Jul 20 '14 at 18:38
  • Mm, I think I have a handle on it. So the branch predictor is only used when there is a branch to be predicted. Otherwise, code continues in serial order. Once a branch comes up, the branch predictor predicts a path. The CPU then loads instructions from the predicted jump location, and the branch has begun issuing; along with a branch instruction. The branch instruction proceeds first (along with the other predicted instructions). This instruction verifies that the condition is met. If so, the branch continues and the BP is updated. If it is incorrect... I don't know after this. – elemein Jul 20 '14 at 18:42
  • The key issue to realize is that instructions are generally executed *out of order*. So the `je` instruction may be executed *before* the `cmp` instruction completes, or even before it is issued. This behaviour is crucial to modern CPUs. It's not the high clock rate that matters, but clever ways of executing instructions. (All this is in the optimization manual.) – Kerrek SB Jul 20 '14 at 18:48
  • Looking around for optimization manuals, I can only find Agner's manual and AMD's 10h manual. Both of which unfortunately don't answer my question. Is the je result committed before the cmp instruction completes, or do they operate independently completely? If that is so, isnt it entirely possible for valid data values to be overwritten with bad values from mispredicted branches? – elemein Jul 20 '14 at 18:52
  • https://www.google.com/search?q=intel+optimization – Kerrek SB Jul 20 '14 at 19:04
  • Looking through the IA-32 + 64 Optimization Guide, the guide does not outline the branch prediction procedure, and outlines mainly how to improve speed when branch prediction is necessary. Though it's more than possible I missed what you're outlining. – elemein Jul 20 '14 at 19:10

1 Answer


I think to answer your question you first need to understand how branch prediction works. To explain that, it's necessary to know why the processor makes those predictions in the first place, which requires an understanding of how the pipeline in a modern processor works. The best way for me to explain that is to start with how a non-pipelined CPU processes instructions.

(Bear with me if I go over things you already know. It's not entirely clear where your confusion regarding branch prediction stems from.)

Older CPUs, along with many simple modern ones, process instructions in a fairly straightforward and obvious way. They break up the actions necessary to execute an instruction into a series of steps. Each instruction is run through these steps and once it's done all of them it moves on to the next. For example, a hypothetical simple non-pipelined CPU might follow a series of steps like this:

  1. Fetch the instruction from memory.
  2. Read in the operands of the instruction.
  3. Execute the operation for that instruction.
  4. Write the result.

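The four steps above can be sketched as a simple interpreter loop. This is a toy machine of my own invention (hypothetical opcodes and register names, not any real ISA), just to show that each instruction runs all four steps to completion before the next one starts:

```python
# A toy non-pipelined CPU: one instruction at a time, all four steps
# in sequence. Instructions are (op, dest, src_a, src_b) tuples.

def run(program, regs):
    pc = 0
    while pc < len(program):
        op, dst, a, b = program[pc]      # 1. fetch the instruction from memory
        x, y = regs[a], regs[b]          # 2. read in the operands
        if op == "add":                  # 3. execute the operation
            result = x + y
        elif op == "sub":
            result = x - y
        regs[dst] = result               # 4. write the result
        pc += 1                          # only now move on to the next one
    return regs

regs = run([("add", "r0", "r1", "r2"), ("sub", "r3", "r0", "r1")],
           {"r0": 0, "r1": 5, "r2": 7, "r3": 0})
```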
It's relatively simple to implement a CPU this way, but it makes very inefficient use of the processor's resources. While the CPU is performing one step in the process of executing an instruction, all the silicon used to implement the other steps is sitting idle.

Modern pipelined processors try to be more efficient by turning the sequence of steps necessary to execute instructions into something like an assembly line. Instructions go through a series of steps, or stages as they're called, just like in a non-pipelined CPU. The difference is that once an instruction has cleared the first stage of the pipeline, the CPU can send another instruction down. This allows several instructions to be in the pipeline at once, hopefully keeping all the parts of the chip well utilized. While an instruction still needs to go through a number of different stages, ideally instructions come out of the pipeline quickly, one after another. Pipelining doesn't improve the time to execute an instruction from start to finish; instead it shortens the time between instructions completing.
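As a rough model of that assembly-line effect (illustrative arithmetic, not a cycle-accurate simulation): with a k-stage pipeline and no stalls, n instructions take k + (n - 1) cycles instead of n * k, because after the first instruction fills the pipeline, one instruction completes every cycle:

```python
def cycles_nonpipelined(n, stages):
    # every instruction runs all stages before the next one starts
    return n * stages

def cycles_pipelined(n, stages):
    # the first instruction fills the pipeline; after that, one
    # instruction completes per cycle
    return stages + (n - 1)

# 100 instructions on a hypothetical 5-stage machine
slow = cycles_nonpipelined(100, 5)  # 500 cycles
fast = cycles_pipelined(100, 5)     # 104 cycles
```

Note that the first instruction still takes all 5 cycles either way; only the completion rate improves.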

Now this is a rather simplistic description of a modern pipeline, one that glosses over a number of issues complicating the design of a modern pipelined CPU. However, as far as branch prediction goes, there's only one complication that needs to be addressed: what to do when executing a branch instruction.

First off, nothing special needs to be done with the branch instruction itself. It can be tossed down the pipeline like any other. Once it's cleared the first stage the CPU can then send down the next instruction - and therein lies the problem. What is the next instruction? Since it's a branch it could go two different ways, and the CPU won't know which until the branch instruction finishes its journey through the pipeline.

The simple thing for the processor to do is wait. Since no other instructions are being fed into the pipeline while it waits, the pipeline empties out. Only when the branch instruction exits the (now empty) pipeline can the CPU resume putting instructions in. That next instruction has to go through all of the stages of the now empty pipeline, resulting in a delay between when the branch instruction finishes and the next instruction does. In this case the instructions don't exit the pipeline in quick succession as ideally they should.

This delay can be quite big on modern processors. It's a function of the number of stages in the pipeline, basically one cycle per stage. Most modern x86 CPUs have around 15 stages in their pipelines, so it would be very costly to implement branches that way. A simple CPU with a really short pipeline might be able to get away with always waiting, but modern processors have to do something else. They make a prediction.
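To put a number on how costly always-waiting would be (illustrative figures of my own choosing, not measurements of any particular CPU): if the ideal pipelined rate is one instruction per cycle, and every branch drains a 15-stage pipeline, even a modest branch frequency wrecks throughput:

```python
def avg_cycles_per_instruction(branch_freq, stall_cycles):
    # ideal pipelined CPI is 1; each branch that waits for the pipeline
    # to empty adds roughly one bubble cycle per pipeline stage
    return 1 + branch_freq * stall_cycles

# ~15-stage pipeline, 1 in 5 instructions a branch, always waiting:
cpi = avg_cycles_per_instruction(0.20, 15)  # 4.0 cycles per instruction
```

That is, the CPU would run roughly 4x slower than its ideal rate, which is why waiting is not an option.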

The simplest form of prediction is static branch prediction. The processor looks only at the branch instruction itself to guess whether it will be taken or not. The simplest form of static prediction is to assume that all branches are not taken, as this is the case more often than not. A more advanced static predictor assumes branches going backwards are taken and ones going forward are not. This makes the assumption that backwards branches are loops and loops are usually executed more than once.
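That backward-taken/forward-not-taken rule is easy to sketch. This is a toy model (real predictors work on instruction encodings, not address pairs), but it shows the logic:

```python
def static_predict_taken(branch_addr, target_addr):
    # Backward branches (target at a lower address) look like loop
    # back-edges, so predict taken; forward branches (e.g. skipping
    # an if-body) predict not-taken.
    return target_addr < branch_addr

back = static_predict_taken(0x400, 0x3F0)   # loop back-edge: predict taken
fwd = static_predict_taken(0x400, 0x420)    # forward skip: predict not taken
```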

Static prediction can work well, but it's still going to make a lot of bad predictions. You can improve on this by employing some sort of dynamic branch prediction. There's all sorts of different ways this can be done, too many to mention here, but they all make a guess based on the history of which way previous branches have gone.

However the CPU ends up making the prediction, it proceeds as if it had made the correct guess. After the branch instruction is put in the pipeline, it starts sending down the instructions it assumes will be executed. This is called speculative execution, and the instructions processed like this are tagged as being speculative. This lets the pipeline know it shouldn't do anything with these instructions that can't be undone.

When the branch instruction gets to the end of the pipeline, the CPU will now know whether its guess was correct. If it made the right prediction, it doesn't need to do anything; it can let the speculatively executed instructions behind it finish their journey through the pipeline. Because it guessed correctly, the branch has no additional cost.

If it guessed wrong, it has to undo anything the speculatively executed instructions might have done, clear the pipeline, and begin sending down the instructions it was supposed to be executing. This results in the same delay as if it hadn't made a prediction at all.
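Putting those two paragraphs together, the commit-or-flush decision at branch resolution can be sketched like this. It's a simplified model of my own (real CPUs use reorder buffers and register renaming): speculative results sit in a buffer and are only made architecturally visible once the branch resolves the right way:

```python
def resolve_branch(predicted_taken, actual_taken, spec_results, regs):
    # spec_results holds the buffered effects of speculatively executed
    # instructions, e.g. {"r1": 42}; nothing irreversible has happened yet.
    if predicted_taken == actual_taken:
        regs.update(spec_results)   # correct guess: commit the results
        return False                # no flush needed
    else:
        spec_results.clear()        # mispredict: squash the speculative work,
        return True                 # flush and refetch down the right path

regs = {"r1": 0}
flushed = resolve_branch(True, True, {"r1": 42}, regs)  # correct prediction
```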

So the answer to your question "How does the branch predictor know if it's not correct?" is that it doesn't know and doesn't care if it makes correct predictions or not. It just makes predictions. If it's a dynamic branch predictor then it will record whether the branch was taken or not taken in its history, but it won't record whether it made the correct decision.

Toby Speight
Ross Ridge
  • Your explanation does piece together many bits of the puzzle very well. One tiny little question though; let's say that a speculated instruction is fetched, decoded, executed AND stored before the branch instruction is done. If the branch instruction turns out that the prediction is incorrect, is the result stored by the preceding instruction changed to its former value (by some mechanism I assume?), or is it left there to mess things up... Or is there a mechanism in place that doesn't allow speculated instructions to finish ahead of the branch instruction? – elemein Jul 21 '14 at 02:33
  • The instructions are tagged as being speculative so the various stages of the pipeline won't do anything with that instruction that the processor can't undo. If it needs to do something like write to memory (which can't be undone, the write will be visible to other processors) then the speculative instruction stalls in the pipeline while it waits for the branch to complete execution. Then it will either have the speculative tag removed and proceed normally or it will be removed from the pipeline. – Ross Ridge Jul 21 '14 at 03:13
  • Thanks! This answers my question completely! – elemein Jul 21 '14 at 04:46
  • This is maybe the best answer to a question on SO, I have ever read. So well written. Props to you, Ross Ridge. – radfast Dec 13 '20 at 15:29
  • @elemein: *all* instructions have to be treated as speculative for out-of-order exec; any instruction could fault, or an interrupt could arrive that we want to handle before waiting for all the in-flight instructions to complete. In-order retirement recovers a consistent state as-if instructions had executed in program order, with support for precise exceptions (being able to roll back to a consistent state for any faulting instruction or mis-predicted branch). – Peter Cordes Sep 29 '21 at 14:55
  • However, to avoid stalling on every store, CPUs use a store buffer so "executing" a store just means writing address+data into the store buffer; commit happens later, once the store is known to be non-speculative (i.e. after it retires). [Can a speculatively executed CPU branch contain opcodes that access RAM?](https://stackoverflow.com/q/64141366). Also related: [What exactly happens when a skylake CPU mispredicts a branch?](https://stackoverflow.com/q/50984007) – Peter Cordes Sep 29 '21 at 14:56