The pipeline execution diagram with or without a delay slot, with predict-taken

Question

I am working on a problem in the topic of The processors. This problem is in the book whose title is "Computer Organization and Design (6th Edition)". The problem is as follows:

Clearly, this problem is about the branch-taken branch predictor, and the requirement for this problem is to draw the execution diagram for each case of "taken" and "not taken" as being commented in the code.

And here is the answer for the question 4.14.1 and 4.14.2:

The problem lies in the answer. As I read about the prediction for branches in MIPS, I thought that, the diagram would be different. For more details:

In the problem 4.14.1, the first "not taken" conditional branch is fine because there is no delay slot (as an assumption), the next instruction fetch occurs only when EX's stage of the branch is done. Nevertheless, the next conditional branch (beq r3, r0, label1) is immediately followed by another conditional branch (beq r2, r0, label2) which conflicts with the requirements of the problem, I think. Moreover, the sw r1, 0(r2) instruction also follows the conditional branch. Why there are so many conflicts here? For more details, how the branch beq r2, r0, label2 could determine immediately that the following instruction is beq r3, r0, label1even when it does not run into EX's stage which conflicts with the first case that lw r3, 0(r2) has to wait until the EX's of previous instruction is finished.
This problem (problem 4.14.2) is a bit different from the previous one. In this problem, delay slots are used which generates the code as in the answer. It is OK for the first conditional branch (beq r2, r0, label2) which is "luckily" not taken so the lw r3, 0(r2) is not discarded and it continues its execution. However, I can not get the idea that the next conditional branch (beq r3, r0, label1) MUST be fetched in clock cycle 6 instead of 5. Is there anything wrong here?
I found that in the case of not using delay slots, the choice of whether the branch is taken or not is made until the EX's stage is done. It is different in the case of using delay slots which can be done in parallel to the EX's stage of the branch instruction. Is it an appropriate observation? If not, can you give me further explanations of the difference between using and not using delay slots?

I also watched some videos and it did not give me the proper explanation. Therefore, I hope you can help me to understand this problem. I am looking forward to your answers. If you do have any missing information about this problem, please inform me.

1) I don't see the conflicts. The branches you mention are executed in that order if there are no delay slots. — Jester, Aug 05 '21 at 11:47
"which conflicts with the requirements of the problem" -- what are the conflicts? The predictor is wrong on the first branch, so it has to flush the pipeline (this is not illustrated, but its effect is seen by the next instruction starting after EX of the branch). When the predictor is correct, however, it can run instruction after instruction without flushing the pipeline, and that's the point of the predictor. — Erik Eidt, Aug 05 '21 at 11:47
@Jester I have updated my problems in 1 and 2. You can consider my thoughts — Hoang Nam, Aug 05 '21 at 12:29
@ErikEidt The conflicts have been updated in my problems. You can consider in section 1 and 2 in my problems for more details — Hoang Nam, Aug 05 '21 at 12:30
I think it would help you to understand the behavior of a processor without branch prediction: the processor would always wait until after a branch's EX stage to run the next instruction, when it knows for sure exactly which instruction to run next. You would then see the behavior you're expecting. — Erik Eidt, Aug 05 '21 at 15:16
However, with a branch predictor, the processor will continue to fetch and execute instructions, guessing at which alternative is correct, and being forced to cancel & restart if & when the prediction is discovered to be incorrect. This explains the difference between the first branch, which is mis-predicted, and the other branches, which are correctly predicted in the first place. — Erik Eidt, Aug 05 '21 at 15:25
@ErikEidt I still don't get it. So do you mean that the branch predictor will give priority to "taken" prediction over "not taken" prediction? I really look forward to your answer in details for all of my problems. — Hoang Nam, Aug 05 '21 at 15:55
Yes, that's exactly the implementation they have stated and are describing. A properly predicted branch will not incur any penalties, where as a mispredicted branch will require the mispredicted instructions to be flushed/cancelled, and the processor has to restart the instruction stream at the proper address. This is why you see a difference between mispredicted branches and properly predicted branches. In this exercise, the branch predictor is making the simple assumption that the branch will be taken, and so the taken branches incur no penalty, but the not-taken ones do. — Erik Eidt, Aug 05 '21 at 16:01
@ErikEidt Excuse me, If it is exactly the implementation of problem (or even of the branch predictor), I find out another conflict. I understand that if the branch predictor give priority to "taken" prediction, it means that, for example, the branch ``beq r2, r0, label2``, if delay slots are used, will be followed by ``label 2: sw r1, 0(r2)``. Is it correct? If it is correct, the problem 4.14.2 does not work as expected. It now gives priority to "not-taken" prediction. For example, the branch ``beq r2, r0, label2`` is followed by ``lw r3, 0(r2)``. — Hoang Nam, Aug 05 '21 at 16:13
If delay slots are used, then the branch instruction doesn't change the control flow until after executing the 1st instruction after, always, regardless of whether there is a branch predictor or not. Delay slots are not widely used anymore b/c better branch prediction is now available (and micro-architectural implementations vary widely but delay slot only helps some implementations whereas hurts others). However, on a simple processor, the delay slot can cover one of the 2 stalls caused by misprediction, when the actual direction of the branch is not known until EX. — Erik Eidt, Aug 05 '21 at 16:18
You have to appreciate that the use of delay slot totally changes the meaning of a sequence of instructions, and that the programmer has to know whether delay slots are being used or not. — Erik Eidt, Aug 05 '21 at 16:19
@ErikEidt You said " branch instruction doesn't change the control flow until after executing the 1st instruction after". It makes me somewhat confused. So is my observation in the previous comment is true or false? I can not figure out the solution for the conflict that I mentioned with your reply. I think that a complete answer post will help me more. I still look forward to your answer post. — Hoang Nam, Aug 05 '21 at 16:46
Ok, normally (and without delay slots) we expect the next instruction after a branch to be either the next sequential instruction or the branch target, depending on the branch condition. With the delay slot, the 1st instruction after a branch, which is "in the delay slot", fully completes before the choice of whether to branch or not branch. This instruction 1st after is always executed, so never cancelled as part of misprediction. So, the sequence of instructions still depends upon whether the branch is taken or not, but the branch is delayed until the next sequential instruction is done. — Erik Eidt, Aug 05 '21 at 16:51
@ErikEidt: It seems the predictor is magically able to predict taken without latency, so it can fetch the target in the same cycle the branch is being decoded. That's definitely unusual, and would normally require dynamic prediction (to predict the *existence* as well as target of a branch before it's decoded). e.g. [Slow jmp-instruction](https://stackoverflow.com/q/38811901) (that's for x86 with many more pipeline stages between fetch and decode, but the principle still applies). — Peter Cordes, Aug 06 '21 at 00:21
Related: [How does MIPS I handle branching on the previous ALU instruction without stalling?](https://stackoverflow.com/q/56586551) - Without prediction, forwarding from EX to IF requires some trickery. Possibly early decode of branch instructions could have the predicted target ready in the first half-cycle to allow 0 cycle branch latency with predict-taken. (Instead of what MIPS I actually does: a branch delay slot to fully hide latency without prediction.) — Peter Cordes, Aug 06 '21 at 00:23
@ErikEidt: Oh hmm, the pipeline diagram does show some stall cycles in the predicted-taken cases for the no-delay-slot case. But it shows a fetch from `label1` in the cycle after fetching the branch to it. Maybe they mean the `***` as redoing the fetch now that a branch has been detected, but I would have shown the `***` as the first cycle of the next instruction, indicating it had a useless fetch cycle and had to redo IF. IDK if that's what the querent was asking about; I didn't read all the details or comments, on the assumption you had it covered. — Peter Cordes, Aug 06 '21 at 00:29
@PeterCordes, yes I don't know how this predictor works, it predicts the taken branch with apparently zero cycles. The only such simple predictor I can think of would predict not taken so as to simply assume sequential access. If such predictor were doing modern address to address based prediction, why stop at predicting only taken instead of something more dynamic & complex.. Still, all of this will only confuse the OP, given some hypothetical exercise that perhaps has no practical implementation (or would be silly to implement without going a few steps further). — Erik Eidt, Aug 06 '21 at 01:26
@ErikEidt: Yeah agreed that dissecting this magic predictor is off-topic. On closer look, the stall in the branch target `beq` might just be the load delay stall, not a stall of fetching the target, so I guess we should just assume it's magic. — Peter Cordes, Aug 06 '21 at 01:32
@ErikEidt Can you explain to me the problem number 2 I mentioned in the post? I find it hard to explain the implementation when delay slots are used — Hoang Nam, Aug 06 '21 at 03:10
@PeterCordes Do you have any ideas for two remaining problems in my post? I still do not get any ideas to explain them? — Hoang Nam, Aug 06 '21 at 04:48
I think you're trying to learn too many new concepts from a very low & complicated level. You should try to learn about how delay slots work independently of pipelining & branch prediction complications. Then once you've learned that, come back to internal processor implementation. — Erik Eidt, Aug 06 '21 at 14:53

The pipeline execution diagram with or without a delay slot, with predict-taken

0 Answers0