1

In which RISC pipeline stage is branch decision been made? Is it in the "Decode" or "Executes" or other stages? Assume the pipeline have 5 stages - "IF", "ID", "EX", "MEM" and "WB".

Jonas
  • 121,568
  • 97
  • 310
  • 388
user3489985
  • 327
  • 4
  • 18
  • @Jonas: I think we should re-add the `[risc]` tag here because there is such a thing as https://en.wikipedia.org/wiki/Classic_RISC_pipeline which is common to multiple classic RISC ISAs, and which this question is specifically asking about. My answer here focused on MIPS because it's the classic RISC that survived the longest in more university teaching use-cases, and in embedded stuff. (But unfortunately my answer is wrong! fixing it...) – Peter Cordes Jan 04 '21 at 10:30
  • @PeterCordes there are many opinions. My thought here is that what you describe is in the topic [cpu-architecture] and that not every thing is worth its own tag. People that are experts on cpu-architecture will know about pipelines within the CPUs, that was my thought. – Jonas Jan 04 '21 at 10:35
  • @Jonas: Ah, good point. Agreed we don't need a tag for `[classic-5-stage-risc]`; cpu-architecture is low-traffic enough to encompass that meaning. I was hesitant to re-add `[risc]` here so only commented about it instead of doing it; apparently my instincts agreed with you before I understood why. :P – Peter Cordes Jan 04 '21 at 10:41

1 Answers1

4

There are a few ways to implement this in a classic 5-stage RISC in general. For unconditional direct (not register) branches, obviously you can detect them in ID and have the target PC ready for the next IF cycle (with 1 cycle of branch latency, i.e. 1 wasted IF cycle if you don't hide that latency somehow, e.g. MIPS's branch delay slot or branch prediction).

Some toy pipelines like described in this answer do the simplest thing and evaluate in ALU in EX, forwarding to a muxer between PC+4 and PC+4+rel_offset and eventually on to IF with 3 cycle branch latency. (End of EX to start of IF)

Actual commercial MIPS I (R2000) evaluated branch conditions in the first half-cycle of EX, forwarding to IF which only needed an address in the second half-cycle. See How does MIPS I handle branching on the previous ALU instruction without stalling? This gives a branch latency of 1 cycle, short enough to be fully hidden by 1 branch-delay slot, even for conditional or indirect jr $reg branches.

This half-cycle speed is why MIPS branch conditions are simple, only checking the whole register for non-zero or not, or checking the MSB (sign bit) for non-zero. Simple RISCs with a FLAGS / status register (like PowerPC or ARM) could use a similar strategy of very quickly checking a flags condition.

(Note that RISC-V allows a full set of branch conditions; as described in RISC-V's design rationale, checking a whole register for all-zeros in modern CMOS designs is apparently not much shorter gate-delay than comparing two registers for equality or even > or < with a good comparator, presumably something smarter than subtract with ripple-carry. RISC-V assumes branch-prediction will hide branch delays.)


The previous version of this answer incorrectly claimed that MIPS I evaluated branch conditions in ID itself. A toy pipeline in this question does that, but that would require the inputs to be ready earlier than usual. It introduces the problem of a b?? instruction stalling while waiting for the EX result of the previous ALU instruction, like in common sequences like slt $at, $t1, $t2 / bnez $at, target, i.e. the expansion of a pseudo-instruction like blt $t1, $t2.

Wikipedia's Classic RISC (5-stage pipeline) article's Instruction Decode section was misleading at best, but has been fixed. It now says "The branch condition is computed in the following cycle (after the register file is read)" - I think that was a bugfix, not just clarification: this is all described in the ID section, implying it happened there without explicit phrasing to the contrary. Also, the still-present claim that "Some architectures made use of the Arithmetic logic unit (ALU) in the Execute stage, at the cost of slightly decreased instruction throughput." makes no sense if it wasn't talking about evaluating them earlier, since nothing else could be using the ALU during that time in a scalar in-order pipeline.


Other sources (like these slides: http://home.deib.polimi.it/santambr/dida/phd/wonderland/2014/doc/PDF/4_BranchHazard_StaticPrediction_V0.pdf) says "Branch Outcome and Branch Target Address are ready at the end of the EX stage (3th stage)" for a classic MIPS beq instruction. That's not how commercial R2000 worked, but may be describing a simple MIPS implementation from a textbook or course material that does work that way.

Much discussion of MIPS is actually about hypothetical MIPS-like 5-stage RISC pipelines in general, not real MIPS R2000, or the classic Stanford MIPS CPU that R2000 was based on (but it was a full re-design). So it's hard to know whether something you find about "MIPS" applies to R2000 (gcc -march=mips1) or if it's for a simplified teaching version of MIPS.

Some "MIPS" implementations aren't even the same ISA, e.g. without branch-delay slots (which complicate exception handling significantly).


This originally wasn't a MIPS question at all, just generic classic 5-stage RISC. There were multiple early RISC ISAs, many of them originally designed around a 5-stage pipeline (https://en.wikipedia.org/wiki/Classic_RISC_pipeline). I don't know a lot about their internals:

Different architectures could make different choices, e.g. stall or use branch prediction + speculative fetch/decode if needed while they wait for the branch result to be ready from whatever stage produces it.

And even speculative execution is possible, even with a static prediction like forward not-taken / backward taken. If still in-order, mis-speculation can be caught before it reaches write-back or MEM. You don't want any speculative stores written to cache, but you can definitely catch it by the time the branch reaches EX. All instructions which have a control dependency on the branch are younger and therefore are in earlier pipeline stages (if present at all; IF could have missed in I-cache).

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847