Data Hazards and stalls

Question

I am studying for my exam tomorrow and I am having difficulty in the below code :

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Due to the ALU-ALU dependency here on Register $2 , The sub instruction does not write its result until the fifth stage, meaning that we would have to waste three clock cycles in the pipeline. My question is why 3 clock cycles ? This dependency can be solved by inserting two nops and therefore we are wasting 2 clock cycles ? Please clarify it to me as I am trying to relate the nops to the wasted cycles and I am sure that I have a huge misunderstanding here .

(This looks like the same ultimate question as [Data hazards and nops insertion](https://stackoverflow.com/q/63523640) - WB to register file in first half of a clock cycle and read in the second, answers both.) — Peter Cordes, Aug 21 '20 at 19:38

Erik Eidt · Answer 1 · 2020-08-21T19:00:21.097

The names of the pipeline stages are somewhat less than standard. More common is to use IF (instruction fetch from Instruction Memory IM), ID (instruction decode and register read), EX (execute/ALU), MEM (Data Memory read or write), and WB (write back register result).

Whether it is 2 vs. 3 clock cycles depends on your internal architecture.

When designed for 3 clock cycle approach, the processor will have to stall for 3 cycles so that the reg write (WB) of the first instruction (cycle 5) fully completes before the reg read (ID) of the second instruction can see those up-to-date values. Without other mitigation, the reg read stage of the second instruction cannot always get guaranteed proper values until clock cycle 6 (whereas without the dependency/hazard, the second instruction would have preferably done reg read (ID) in cycle 3 and was delayed until 6, so 6-3=3 cycle delay).

When designed for 2 clock cycles, that means the the reg write (WB) stage, stage 5 of the first instruction in clock cycle 5, overlaps with the decode (ID) reg read stage of the second instruction also in clock cycle 5.

The reason this works in one of the two ways following:

Half cycles
- The WB stage is very quick, and, its data is ready at the absolute beginning of the clock cycle (nothing has to be computed at all) — it effectively is completed in the first half of cycle 5 (this example).
- The ID stage is slow and gets the data in the second half of cycle 5 — thus, it obtains data that is up-to-date as of reg writes that happen in this cycle.
Register forwarding
- The WB stage's data is available at the absolute beginning of the clock cycle, though WB takes the full cycle.
- The ID stage accesses the register file, but the register file knows about the write happening in this cycle and has an internal forward to ensure that reads read either register values or new values that are being written. The register file effectively has an internal bypass allowing written values to be read in the same cycle.

We speak of these stalls as conditional — if we didn't then every instruction would stall for 2-3 cycles just in case the next subsequent instruction used the ALU result. This would almost (but perhaps not quite) negate the benefit of pipelining.

Let's note that I assert that the conditional logic to detect when we do need a stall due to ALU/ALU dependency is about as complicated as the logic to do a ALU/ALU bypass (because they are the same test). Since the bypass totally mitigates the performance issue, a designer would always prefer the bypass.

The idea of the bypass is that the value needed in the second instruction's EX/ALU stage is actually available somewhere in the CPU — as it has been computed already in the prior clock cycle (that value is just not in the right place). The problem is that the reg read (ID) for the second instruction obtained stale values: refreshing them (ignoring those stale values and taking them directly from the ALU output) when appropriate is the bypass solution.

So, I find it strange to talk about stalls for ALU/ALU dependency/hazard, when mitigating solutions exist (e.g. bypass), I don't think even the original MIPS required software NOP mitigation for an ALU/ALU hazard. (It did for load followed by use, which is a MEM/ALU hazard that requires both a bypass and a stall, the later of which was not provided in the original MIPS, so software had to ensure the use was separated from the load by at least one instruction possibly by inserting a NOP).

The OP's previous question makes clear they're talking about hypothetical MIPS-like pipelines, not actual MIPS. [Data hazards and nops insertion](https://stackoverflow.com/q/63523640). Indeed, real MIPS does bypass forwarding, but exposed a software hazard in the form of load delay slots instead of building hardware to detect it and stall. A hypothetical MIPS without ALU-ALU forwarding would need software to fill those delay slots to avoid hazards, so you don't have to build hazard-detection hardware and then foolishly use it to stall instead of forward. — Peter Cordes, Aug 21 '20 at 19:43
@PeterCordes, thanks! I did not see the prior post, but I get that there is some hypothetical. It would be nice for readers to understand some of these are less than fully sensible. — Erik Eidt, Aug 21 '20 at 20:07
Yeah, looking at exactly how much it sucks *not* to have forwarding (except when there's a lot of fine-grained instruction-level parallelism) makes it clear why all(?) real-world CPUs do. — Peter Cordes, Aug 21 '20 at 20:15

Michael Dorgan · Answer 2 · 2020-08-21T18:03:42.703

0

From https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec12.pdf

Basically, You are losing 2 cycles to the first instruction affected by the hazard (the and) and 1 cycle to the following (the or). The stall here hits the whole pipeline, not just the following instruction.

See pages 8 - 10 of the PDF for a picture of this. Here is a quick ASCII of it:

     1   2   3   4   5  
sub IM  Reg ==> DM  Reg 
and     IM   X   X  Reg 
or          IM   X  Reg

Where the Xs represent stalls. Note that the and and the or are stalled at the 2nd stage of their pipeline awaiting the result from stage 5 of the sub instruction.

edited Aug 21 '20 at 18:03

answered Aug 21 '20 at 17:29

Michael Dorgan

12,453
3
31
61

Many thanks , but why in your ASCII representation you have started the IM stage of the or instruction at the fourth stage ? I think it suppose to start at the fifth stage since the pipeline is stalled at the third and fourth cycle ? – AAA Aug 21 '20 at 17:54
IM starts at the beginning of each cycle per instruction, from what I saw in the pdf. So IM for `or` would be during the "arrow" stage on cycle 3 of the `sub` instruction. So the `or` instruction executes IM on cycle 3, but stalls for 1 cycle on cycle 4 awaiting the results of cycle 5 of the original `sub` instruction. It cannot do its `reg` stage without that result. – Michael Dorgan Aug 21 '20 at 17:58
As an aside, adding 2 NOPs does remove the stall. The first NOP eats 2 cycles of stalls (1 from the `and` and 1 from the `or` instruction), while the 2nd NOP would consumes the 2nd `and`'s stall. Don't think of them as NOPs though, think instead of other valuable instructions that could be done at the same time as long as they don't touch $2. – Michael Dorgan Aug 21 '20 at 18:02
Also made the 3rd stage IM better aligned. ASCII was off by 1 space... – Michael Dorgan Aug 21 '20 at 18:04

Data Hazards and stalls

2 Answers2