1

Currently working on understanding MIPS architecture and assembly language, I've been asked to place NOPs in the following assembly code:

1 ADD R2,R1,R3
2 SW R3,0(R2)
3 ADD R4,R3,R2
4 LOOP: LW R8,2000(R4)
5 SUB R5,R4,R8
6 ORI R23,R8,370
7 ADD R19,R5,R8
8 LW R7,1000(R16)
9 SW R7,2000(R16)
10 LW R15,1000(R7)
11 SW R14,2000(R15)
12 SUBI R15,R15,99
13 AND R6,R14,R15
14 SLT R21,R11,R28
15 BEQ R21,R6,LOOP

I've placed the NOPs according to how I've been taught, and got this:

1        ADD R2,R1,R3
            3 NOPs
2        SW R3,0(R2)
3        ADD R4,R3,R2
            3 NOPs
4        LOOP: LW R8,2000(R4)
              3 NOPs
5        SUB R5,R4,R8 
6        ORI R23,R8,370
            2 NOPs
7        ADD R19,R5,R8
8        LW R7,1000(R16)
            3 NOPs
9        SW R7,2000(R16)
10       LW R15,1000(R7)
            3 NOPs
11       SW R14,2000(R15) 
12       SUBI R15,R15,99 
            3 NOPs
13       AND R6,R14,R15 
14       SLT R21,R11,R28 
            3 NOPs
15       BEQ R21,R6,LOOP

This feels awfully off to me, I believe there is a way to place less NOPs and still handle all RAW data hazards in this code.

EDIT:

I've been able to reduce the number of NOPs by 3, after seeing that line 2 and 3 has no dependencies, due to SW not chaning the value stored in R3.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Aishgadol
  • 147
  • 6
  • 1
    What kind of MIPS pipeline? 5-stage [classic RISC](https://en.wikipedia.org/wiki/Classic_RISC_pipeline) but without bypass forwarding? (Or I guess it's still a RAW hazard even if forwarding does handle it.) Can you reorder the instructions instead of just adding NOPs to show where this ordering would stall without forwarding? – Peter Cordes Dec 25 '22 at 11:11
  • A superscalar out-of-order MIPS like [R10000 (4-wide pipeline)](https://en.wikipedia.org/wiki/R10000) could see RAW dependencies over longer distances, and just a few NOPs wouldn't help. Architecturally, MIPS only defines a load delay slot (in MIPS I, removed in MIPS II) of 1 instruction after a load. Anything else depends on pipeline details that you didn't specify. – Peter Cordes Dec 25 '22 at 11:14
  • Apologies for not mentioning the specifics. Yes, The classic RISC 5-stage MIPs. I am unsure about bypass forwarding, I assume this subject is yet to be taught, or perhaps wont be taught at all. Reordering the instructions (rescheduling if i'm correct) is the following question, I'm asked to place NOPs in this code and then, reschedule the code in such a way that the number of NOPs is to be minimal. – Aishgadol Dec 25 '22 at 11:15
  • Yes, reordering to avoid stalls (or hazards) is called instruction scheduling. Bypass forwarding is what lets real MIPS CPUs not stall all the time when running a sequence of instructions on the same integer value, like is very common in real code: the output of the ALU can forward back into the ALU for the next cycle. ([Including slt/beq](https://stackoverflow.com/questions/56586551/how-does-mips-i-handle-branching-on-the-previous-alu-instruction-without-stallin)) The ID stage detects this, tracking which registers have been written by recent instructions so it can set up this forwarding. – Peter Cordes Dec 25 '22 at 11:19
  • Presumably the point of this exercise is to show you that CPUs would have tons of stalls that are hard to schedule around if they didn't have bypass forwarding, like only highly unrolled loops could spend most of their cycles doing work instead of being stalled. https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing – Peter Cordes Dec 25 '22 at 11:20
  • Regarding the way I've placed the NOPs in the code above, do you see a way I can reduce the number of NOPs? (keeping the same order of instructions) – Aishgadol Dec 25 '22 at 11:24
  • No, looks to me like all those NOPs are required, since 3 cycles after decoding one instruction is when it writes back in a classic 5-stage RISC. That's the cycle another instruction can enter ID and read the value from the register file. Like I said, it's normal for instructions to use the result of the previous one, especially if it wasn't a load, that's why real CPUs need to avoid having to stall like this to not suck. (Also, this code is intentionally badly scheduled (especially store right after load), to give you something to do in the next step of the assignment.) – Peter Cordes Dec 25 '22 at 11:39
  • Indeed, thank you very much for confirming my solution's legitness! Right now I'm working on rescheduling it, I'll update the question later with the result I've got. – Aishgadol Dec 25 '22 at 11:48
  • Why 2 NOPs between 6 & 7? – Erik Eidt Dec 25 '22 at 15:08
  • 1
    @ErikEidt Line 5 writes to R5, line 7 reads from R5 to write to R19, there is a dependency. – Aishgadol Dec 25 '22 at 15:39
  • @Aishgadol, oops, missed that.. – Erik Eidt Dec 25 '22 at 18:41
  • @PeterCordes I cant seem to find a way to efficiently reduce the number of NOPs by rescheduling, most I've been able to reduce is around 5-6 NOPs, which feels to me like alot more could be done. If you were to reschedule the MIPS assembly code I've shown above, what would you do? – Aishgadol Dec 26 '22 at 08:13

1 Answers1

2

This feels awfully off to me, I believe there is a way to place less NOPs and still handle all RAW data hazards in this code.

Looks to me like all those NOPs are required, since 3 cycles after decoding one instruction is when it writes back in a classic 5-stage RISC. That's the cycle another instruction can enter ID and read the value from the register file without any special detection of the RAW hazard to set up bypass forwarding or to stall the pipeline.

The cases of 2 NOPs look correctly placed where there's one independent instruction between an instruction and the one consuming its result.

It's normal for instructions to use the result of the previous one, especially if it wasn't a load, that's why real CPUs need to avoid having to stall like this to not suck. e.g. first-gen commercial MIPS R2000 had bypass forwarding, with no stalls for anything except cache-miss load/store, or mult / div. An instruction could use the result of the previous instruction with no penalty.

It didn't even stall for loads, instead software had to respect the "load delay slot", not using a load result until 1 instruction later (otherwise there was no guarantee whether you'd see the old or new register value, depending on cache miss!). Later MIPS revisions changed that, since compilers often couldn't fill it with anything except NOP, and that hurt I-cache density. The branch delay slot was another matter; it couldn't be changed, but was another part of what let classic MIPS hide branch latency and not need branch prediction, forwarding from the first half-cycle of beq to the 2nd half-cycle of the IF stage.

(Also, this code is intentionally badly scheduled (especially store right after load), to give you something to do in the next step of the assignment. It's also totally artificial, computing some results that aren't used inside the loop and could just be done after (aka sunk out of the loop), or done before (hoisted) like 14 SLT R21,R11,R28 which doesn't depend on any of the other instructions. But that would make scheduling "harder", because these useless instructions work as filler like NOPs.)

Presumably the point of this exercise is to show you that CPUs would have tons of stalls that are hard to schedule around if they didn't have bypass forwarding. Mostly only highly unrolled (and software-pipelined) loops could spend most of their cycles doing work instead of being stalled. https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing

I cant seem to find a way to efficiently reduce the number of NOPs by rescheduling, most I've been able to reduce is around 5-6 NOPs, which feels to me like a lot more could be done. If you were to reschedule the MIPS assembly code I've shown above, what would you do?

There are a few WAW / WAR hazards that prevent reordering; if we could use different registers as well as reordering that would open up more possibilities. e.g. SUBI R15,R15,99 could subtract into a temporary so it could be done before / in parallel with SW R14,2000(R15). Or the store could use 2099(R15) in the addressing mode to offset the subtract. (See How to unroll a loop of a dot product in mips after re-ordering instructions? for an example of that kind of optimization).

I just realized I haven't been considering the possibility of memory aliasing when reordering loads and stores wrt. each other. e.g. if SW R7,2000(R16) / LW R15,1000(R7) could be a reload of that store, scheduling the store later would change the program. Let's pretend that we can assume no aliasing.

After looking at the dependency chains and using that to figure out which instructions to do earliest (starts of the longest dep chains first):

## Without considering memory aliasing, reordering some stores after loads.

1        ADD R2,R1,R3
            3 NOPs
3        ADD R4,R3,R2  # dep on 1 (R2).  R4 is read-only after this, in the loop
2        SW R3,0(R2)   # dep on 1 (R2)
            1 NOPs    # put this NOP outside the loop because 4 only needs to stall on the first iter.
    LOOP: 
8        LW R7,1000(R16)       # no deps on anything in the shown code
4        LW R8,2000(R4)     # dep on 3 (R4) before loop.  R8 is only written here.
14       SLT R21,R11,R28    # independent of everything.
              1 NOPs
10       LW R15,1000(R7)       # dep on 8 (R7)
5        SUB R5,R4,R8          # dep on 4 (R8).  (which also depends on the same R4 value from 3 we read)
9        SW R7,2000(R16)       # dep on 8 (R7).  R16 is read-only
6        ORI R23,R8,370        # dep on 4 (R8).  R23 is write-only
11       SW R14,2000(R15)      # dep on 10 (R15).  R14 is read-only
12       SUBI R15,R15,99       # dep on 10 (R15), WAR on 11
7        ADD R19,R5,R8         # dep on 5 (R5), and 4 (R8) indirectly via 5 as well as directly.  R19 is write-only
            2 NOPs
13       AND R6,R14,R15        # dep on 12 (R15).  R14 is read-only
            3 NOPs
15       BEQ R21,R6,LOOP       # dep on 14 (R21), 13 (R6)

6 NOPs inside the loop, down from 17. (And saved 2 NOPs before the loop, too, which helps for code size / I-cache footprint.)

This may not be optimal, but I don't see any more movements of single instructions that would save anything. Converting 11 SW R14,2000(R15) to SW R14,2099(R15) would allow 12 SUBI R15,R15,99 to start a cycle sooner, and the store could fill one of the stall slots near the tail of the loop. And/or subi into a different register would allow the sub to run first, allowing the same benefit without lengthening the dependency chain.

Dependency chain involving R15: 8 LW R7,1000(R16) -> 10 LW R15, 1000(R7) -> 12 SUBI R15,R15,99 -> 13 AND R6,R14,R15 -> 15 BEQ R21,R6,LOOP. With 11 SW R14,2000(R15) also reading the R15 load result, before SUBI.

Dep chain starting with 4 LW R8,2000(R4): nothing writes R4 in the loop, so this load could just get hoisted out of the loop, unless some other pointer aliases the same address. Multiple things read R8, but only the load writes it. From those reads of R8:

  • 6 ORI R23,R8,370 is a dead end, nothing reads R23 (so we should sink it out of the loop in case something after the loop needs R8 | 370 in R23 later; we know nothing in the loop does.)

  • 5 SUB R5,R4,R8 -> 7 ADD R19,R5,R8 dead ends there; nothing reads R5 or R19 so those could similarly be sunk out of the loop. (R4 and R8 will still have the same value after the loop.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • This is an incredible explanation! Is the code (after rescheduling) you've displayed in your comment only relying on the original code without any changes to the code itself? I agree that some changes (for example `11 SW R14,2000(R15)` to `SW R14,2099(R15)` ) are indeed able to reduce times even further. – Aishgadol Dec 26 '22 at 15:03
  • 1
    @Aishgadol: The code block doesn't have any changes to any instructions, just reordering of whole lines. – Peter Cordes Dec 26 '22 at 16:32