
I came across the following MIPS code which has data and load use hazards.

0 addi $t0,$a0,4
1 addi $t1,$a1,4
2 sub  $t2,$t0,$t1                #data hazard $t0, data hazard $t1
3 sll  $t3,$a2,2  
4 add  $t4,$t0,$t3                #data hazard $t3
5 add  $t5,$t1,$t3                #data hazard $t3
6 sw   $t2,0($t4)                 #data hazard $t4

To resolve the hazards I could add 5 NOPs (2 after line 1, 2 after line 3 and 1 after line 5), or I could rewrite the code to avoid some of the hazards altogether. Rearranging the code to minimise the number of NOPs, I get:

0 addi $t0,$a0,4
1 addi $t1,$a1,4
3 sll  $t3,$a2,2
2 sub  $t2,$t0,$t1                #data hazard $t1
4 add  $t4,$t0,$t3                
5 add  $t5,$t1,$t3                
6 sw   $t2,0($t4)                 #data hazard $t4

The two hazards would then be resolved by adding 1 NOP after line 3 and 1 NOP after line 5. However, this code is relatively simple and short. What if I am given 20 lines of MIPS code in the exam? Is there a faster way, or a rule, that makes rearranging the code by hand easier and faster?

  • Depends on what's the goal of the exam exercise. Do you have to minimize the number of NOPs? In that case, since the solution involves re-ordering instructions, there's no general algorithm, it's pretty complicated and can only be done by hand (and even then, you can only have a certain confidence with the result). – Marco Bonelli Feb 08 '20 at 18:06
  • Real MIPS has bypass forwarding and 1-cycle latency `add`. (First-gen MIPS didn't have interlocks and did have a delay slot after loads, as well as branches). What does your RISC pipeline have that requires *2* NOPs between that `addi` and a `sub` that consumes the result? https://en.wikipedia.org/wiki/Classic_RISC_pipeline. (And do you really mean NOPs, or stall cycles detected by hardware interlocking? Forcing compilers to insert NOPs everywhere on a pipeline without bypass forwarding seems hugely inefficient for I-cache footprint. Doesn't change the exercise, though.) – Peter Cordes Feb 08 '20 at 18:14
  • Apparently classic MIPS assemblers could rearrange your code for you to fill load and branch delay slots; see this book: [See MIPS Run](https://doc.lagout.org/electronics/doc/mips/See%20mips%20run.pdf). Presumably they detect basic blocks in terms of labels and branch instructions, and do instruction scheduling within single blocks. Good question about what algorithms they use; modern compilers like GCC do instruction scheduling, too. (With modern assemblers leaving that to compilers and humans.) – Peter Cordes Feb 08 '20 at 18:17
  • I'd guess that getting started on the longest dependency chain first would usually be a good start. Identify the independent dep chains and interleave them. By hand you can even do stuff like rearrange `add` and other associative operations to create different temporaries, creating more ILP (instruction-level parallelism.) – Peter Cordes Feb 08 '20 at 18:19
  • You are correct in saying that adding NOPs to resolve every data hazard would be incredibly inefficient and that MIPS can actually identify such hazards and rearrange them. The idea was that we recently learnt MIPS coding, and as an introduction to pipelining we were asked about rearranging the code to reduce the number of NOPs. Of course I could use stall cycles and data forwarding, but I just wanted to be prepared in case such a question comes up in an exam. To that end, your point about identifying the dependencies and interleaving them is quite useful. Thank you! – Learn To Adapt Feb 08 '20 at 18:25
  • This code sequence lends itself to a fair amount of optimization, but of course, we also need to know if the registers are going to be used in subsequent code (if not, some of these instructions can be removed). – Erik Eidt Feb 08 '20 at 20:53
  • The subtract takes the difference between two items that have been incremented by 4, so could just as easily operate on the non-incremented $a registers instead (modulo overflow, perhaps, but pointer arithmetic should have been done unsigned/no overflow anyway). The store word doesn't use a displacement (just 0), and it could use 4 as the offset, which could remove the need to add 4 to $a0 in the first place. (and $t5 isn't even shown as being used...) – Erik Eidt Feb 08 '20 at 20:53

1 Answer


For an algorithm for instruction scheduling, you first need to identify the dependency chains. (Same as you would when identifying latency critical paths for out-of-order exec.)

To schedule instructions for an in-order machine: interleave instructions from different dep chains, starting with the longest one.
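That rule can be mechanized as a greedy list scheduler. Here is a minimal Python sketch under your exercise's hazard model (a result is only usable 3 slots after its producer issues, i.e. 2 instructions or NOPs in between); the instruction list and dependencies are your example's, and the priority function is exactly "longest dependency chain first":

```python
# Greedy list scheduling under the question's hazard model:
# a result is only usable LATENCY slots after its producer issues
# (LATENCY = 3 means 2 instructions/NOPs must sit in between).
LATENCY = 3

insns = [
    "addi $t0,$a0,4",    # 0
    "addi $t1,$a1,4",    # 1
    "sub  $t2,$t0,$t1",  # 2: depends on 0, 1
    "sll  $t3,$a2,2",    # 3
    "add  $t4,$t0,$t3",  # 4: depends on 0, 3
    "add  $t5,$t1,$t3",  # 5: depends on 1, 3
    "sw   $t2,0($t4)",   # 6: depends on 2, 4
]
deps = {0: [], 1: [], 2: [0, 1], 3: [], 4: [0, 3], 5: [1, 3], 6: [2, 4]}

def depth(i, _memo={}):
    """Length of the longest dependency chain starting at instruction i."""
    if i not in _memo:
        users = [j for j, ds in deps.items() if i in ds]
        _memo[i] = 1 + max((depth(j) for j in users), default=0)
    return _memo[i]

def schedule():
    issued, out, slot = {}, [], 0
    while len(issued) < len(insns):
        # An instruction is ready once all its inputs were issued
        # at least LATENCY slots ago.
        ready = [i for i in deps if i not in issued
                 and all(d in issued and slot - issued[d] >= LATENCY
                         for d in deps[i])]
        if ready:
            pick = max(ready, key=depth)   # longest chain first
            issued[pick] = slot
            out.append(insns[pick])
        else:
            out.append("nop")              # nothing ready: bubble
        slot += 1
    return out

for line in schedule():
    print(line)
```

On this input the greedy scheduler reproduces the 2-NOP schedule you found by hand (one NOP after the `sll`, one before the `sw`). This is only a sketch: a real scheduler would also track anti- and output dependences and per-instruction latencies.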

When hand-tuning (or in an optimizing compiler) you can even do stuff like rearrange associative operations (like add) to create different temporaries, creating more ILP (instruction-level parallelism). e.g. turning a+b+c+d into (a+b)+(c+d). Integer math is associative; floating point math generally isn't.

Of course this is only safe if the code uses addu / addiu, not MIPS's trap-on-signed-overflow add/addi. C compilers never use trapping add/sub so they can freely optimize arithmetic with different temporaries than the C source. You should, too, unless you specifically want an instruction to possibly trap on signed overflow.
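As a toy illustration of the critical-path win (hypothetical expression trees, not real compiler IR), compare the dependency-chain depths of the two associations, counting 1 cycle per add:

```python
# Depth of the add dependency chain for two associations of a+b+c+d.
# With 1-cycle add latency, depth == cycles on the critical path.
def tree_depth(expr):
    if isinstance(expr, str):          # leaf: a register, already available
        return 0
    _, left, right = expr              # ('+', left, right)
    return 1 + max(tree_depth(left), tree_depth(right))

serial   = ('+', ('+', ('+', 'a', 'b'), 'c'), 'd')   # ((a+b)+c)+d
balanced = ('+', ('+', 'a', 'b'), ('+', 'c', 'd'))   # (a+b)+(c+d)
print(tree_depth(serial), tree_depth(balanced))      # → 3 2
```

In MIPS terms, the balanced form computes `a+b` and `c+d` from independent inputs (two `addu` into separate temporaries), so those two adds can overlap, leaving only the final combining `addu` on the critical path.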


Apparently classic MIPS assemblers could rearrange your code for you to fill load and branch delay slots; this Silicon Graphics assembler manual from 1992 goes into some detail about aggressive instruction reordering by the assembler (unless you use `.set noreorder`, in which case branch-delay slots become visible). The book See MIPS Run may also mention this.

Presumably the SGI assembler detects basic blocks in terms of labels and branch instructions, and does instruction scheduling within single blocks.
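A sketch of that basic-block detection (my guess at the general approach, not SGI's actual code): start a new block at every label and end one after every branch or jump. For simplicity this ignores the delay slot, which a real MIPS assembler would keep attached to its branch:

```python
# Split a MIPS listing into basic blocks: a block starts at a label
# and ends after a branch/jump (delay slots ignored for simplicity).
BRANCHES = {"beq", "bne", "blez", "bgtz", "j", "jal", "jr"}

def basic_blocks(lines):
    blocks, cur = [], []
    for line in lines:
        op = line.split()[0]
        if op.endswith(":"):       # label: control can enter here
            if cur:
                blocks.append(cur)
            cur = [line]
        else:
            cur.append(line)
            if op in BRANCHES:     # control can leave here
                blocks.append(cur)
                cur = []
    if cur:
        blocks.append(cur)
    return blocks

prog = [
    "loop:",
    "addi $t0,$t0,1",
    "bne  $t0,$t1,loop",
    "sw   $t0,0($a0)",
]
print(basic_blocks(prog))   # two blocks: the loop body, then the store
```

A scheduler can then safely reorder instructions within each block, since no control flow enters or leaves mid-block.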

Of course good compilers for high-level languages also do instruction scheduling. (For example GCC). I don't think the GNU assembler has an optimizing reordering mode; it's designed as a back-end for a compiler that schedules instructions itself.


Unlike your toy example, where an `add` result has multi-cycle latency before it can be used, real classic MIPS was a 5-stage RISC with bypass forwarding and single-cycle ALU latency. First-gen MIPS had no interlocks, so there was a load-delay slot of 1 cycle, and branch-delay slots were architecturally visible. Since simple ALU instructions have 1-cycle latency, real MIPS had far fewer hazards to avoid when scheduling instructions. Even so, a later MIPS revision removed the load-delay slot (stalling instead of needing a NOP) for better code density, because the relatively primitive compilers of the day often couldn't find anything useful to put there.

A real machine with this many delay slots (and no hardware interlocking to stall) would be very impractical for code density / L1i cache footprint, as well as having bad performance. There's a reason real-world commercial designs do bypass forwarding instead of stalling. But your example of effectively multi-cycle latency is realistic for floating-point.

(Fun fact: MIPS can forward a branch-target address from the first half of EX to a half-cycle-offset IF, for a total of only 1 cycle of branch latency.)

MIPS was stuck with branch-delay slots until a major (and not backwards-compatible) reorganization of the opcodes introduced no-delay-slot branches (MIPS32/64r6, in 2014). The instruction in the branch-delay slot executes whether the branch is taken or not, so later MIPS hardware couldn't drop it the way it could the load-delay slot. RISC-V avoided this mistake.

Peter Cordes