
I got this question about loop unrolling in MIPS, but I could not figure out what to do once I got to the step I show below; I am not even sure about these steps. I am new to computer architecture. I was given this code snippet in assembly:

Loop: ld    $f0, 0($s1)     # load a double from the first array
      ld    $f4, 0($s2)     # load a double from the second array
      multd $f0, $f0, $f4   # product of the pair
      addd  $f2, $f0, $f2   # accumulate the sum in $f2
      subi  $s1, $s1, 8     # step both pointers down by one double
      subi  $s2, $s2, 8
      bneqz $s1, Loop       # loop until $s1 reaches 0
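
In C terms (my reading of the asm, for reference; names are mine), the loop computes something like a dot product, walking both arrays backwards until the first pointer hits address 0:

    double dot_backwards(double *s1, double *s2) {
        double f2 = 0.0;              // $f2: the running sum
        do {
            f2 += *s1 * *s2;          // ld / ld / multd / addd
            s1--; s2--;               // subi by 8 bytes = one double
        } while (s1 != 0);            // bneqz $s1: loop until the pointer is 0
        return f2;
    }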

The following additional information is given:

  • a one-cycle delayed branch, and the following table (a latency of N means N independent instructions, or stall cycles, are needed between the producing instruction and the consuming one):

| Instruction producing result | Instruction using result | Latency in clock cycles |
| ---------------------------- | ------------------------ | ----------------------- |
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Load double                  | Store double             | 0                       |

So what I did was insert the stalls and then try to reorder the instructions:

Loop: ld $f0, 0($s1)
      ld $f4, 0($s2)
      STALL                  # load -> FP ALU: 1 stall after ld $f4
      multd $f0, $f0, $f4
      STALL                  # FP ALU -> FP ALU: 3 stalls before addd
      STALL
      STALL
      addd $f2, $f0, $f2
      subi $s1, $s1, 8
      subi $s2, $s2, 8
      bneqz $s1, Loop

I moved the addd instruction to after the bneqz instruction, and then I got stuck. Can anyone help explain?

  • I suggest unrolling first; then there are more opportunities for covering stalls. – Erik Eidt Dec 14 '22 at 03:16
  • Your loop condition is weird. You're starting from the end of the arrays, and looping until `$s1 == NULL`? Does that array actually start at address `0`? Normally you'd have an end-pointer (arr1+size) in another register, and loop `do { ptr1++ ; ptr2++; }while(ptr1 != endp1);`. Also, your code without reordering isn't showing anything in the branch-delay slot, so it would run whatever instruction comes next! You need a nop there as a placeholder before reordering. – Peter Cordes Dec 14 '22 at 03:44
  • Anyway, you need multiple FP "accumulators", instead of one sum in `$f2`, use multiple registers that you add together outside the loop. That way you can hide the latency of the ADD dependency chain. (After unrolling to reduce loop overhead, you should be running into a latency bottleneck, instead of just front-end even on a scalar pipeline that can't run more than one instruction per clock). Related: [Unrolling FP loops with multiple accumulators](https://stackoverflow.com/q/45113527) on modern x86, which can run 2 FMAs per clock with 5c latency, so needs at least 10 registers on Haswell. – Peter Cordes Dec 14 '22 at 03:47
  • On a scalar MIPS (1 IPC), looks like just 2 accumulators are needed to hide FP latency. 4 loads, from `0($s1)` and `8($s1)` etc. then two `multd`, then the integer `addiu` or `subiu`, then two `addd` (one of them in the branch delay slot). I'm not posting code so I don't spoil the fun of working it out yourself. More unrolling could amortize the loop overhead over more FP mul/add instructions, gaining a bit more throughput, but costing more startup code to reach a point where the number of elements left is a multiple of 3 instead of just 2. (2 each mul+add per 11 instructions = 4/11 FLOP/c) – Peter Cordes Dec 14 '22 at 03:58
  • @PeterCordes I kind of get your point: unroll the loop two times. I just did that, but now I am getting delays that cannot be avoided. The question explicitly said there is a way to unroll this loop a sufficient number of times to avoid any delays at all. It did not work for me, or I am doing something wrong again. – 0xMe Dec 14 '22 at 04:42
  • Are you summing into two separate registers, like `addd $f0, $f0, $f6` / `bne` / `addd $f2, $f2, $f10`? Using multiple accumulators is the key to hiding ALU latency when unrolling reductions such as dot products or sums of arrays. My solution includes those three instructions at the bottom of the loop, and avoids any stall cycles. Maybe [edit] your question with your unrolling attempt. – Peter Cordes Dec 14 '22 at 04:45
  • Huh? Then why did you title the question "loop unrolling" if you're not allowed to unroll? – Peter Cordes Dec 14 '22 at 04:50
  • With the 2 loads at the top, the best I could do with reordering is 1 stall cycle, with one pointer increment between loads and multd, and the other and the branch between multd and addd. So you probably need to software-pipeline the loop, starting with data already loaded for the multd, then starting the loads right after the multd so they're part of what fills the multd latency. That should avoid stalls without unrolling, although of course lower throughput, 2 FLOP per 7 instructions instead of 4 per 11. – Peter Cordes Dec 14 '22 at 04:52
  • @PeterCordes I am so sorry, I misunderstood you; yes, we are allowed to unroll. I thought you were suggesting something different from my example; my bad. So, two different accumulators for the sum of the products, but shouldn't I use another instruction to add them up? – 0xMe Dec 14 '22 at 04:53
  • That extra instruction to add them goes outside the loop, so it runs once for the whole dot-product operation, not per-element. But like I said, there is a way to schedule this loop to avoid stalls without unrolling, if you're on a CPU that only runs 1 instruction per clock at most. (I'm used to optimizing for CPUs that run 4 IPC or more, with at most 2 of them being FP ops, so it's possible to keep the FP unit busy every cycle. So at first I thought unrolling that way would be necessary.) Partially peel the first / last iterations, so the code inside the loop can start with the multd. – Peter Cordes Dec 14 '22 at 04:58
  • @PeterCordes could you please share both versions of the code now? I am so confused; I have spent so much time on this, unrolling, getting stuck, throwing the paper away, and starting over again. – 0xMe Dec 14 '22 at 05:03
  • Because once I reorder them after unrolling two times and use different registers, I am still not able to put 3 instructions between every pair of dependent FP ALU operations so that the result is ready for the next op. And even where I did, I am still facing the 1-cycle delay of an FP op after a load. – 0xMe Dec 14 '22 at 05:05

1 Answer


Without unrolling or doing anything tricky, you can hide all but one stall cycle.

# $f0 = 0.0
# $t0 = end_pointer = $t1+size 
Loop: ld    $f6, 0($t1)
      ld    $f8, 0($t2)
      addiu $t1, $t1, 8         # fills load-delay slot

      multd $f6, $f6, $f8
      addiu $t2, $t2, 8         # multd in flight cycle 1

      bne  $t1, $t0, Loop       # multd in flight cycle 2
                             STALL  waiting for $f6
      addd $f0, $f0, $f6        # in the branch-delay slot

There are only 2 instructions between the multd and the addd, where we need 3, hence the stall. With software pipelining, we can avoid that stall without unrolling. You might also call this loop rotation, since the one-iteration window that appears in the source no longer starts at a load and ends at the addd of the same logical iteration. Loop rotation enables software pipelining (putting the loads in the shadow of the multd to set up for the next iteration).

  • Peel the 2 loads from the first iteration, so the top of the loop can start with multd
  • Peel the multd/addd that consume those loads from the last iteration, so you'll need an extra copy of those instructions after the loop.
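
In C terms, the rotation looks roughly like this (a sketch of the idea, not code from the original; function and variable names are mine). Note the loop bound stops one element early, because the last multiply+add is peeled out after the loop:

    #include <stddef.h>

    double dotprod_rotated(const double *p1, const double *p2, size_t n) {
        const double *stop = p1 + n - 1;   // stop one early: last mul+add peeled
        double a = *p1, b = *p2;           // peeled loads for the first iteration
        double sum = 0.0;
        while (p1 != stop) {               // assumes n >= 1
            double prod = a * b;           // this iteration's multiply
            a = p1[1]; b = p2[1];          // the *next* iteration's loads
            p1++; p2++;
            sum += prod;                   // this iteration's add
        }
        return sum + a * b;                // peeled final multiply + add
    }

The same structure in MIPS: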
      ld    $f6, 0($t1)    # partially peeled first iteration
      ld    $f8, 0($t2)
Loop:
      # this iteration's loads are already done
      multd  $f4, $f6, $f8   # into a different reg so loads won't overwrite it
      ld     $f6, 8($t1)
      ld     $f8, 8($t2)     # offset=8 for next iter before the pointer increment
      addiu  $t1, $t1, 8
      addiu  $t2, $t2, 8

      bne    $t1, $t0, Loop
      addd   $f0, $f0, $f4
# loop exits with the pointers at the last element; note $t0 must be end-8 for
# this rotated loop, since the final multd/addd are peeled out below

      multd  $f4, $f6, $f8
      addd   $f0, $f0, $f4

Having the multiply write to a temp register instead of overwriting one of the registers we load into is essentially avoiding a WAW hazard to make this manual reordering possible. Having lots of architectural registers enables plenty of software-pipelining even for longer dependency chains. Speaking of which:

  • 5 instructions between the multd and the addd that consumes its result.
  • 4 instructions between the last ld and the multd in the next iteration that consumes its result. Higher-clocked CPUs won't have single-cycle load-use latency even for L1d cache hits, so this is a good thing even where it isn't necessary in the pipeline you're tuning for. We could have put the loads after the pointer increments and used displacement 0, but that wouldn't save any code size, and a zero vs. non-zero 16-bit displacement presumably performs the same on most MIPS CPUs.
  • 6 instructions between the addd in one iteration and the addd in the next: the loop-carried dependency in this reduction.

On higher-performance CPUs, that loop-carried dependency would become a bottleneck and you'd need to unroll with multiple accumulators to hide it. e.g. Intel Haswell can run two FMAs (Fused Multiply-Adds) per clock with 5-cycle latency, so you need at least 10 accumulators (registers to sum into) to hide that latency. Or fewer in a dot product, since each FMA needs two loads and you can only do 2 loads per clock, so the loads become the bottleneck first. (Superscalar out-of-order exec of up to 4 front-end micro-ops per clock, with x86 able to fold an FMA + load into one uop for the pipeline, leaves room for the pointer increments. Very different from a 1-IPC MIPS pipeline, where even without unrolling you don't hit any inter-iteration bottlenecks.)
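
In C, the multiple-accumulator idea looks something like this (my sketch, not code from the question; two accumulators to match the MIPS unroll below, assuming n is a multiple of 2):

    #include <stddef.h>

    double dotprod_2acc(const double *a, const double *b, size_t n) {
        double sum0 = 0.0, sum1 = 0.0;     // two independent dependency chains
        for (size_t i = 0; i < n; i += 2) {
            sum0 += a[i]   * b[i];
            sum1 += a[i+1] * b[i+1];       // doesn't wait for sum0's chain
        }
        return sum0 + sum1;                // combine once, outside the loop
    }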


With unrolling, you don't need software-pipelining on this pipeline:

Loop: ld $f6, 0($t1)
      ld $f8, 0($t2)
      ld $f10, 8($t1)
      ld $f12, 8($t2)

      multd $f6, $f6, $f8           # 2 insn gap after loading $f8
      multd $f10, $f10, $f12        # 1 insn gap after loading $f12
      addiu $t1, $t1, 16
      addiu $t2, $t2, 16

      addd $f0, $f0, $f6            # 3 insn gap after multd into $f6
      bne  $t1, $t0, Loop           # 2 insn gap after pointer increment
      addd $f2, $f2, $f10           # 4 insn gap after multd into $f10

# after the loop: reduce the two accumulators to one
      addd  $f0, $f0, $f2       # runs once per dot product, not per element
                                # so a stall hardly matters if you can't schedule it after other post-loop work.

We could almost hide FP latency without multiple accumulators or software pipelining, to avoid that extra addd outside the loop, if we schedule to the limits of load latency. Since we're mixing things around, I indented differently for the instructions associated with the two different mul/add ops.

Loop:  ld $f6, 0($t1)
       ld $f8, 0($t2)
         ld $f10, 8($t1)
       multd $f6, $f6, $f8         # 1 insn gap after loading $f8
         ld $f12, 8($t2)
     addiu $t1, $t1, 16

         multd $f10, $f10, $f12       # 1 insn gap after loading $f12
       addd $f0, $f0, $f6          # 3 insn gap after multd into $f6
     addiu $t2, $t2, 16

     bne $t1, $t0, Loop
         addd $f0, $f0, $f10          # STALL: only 2 insn gap after addd into $f0 (needs 3)

So no, it turns out we can't hide all the latency this way, without multiple accumulators or software pipelining.

If you had to match the numeric results of purely serial code (like compiling without -ffast-math, strict-FP), you could software-pipeline this to hide that one stall cycle, rotating the loop so the loads are filling gaps after mults and adds.

But multiple accumulators are typically better numerically: with a single serial sum over uniformly distributed non-negative inputs, you keep adding same-sized small numbers to an ever-larger total, with worse and worse rounding error. See Simd matmul program gives different numerical results - different from serial `sum += a[i] * b[i]` doesn't mean worse; usually it's numerically more accurate. As in Is this way of processing the tail of an array with SSE overkill?, where the elements of SIMD vectors act as multiple accumulators, we see less error than with a simple serial sum.
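
As a toy illustration of that rounding effect (my example, not from the linked Q&As): summing 30 million 1.0f values in float, a single serial accumulator gets stuck once the running total reaches 2^24, because adding 1.0f no longer changes it; two accumulators each stay below 2^24, so every add is exact.

    #include <stdio.h>

    int main(void) {
        float serial = 0.0f, acc0 = 0.0f, acc1 = 0.0f;
        for (long i = 0; i < 30000000; i += 2) {
            serial += 1.0f;                 // one big running total
            serial += 1.0f;
            acc0 += 1.0f;                   // two half-size running totals
            acc1 += 1.0f;
        }
        printf("serial sum: %.1f\n", serial);       // 16777216.0 (2^24)
        printf("two accums: %.1f\n", acc0 + acc1);  // 30000000.0, exact
        return 0;
    }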

Using two accumulators costs only one extra addd outside the loop (beyond any checks for unrolling), vs. loop rotation peeling a whole iteration of the loop body (except pointer increments), part before the loop and part after.


I made some other changes to your loop for sanity:

  • use $t temporary registers; no need to use call-preserved $s regs for loop counters.
  • Increment pointers forward until reaching an end-pointer in another register. Like do{ p1++; p2++; } while(p1 != endp); instead of counting down until NULL pointer (address 0).
  • Treat pointers as unsigned; you don't want to trap on signed-overflow if one of the arrays happens to span the boundary between the low 2GiB and high 2GiB of memory. (MIPS add/sub instructions trap on signed overflow, addu / subu don't. They're the same binary operation on the data.)
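
Putting those changes together, here's roughly the C loop I'm scheduling (a sketch; assumes n >= 1, like the asm):

    #include <stddef.h>

    double dotprod(const double *p1, const double *p2, size_t n) {
        const double *end1 = p1 + n;       // one-past-end pointer, kept in $t0
        double sum = 0.0;
        do {
            sum += *p1 * *p2;
            p1++; p2++;                    // forward addiu: no overflow trap
        } while (p1 != end1);              // bne $t1, $t0, Loop
        return sum;
    }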