"Assume that loading and storing values in memory cost 100 cycles plus the cost of the instruction itself."
That makes little sense. Unless you're supposed to assume instruction fetch is that slow as well, this would be like having an instruction cache but no data cache.
(Or running your program with data memory mapped to an uncacheable memory region.)
As @Martin points out, you can make up any numbers you want and simulate accordingly, even if there's no plausible engineering reason that a CPU would be built that way.
If you're trying to model the performance of an emulator like MARS itself on the host CPU, it's still not particularly plausible that loads/stores would be that expensive either. Interpreter costs per instruction vary wildly with how well branch prediction works on the host CPU, and emulating the guest memory is a small part of what an interpreting emulator does.
Loads on modern x86 typically have 5 cycle latency for an L1d cache hit (from address being ready to data being ready), but they also have 2-per-clock throughput. Thus, even without any cache misses, up to 10 loads can be in flight at once just in two pipelined load execution units of an Intel Sandybridge-family CPU, or AMD K8 / Bulldozer / Zen. (With load buffers to track cache-miss loads, far more loads can be in flight at once in the whole out-of-order back-end.)
You can't say that a load "costs" 5 cycles on such a CPU unless you're talking about a specific context of surrounding code, e.g. traversing a linked list where you bottleneck on load latency because the address for the next load depends on the result of the current load.
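To make that concrete in MIPS terms (the rest of this answer uses MIPS anyway), here's a hypothetical pointer-chasing sketch; the node layout (next pointer at offset 0) and the NULL-terminated list are my assumptions, not from the question. Each load's address comes from the previous load's result, so the loop can't run faster than one full load latency per node, no matter how deeply pipelined the memory system is.

chase:                           # walk a NULL-terminated linked list; $t0 = head pointer on entry
    lw    $t0, 0($t0)            # p = p->next: the next address depends on this load's result
    bne   $t0, $zero, chase      # keep walking until p == NULL
    nop                          # branch-delay slot: nothing useful to put here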
In traversing an array, you typically generate the next pointer with an add instruction (or MIPS addiu), which has 1 cycle latency. Even if loads had 5 cycle latency on a simple in-order pipeline, unrolling + software pipelining would let you sustain 1 load per clock cycle.
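By contrast, here's a hypothetical array-summing sketch (my register assignments, not from the question: $a0 = word-array base, $a1 = element count assumed to be a positive multiple of 2, sum in $v0). The loads are independent of each other and of earlier loads, so they can stay pipelined:

    li    $v0, 0                 # sum = 0
sumloop:
    lw    $t0, 0($a0)            # these two loads don't depend on each other,
    lw    $t1, 4($a0)            # so the second can start before the first finishes
    addiu $a0, $a0, 8            # next addresses come from a 1-cycle add, not a load result
    addiu $a1, $a1, -2
    addu  $v0, $v0, $t0
    addu  $v0, $v0, $t1
    bgtz  $a1, sumloop
    nop                          # branch-delay slot (one of the adds could be hoisted here)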
On a pipelined CPU, performance is not one-dimensional; you can't just put cost numbers on instructions and add them up. You can see this even for ALU instructions on classic in-order MIPS with a multiply instruction: if you don't use mflo / mfhi right away, the multiply latency is hidden by whatever instructions you fill that gap with.
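For example (a hypothetical sketch, with the two addiu instructions standing in for whatever independent work you have available):

    mult  $a0, $a1               # start the multiply; the result lands in HI/LO a few cycles later
    addiu $t0, $t0, 1            # independent work fills the gap...
    addiu $t1, $t1, 4            # ...instead of the pipeline sitting idle
    mflo  $v0                    # read LO only now: little or no stall if the gap was long enough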
As @duskwuff points out, a classic RISC 5-stage pipeline (fetch/decode/exec/mem/write-back) like first-gen MIPS assumes cache hits, with 1-per-clock memory throughput and latency set by the L1d access itself. But the MEM stage leaves room for 2 cycle load latency (address generation happens in the EX stage, the actual access in MEM).
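That 2-cycle load latency is exactly why MIPS I has an architectural load-delay slot: the instruction right after a lw can't see its result. A hypothetical sketch (register use is made up):

    lw    $t0, 0($a0)            # result is ready after the MEM stage
    addiu $a0, $a0, 4            # load-delay slot: schedule something independent here
    addu  $v0, $v0, $t0          # now $t0 can be forwarded without a stall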
And I guess they didn't need a store buffer either. More complex CPUs use a store buffer to decouple execution from stores that can potentially miss in L1d cache, hiding store latency even for cache misses. That's very nice, but not required.
Early CPUs often used simple direct-mapped virtually addressed caches, enabling such low cache latency without crippling max clock speed. But on a cache miss, the pipeline stalls. https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Cache_miss_handling.
More complex in-order CPUs can scoreboard loads instead of stalling if any of them miss in cache, and only stall if a later instruction tries to actually read a register that was last written by a load that hasn't completed yet. This allows hit-under-miss, and multiple outstanding cache misses to create memory parallelism, allowing multiple 100-cycle memory accesses to be in flight at once.
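A hypothetical sketch of what that buys you (register use and addresses are made up): the three loads below are mutually independent, so a scoreboarded in-order CPU only stalls at the first instruction that actually reads a still-missing result, letting several 100-cycle misses overlap instead of paying for them one after another.

    lw    $t0, 0($a0)
    lw    $t1, 0($a1)            # doesn't read $t0, so it can issue even if the first load missed
    lw    $t2, 0($a2)
    addu  $v0, $t0, $t1          # first read of a load result: the only place a stall can happen
    addu  $v0, $v0, $t2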
But fortunately for you, your loop doesn't include any memory accesses in the first place. It's pure ALU + branch.
On a real MIPS with branch-delay slots, you might write it like this:
    li    $t0, 10            # loop counter
    li    $a0, 10            # total = counter   # peel the first iteration
Loop:                        # do{
    addi  $t0, $t0, -1
    bgtz  $t0, Loop          # } while($t0 > 0);
    add   $a0, $a0, $t0      # branch-delay slot: always executed, taken or not
This is still just 10+9+8+...+0 = (10*11)/2 = 55, which would be better done with a multiply instead of a loop. But that's not the point; we're analyzing the loop. We do the same number of additions, but we do a += 0 at the end instead of 0 + 10 at the start.
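If you did want the closed form instead of a loop, a sketch might look like this (10 hard-coded to match the example):

    li    $t0, 10                # n
    addiu $t1, $t0, 1            # n+1 = 11
    mult  $t0, $t1               # n*(n+1) = 110
    mflo  $a0
    sra   $a0, $a0, 1            # /2: total = 55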
Notice I used the real MIPS bgtz instruction, not the bgt pseudo-instruction with $zero. Hopefully an assembler would choose that for the special case of $zero, but it might just follow the normal pattern of using slt $at, $zero, $t0 / bne $at, $zero, target.
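To spell out what those two spellings look like (the expansion shown is the generic pattern an assembler might emit; I haven't checked a particular assembler's output):

    # bgt $t0, $zero, Loop      -- pseudo-instruction; may assemble to:
    slt   $at, $zero, $t0        # $at = ($zero < $t0) ? 1 : 0
    bne   $at, $zero, Loop       # taken if $t0 > 0
    # whereas bgtz $t0, Loop does the same test in a single real instruction.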
Classic MIPS doesn't do branch prediction + speculative execution (it has a branch-delay slot to hide the bubble from the control dependency instead). But for this to work, it needs the branch input ready in the ID stage, so reading the result of the previous add (which produces a result at the end of EX) will cause a 1 cycle stall. (Or worse, depending on whether forwarding to the ID stage is supported. https://courses.engr.illinois.edu/cs232/sp2009/exams/final/2002-spring-final-sol.pdf question 2 Part (a) has an example like this, but I think they under-count the stall cycles if you need to wait for add WB to complete before bne/bgtz ID can start.)
So anyway, this should run at best 1 iteration per 4 cycles on a scalar in-order MIPS I that can forward from EX to ID: 3 instructions + 1 stall cycle before each bgtz.
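A rough cycle-by-cycle picture of one iteration (my assumption of classic 5-stage timing, with forwarding from the EX/MEM latch into the branch's ID stage):

    # cycle:              1    2    3    4    5
    # addi $t0,$t0,-1     IF   ID   EX   MEM  WB
    # bgtz $t0, Loop           IF   **   ID   EX    (**: stall, $t0 only forwardable after addi's EX)
    # add  (delay slot)             IF   **   ID
    # addi (next iter)                        IF    -> the next iteration starts 4 cycles after this one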
We can optimize it by putting the add $a0, $a0, $t0 between the loop-counter decrement and the branch, filling that stall cycle with useful work.
    li    $a0, 10            # total = counter   # peel the first iteration
    li    $t0, 10-1          # loop counter, with first -- peeled
Loop:                        # do{
    # counter-- (in the branch delay slot and peeled before first iteration)
    add   $a0, $a0, $t0      # total += counter
    bgtz  $t0, Loop          # } while($t0 > 0);
    addi  $t0, $t0, -1
This runs at 3 cycles / iteration, for 3 instructions with no stall cycles (again assuming forwarding from EX to ID).
Putting the counter-- in the branch delay slot puts it as far before the next execution of the loop branch as possible. A simple bne instead of bgtz would work, too; we know the loop counter starts out signed positive and decreases by 1 each iteration, so it's not critical that we keep checking for non-negative as well as non-zero.
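For completeness, the bne version of the same loop (only the branch condition changes; it still sums 10+9+...+1+0 and falls out once the counter reaches 0):

    li    $a0, 10            # total = counter   # first iteration peeled
    li    $t0, 10-1          # loop counter, with first -- peeled
Loop:                        # do{
    add   $a0, $a0, $t0      # total += counter
    bne   $t0, $zero, Loop   # } while($t0 != 0);
    addi  $t0, $t0, -1       # branch-delay slot: counter--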
I have no idea what performance model you're using. If it's not a classic 5-stage MIPS then the above is irrelevant.