How do I calculate clock cycles on mips assembly programming?

Question

I've searched up everywhere, and ive gathered pipelines or something. I've checked other programs and it seems like theres a single-cycle & multicycle: Clock cycles of Single-cycle and multi-cycle MIPS

How do I tell the difference for what cycle.

like for example, how many clock cycles would this be:

li $a0, 0 # $a0 = 0
 li $t0, 10 # Initialize loop counter to 10
Loop:
 add $a0, $a0, $t0
 addi $t0, $t0, -1 # Decrement loop counter
 bgt $t0, $zero, Loop # If ($t0 > 0) Branch to loop

My professor gave me this for an assignment: "Assume that loading and storing values in memory cost 100 cycles plus the cost of the instruction itself."

From what I've read, loading is 5 cycles, so why is my professor saying its 100 cycles. Can I make it whatever number I want? I'm very confused.

real memory is very slow relative to the processor. 100 cycles seems fast. but just say that is what it is. pipelines give the illusion of one clock per cycle so add one for each then decide what your branch penalty is. of course does fetching count as loading from memory? — old_timer, Mar 19 '19 at 06:04
counting cycles is a fantasy with this class of processors, there are some mcus you can count clocks all day long and be exactly accurate, an 8051 probably, a PIC definitely, 6502, etc. but an arm based mcu, sorry. anything with a pipeline, nope, not really, anything with dram and a cache, nope sorry, dont bother trying. So the reality is this is a bogus question with no real answer, but at the same time your professor has taught something with a fixed answer so you have to play the school game and answer the way the prof wants, then move on to reality after you pass the class. — old_timer, Mar 19 '19 at 06:06
sine we are not in your professors class and didnt see the lectures then how are we supposed to help here? there is the textbook pipeline stuff with for example the addi result to $t0 has to complete before the bgt can test it so does the pipe have to stall a cycle or two? on the first pass in does the li of $t0 cause the add $a0 to have to stall a cycle or two? This is based on the text book as real mips pipes very likely dont resemble the textbook, nor do other real processors pipes, arm, x86, etc...not in your class dont know what textbook you are using. — old_timer, Mar 19 '19 at 06:10
@old_timer our textbook is: MIPS Assembly Language Programming, by Robert L. Britton, Prentice Hall, Inc., 2004. But I'm not sure what you mean by the pipes that you mention... — Jordles, Mar 19 '19 at 06:20
if you are not pipelined then how many clocks depends on the design, each instruction can take as many as it wants based on the implementation. there is ideally some minimum sure, but even there there are design shortcuts if you will. — old_timer, Mar 19 '19 at 17:43

score 3 · Answer 1 · answered Mar 19 '19 at 04:58

This question doesn't make sense.

The standard multi-cycle RISC pipeline, as used in most educational MIPS implementations, is fundamentally based around the requirement that program and data memory can both be accessed simultaneously in a single cycle. To "assume that loading and storing values in memory cost 100 cycles" would require a completely different architecture.

score 2 · Answer 2 · answered Mar 19 '19 at 05:58

We have to distinguish between two cases:

Case 1: MIPS simulators

From what I've read, loading is 5 cycles, so why is my professor saying its 100 cycles.

You are not working with a real CPU but only with a simulated one. So arguing how many cycles your program "really" needs on the simulated CPU won't make sense.

Maybe the simulator simulates 5 cycles for each memory access. However, another simulator may simulate 10 cycles or only 1 cycle for a memory access.

This means that you'll have to say which simulator is used when talking about the number of simulated cycles. And your professor says that a simulator simulating 100 cycles shall be assumed.

Case 2: Real MIPS CPUs

Can I make it whatever number I want?

In this case you'll have to check the manual of the CPU to get the real number of cycles the CPUs need.

However, the instruction set of real MIPS-type CPUs is not 100% identical to the one of "MIPS" emulators. In your program the instruction bgt would work differently.

This means we also cannot argue how many cycles your program would need on a real MIPS CPU because we had to modify it before we can run it on a real MIPS CPU - and this would possibly change the number of cycles needed.

If you want to know if the number 100 is plausible when using a real CPU:

As we know from the "Spectre" and "Meltdown" security breaches, the time required to read memory on real CPUs is massively depending on the state of the CPU caches. If we assume that a0 points to a memory-mapped peripheral (which is never cached), 100 cycles may be plausible.

I am using MARS MIPS simulator if that helps in identifying how many cycles. — Jordles, Mar 19 '19 at 06:29
@Jordles The point is: It does not matter which simulator **you** actually use. The professor wants you to calculate the cycles based on a simulator that requires 100 cycles per memory access. — Martin Rosenau, Mar 19 '19 at 07:08

score 0 · Answer 3 · answered Mar 19 '19 at 14:26

"Assume that loading and storing values in memory cost 100 cycles plus the cost of the instruction itself."

That makes little sense. Unless you're supposed to assume instruction fetch is that slow as well, this would be like having an instruction cache but no data cache.

(Or running your program with data memory mapped to an uncacheable memory region.)

As @Martin points out, you can make up any numbers you want and simulate accordingly, even if there's no plausible engineering reason that a CPU would be built that way.

If you're trying to model the performance of an emulator like MARS itself on the host CPU, it's still not particularly plausible that loads/stores would be that expensive either. Interpreter costs per instruction vary wildly with how well branch prediction works on the host CPU, and emulating the guest memory is a small part of what an interpreting emulator does.

Loads on modern x86 typically have 5 cycle latency for an L1d cache hit (from address being ready to data being ready), but they also have 2-per-clock throughput. Thus, even without any cache misses, up to 10 loads can be in flight at once just in two pipelined load execution units of an Intel Sandybridge-family CPU, or AMD K8 / Bulldozer / Zen. (With load buffers to track cache-miss loads, far more loads can be in flight at once in the whole out-of-order back-end.)

You can't say that a load "costs" 5 cycles on such a CPU unless you're talking about a specific context of surrounding code, e.g. traversing a linked list where you bottleneck on load latency because the address for the next load depends on the result of the current load.

In traversing an array, you typically generate the next pointer with an add instruction (or MIPS addui), which has 1 cycle latency. Even if loads had 5 cycle latency on a simple in-order pipeline, unrolling + software pipelining will let you sustain 1 load per clock cycle.

On a pipelined CPU, performance is not one-dimensional, you can't just put cost numbers on instructions and add them up. You can see this even for ALU instructions on classic in-order MIPS with a multiply instruction: if you don't use mflo / mhi right away, the multiply latency is hidden by whatever instructions you fill that gap with.

As @duskwuff point out, a classic RISC 5-stage pipeline (fetch/decode/exec/mem/write-back) like first-gen MIPS assumes cache hits have 1/clock memory throughput and latency for access to L1d itself. But the MEM stage leaves room for 2 cycle latency for loads (including address-generation in the EX stage).

And I guess they didn't need a store buffer either. More complex CPUs use a store buffer to decouple execution from stores that can potentially miss in L1d cache, hiding store latency even for cache misses. That's very nice, but not required.

Early CPUs often used simple direct-mapped virtually addressed caches, enabling such low cache latency without crippling max clock speed. But on a cache miss, the pipeline stalls. https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Cache_miss_handling.

More complex in-order CPUs can scoreboard loads instead of stalling if any of them miss in cache, and only stall if a later instruction tries to actually read a register that was last written by a load that hasn't completed yet. This allows hit-under-miss, and multiple outstanding cache misses to create memory parallelism, allowing multiple 100-cycle memory accesses to be in flight at once.

But fortunately for you, your loop doesn't include any memory accesses in the first place. It's pure ALU + branch.

On a real MIPS with branch-delay slots, you might write it like this:

 li   $t0, 10                        # loop counter
 li   $a0, 10                        # total = counter    # peel the first iteration

Loop:                                # do{
 addi $t0, $t0, -1
 bgtz $t0, Loop                      # } while($t0 > 0);
 add $a0, $a0, $t0                   # branch-delay: always executed for taken or not

This is still just 10+9+8+...+0 = (10*11)/2 which would be better done with a multiply instead of a loop. But that's not the point, we're analyzing the loop. We do the same number of additions, but we do a += 0 at the end instead of 0 + 10 at the start.

Notice I used the real MIPS bgtz instruction, not the bgt pseudo-instruction with $zero. Hopefully an assembler would choose that for the special case of $zero, but it might just follow the normal pattern of using slt $at, $zero, $t0 / bne $at, $zero, target.

Classic MIPS doesn't do branch prediction + speculative execution (it has a branch-delay slot to hide the bubble from the control dependency instead). But for this to work, it needs the branch input ready in the ID stage, so reading the result of the previous add (which produces a result at the end of EX) will cause a 1 cycle stall. (Or worse depending on whether forwarding to the ID stage is supported. https://courses.engr.illinois.edu/cs232/sp2009/exams/final/2002-spring-final-sol.pdf question 2 Part (a) has an example like this, but I think they under-count the stall cycles if you need to wait for add WB to complete before bne/bgtz ID can start.)

So anyway, this should run at best 1 iteration per 4 cycles on a scalar in-order MIPS I that can forward from EX to ID. 3 instructions + 1 stall cycle before each bgtz.

We can optimize it by putting the add $a0, $a0, $t0 between the loop counter and the branch, filling that stall cycle with useful work.

 li    $a0, 10                        # total = counter    # peel the first iteration
 li    $t0, 10-1                      # loop counter, with first -- peeled

Loop:                                 # do{
                                          # counter--   (in the branch delay slot and peeled before first iteration)
 add   $a0, $a0, $t0                      # total += counter
 bgtz  $t0, Loop                      # } while($t0 > 0);
 addi  $t0, $t0, -1

This runs at 3 cycles / iteration, for 3 instructions with no stall cycles (again assuming forwarding from EX to ID).

Putting the counter-- in the branch delay slot puts it as far before the next execution of the loop branch as possible. A simple bne instead of bgtz would work, too; we know the loop counter starts out signed positive and decreases by 1 each iteration, so it's not critical that we keep checking for non-negative as well as non-zero.

I have no idea what performance model you're using. If it's not a classic 5-stage MIPS then the above is irrelevant.

How do I calculate clock cycles on mips assembly programming?

3 Answers3

Linked