
Having trouble figuring out how to find the runtime of an assembly program

I’ve got a project comparing the runtime of Java vs. assembly programs for mathematical algorithms like the Fibonacci sequence, the Euclidean algorithm, the factorial algorithm, and the binary exponentiation algorithm. I tried using SPIM and MARS, but they don’t have this feature.

Any suggestions? The due date is coming up.

Example Fibonacci code that calculates an estimated clock-cycle count and the Fibonacci sequence for a given user input:

.data

fibonacciLabel: .asciiz "Fibonacci sequence: "
enterLabel:     .asciiz "Enter a number to calculate its Fibonacci sequence: "
cycleLabel:     .asciiz "Execution time in cycles: "
space:          .asciiz " "     # Space character
newline:        .asciiz "\n"

.text
.globl main

main:
# Start measuring time - store the initial cycle counter value
mflo $t4
mfhi $t5

# Prompt the user to enter a number
li $v0, 4               # Print string syscall
la $a0, enterLabel      # Load address of the prompt message
syscall

# Read user input
li $v0, 5               # Read integer syscall
syscall
add $t0, $zero, $v0     # Save the user input in $t0

# Initialize variables
li $t1, 0               # First Fibonacci number (F(0))
li $t2, 1               # Second Fibonacci number (F(1))
li $t3, 2               # Counter for the Fibonacci sequence

# Print the Fibonacci label
li $v0, 4               # Print string syscall
la $a0, fibonacciLabel  # Load address of the Fibonacci label
syscall

# Print the first two Fibonacci numbers (F(0) and F(1))
move $a0, $t1           # Move the first Fibonacci number (F(0)) to $a0
li $v0, 1               # Print integer syscall
syscall

# Print a space between the numbers
li $v0, 4               # Print string syscall
la $a0, space           # Load address of space character
syscall

move $a0, $t2           # Move the second Fibonacci number (F(1)) to $a0
li $v0, 1               # Print integer syscall
syscall

# Calculate and print the rest of the Fibonacci sequence

loop:
# Check loop condition
bge $t3, $t0, endLoop   # If the counter is greater than or equal to the input, exit the loop

# Count the loop instruction (branch instruction)
addi $t9, $t9, 1

# Calculate the next Fibonacci number
add $t4, $t1, $t2       # Sum of the two preceding Fibonacci numbers

# Print a space between the numbers
li $v0, 4               # Print string syscall
la $a0, space           # Load address of space character
syscall

# Print the next Fibonacci number
move $a0, $t4           # Move the result to $a0
li $v0, 1               # Print integer syscall
syscall

# Store the result in $t2 for the next iteration
move $t1, $t2
move $t2, $t4

# Count the instruction inside the loop (addition and move)
addi $t9, $t9, 2

addi $t3, $t3, 1        # Increment the counter

# Count the instruction inside the loop (increment)
addi $t9, $t9, 1

j loop                  # Jump back to the beginning of the loop

endLoop:
# Print a newline
li $v0, 4               # Print string syscall
la $a0, newline         # Load address of newline
syscall

# End measuring time - store the final cycle counter value
mflo $t6
mfhi $t7

# Calculate the cycle count difference
sub $t6, $t6, $t4
sra $t7, $t7, 1         # Shift the high part to get the accurate difference
sub $t6, $t6, $t7

# Print the cycle count label
li $v0, 4               # Print string syscall
la $a0, cycleLabel      # Load address of the cycle count label
syscall

# Print the number of cycles
move $a0, $t9           # Move the cycle count to $a0
li $v0, 1               # Print integer syscall
syscall

# Print a newline
li $v0, 4               # Print string syscall
la $a0, newline         # Load address of newline
syscall

# Exit the program
li $v0, 10
syscall
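For reference, the Java side is timed with System.nanoTime around the computation. A simplified, self-contained sketch of how that looks (the actual program is longer; the hard-coded input and print statements here are placeholders):

```java
public class FibTiming {
    // Iterative Fibonacci: returns F(n) with F(0) = 0, F(1) = 1
    static long fib(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    public static void main(String[] args) {
        int n = 10;
        long start = System.nanoTime();
        long result = fib(n);
        long elapsed = System.nanoTime() - start;
        System.out.println("F(" + n + ") = " + result);
        System.out.println("Elapsed: " + elapsed + " ns");
    }
}
```

Note that timing a single call this short mostly measures timer overhead, so the measured values are very rough.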
  • Performance depends on the MIPS implementation you're running on. Are you looking for a cycle-accurate simulation of a classic MIPS R2000 or something (classic in-order 5-stage pipeline) vs. R10000 (4-way superscalar out-of-order exec), with some specific cache sizes? (I don't know of such a simulator, unless GEM5 can do that.) Of course that would give you time in cycles, and you'd need to divide by the clock frequency to get seconds. Also, to compare hand-written asm against Java, you'd have to run a JVM on the same MIPS CPU, so MARS/SPIM aren't options for that. – Peter Cordes Aug 13 '23 at 00:18
  • QEMU can emulate a full-system MIPS CPU, but it's not a cycle-accurate simulator. It runs as fast as possible on the host CPU, not trying to count guest cycles or simulate microarchitectural stuff like cache misses or stalls for `mflo` after a `mult` or `div` that wasn't ready yet. QEMU may even do dynamic translation to the host's machine-code format, for some hosts like x86 where QEMU has specialized code for some architectures. – Peter Cordes Aug 13 '23 at 00:22
  • Doesn't the fact that MARS & QtSpim don't have these give you a clue about the differences? Also, maybe compare with a full C environment, as assembly can access a C library when it is available. – Erik Eidt Aug 13 '23 at 05:20
  • @PeterCordes thanks for replying. I’ve decided to stop looking for software that calculates it and instead use a formula with reasonable placeholders. I know my processor’s clock rate is 2.30GHz, but that’s about it. I believe I can manually find the instruction count, but I’m not entirely sure I’d be doing it correctly. I don’t think the marker would dive too deep into the accuracy so long as it’s somewhat reasonable; plus, I don’t think the TA even understands assembly. – Wajdy Elsobky Aug 14 '23 at 00:48
  • Wait, you have a 2.3GHz MIPS processor? Or you're running a MIPS simulator on a 2.3GHz x86-64 host, and you actually want to know the speed of a MIPS simulator/emulator running hand-written asm vs. the speed of an optimized JVM JIT-compiling to native machine code, both on your native CPU? – Peter Cordes Aug 14 '23 at 00:50
  • I think SPIM can be run from the command line, so it exits when the MIPS program exits. Perhaps MARS can, too, but different emulators will have different speeds. (And Java startup overhead will hurt MARS, unless your MIPS program runs for a long time to amortize that startup overhead.) QEMU emulating Linux MIPS user-space should be fairly well optimized for speed. – Peter Cordes Aug 14 '23 at 00:52
  • Or if you do want to estimate performance for a real MIPS CPU, yes counting *dynamic* instructions (on the path of execution) is a good start if you can estimate the instructions-per-clock. With no stalls, it will be exactly 1.0 on a classic MIPS pipeline. And yes for simple programs like naive recursive Fibonacci, you can assume no cache misses, and with no mult or div instructions, so no stalls. – Peter Cordes Aug 14 '23 at 00:56
  • Real MIPS (with branch delay slots) fully hides branch latency on 5-stage pipelines like R2000, so there are no stalls or any need for branch prediction: [Why does MIPS use one delay slot instead of two?](https://stackoverflow.com/q/58422262) . (Often students are taught about a simplified pipeline without branch delay slots, which is not actual MIPS. In that case you either assume every branch costs an extra NOP, or you make some assumptions about a pipeline more like RISC-V with some branch prediction success rate, e.g. loop branches predict correctly except the final iteration.) – Peter Cordes Aug 14 '23 at 00:58
  • @ErikEidt, if I’m understanding you correctly, are you referring to inline assembly in c? – Wajdy Elsobky Aug 14 '23 at 01:20
  • @PeterCordes, sorry, what I meant by 2.30GHz is that I went to my MSI laptop’s settings, looked under performance, and saw this: “11th gen Intel(R) Core(TM) i7-11800H @ 2.30GHz”. Judging by what you’re saying, it sounds like I have to run the programs (Java and assembly) on the same platform, since different platforms will produce different results? For Java I just used System.nanoTime from start to end, made the difference my execution time, and ran it from my command prompt. The range of the outputs across all programs is 0.0023–1.18 ms. – Wajdy Elsobky Aug 14 '23 at 01:55
  • You can't run MIPS assembly directly on an x86-64, so yes, you can't directly compare native Java against MIPS assembly on your CPU. I just commented about that on your other question ([Very rough estimate for runtime of mips assembly program](https://stackoverflow.com/posts/comments/135559149)). – Peter Cordes Aug 14 '23 at 01:56
  • Also, your i7-11800H's max turbo frequency is 4.6GHz; I'd expect it to run at or near that for most single-threaded light-weight tasks like Fibonacci. The 2.3 GHz "sticker" frequency is the non-turbo frequency it can sustain at its rated power (40 watts I think) with all cores active running the most power-intensive workloads (like Prime95, doing SIMD floating-point FMAs.) i.e. the slowest it's ever limited to except when slowing down to save power, assuming adequate cooling. – Peter Cordes Aug 14 '23 at 02:00
  • @PeterCordes, could I use the formula execution time = IC*CPI/clock frequency, assuming CPI is 1 (no stalls or hazards)? Clock frequency being my laptop’s processor speed (4.6GHz***) and IC which I’d have to manually calculate? Would this be a decent approximation? I believe the execution time should be faster than Java’s (from what I’ve seen on the internet). - just read your other reply; seems like that’s usually not the case – Wajdy Elsobky Aug 14 '23 at 02:00
  • That would calculate time for a 2.3 GHz MIPS R2000, not your CPU. Your Tiger Lake CPU is capable of up to 7 instructions per clock (if 2 of them are branches that macro-fuse, for a total of 5 uops, the width of the front-end; https://agner.org/optimize/ / https://en.wikichip.org/wiki/intel/microarchitectures/willow_cove). It can't usually sustain even 5 IPC, but more than 1 is very common. Of course, those are x86 instructions, not MIPS. x86-64 has fewer registers so it sometimes needs extra loads and stores, but it can use memory source operands so it sometimes saves instructions that way. – Peter Cordes Aug 14 '23 at 02:04
  • @PeterCordes, so what do you suggest I do to complete this project? The professor requires us to use MIPS assembly. What would be my best bet to calculate the runtime given my resources? Or perhaps using hypothetical values for a MIPS processor. What should those values be? I’m gonna be upfront on the project that it’s a rough estimation. I ended up using ChatGPT to write the code because it obviously knows more than me, and it gave correct outputs. I then asked it to add code that estimates execution time in cycles (I still need to remove pseudo-instructions, which aren’t allowed). – Wajdy Elsobky Aug 14 '23 at 02:30
  • @PeterCordes for example, for the Fibonacci sequence the output was 12 (execution time in cycles) for fib of 5. Does that sound like a good estimation? Unfortunately, ChatGPT struggled to code that for the Euclidean algorithm and for a program that calculates a^b – Wajdy Elsobky Aug 14 '23 at 02:31
  • It's kind of a nonsensical assignment. If I had to do something, I'd probably run Linux MIPS inside QEMU, and run a MIPS JVM inside that vs. my hand-written MIPS assembly-language program. That's I think the closest you could come to an apples-to-apples comparison, since it involves Java suffering the same emulation overhead as your asm. Also, no, ChatGPT doesn't know anything about assembly; it often makes buggy code, and I certainly wouldn't expect it to be well-optimized. Anything it learns is by example, and most examples are from beginners and aren't super optimized. – Peter Cordes Aug 14 '23 at 02:34
  • 12 cycles? That's *way* too short an interval to time accurately in Java (or any language on your native CPU). I have no idea what kind of Fibonacci program you wrote; if it's an unrolled loop with `add $t0, $t0, $t1` / `add $t1, $t1, $t0`, plus some startup stuff (like I did [for x86 in another Q&A](https://stackoverflow.com/questions/32659715/assembly-language-x86-how-to-create-a-loop-to-calculate-fibonacci-sequence/32661389#32661389)) then yeah 12 cycles is probably reasonable. It seems low if that was a recursive function, but recursive Fibonacci is terrible for performance. – Peter Cordes Aug 14 '23 at 02:37
  • @PeterCordes Is it possible to share the code with you somehow? It’s 100 lines long. – Wajdy Elsobky Aug 14 '23 at 02:38
  • You should be able to edit it into your question in a code block. If not, pastebin is an option. 100 lines sounds way too long unless most of them are comments. My x86 Fibonacci (iterative loop, not recursive) was 45 lines for the not-unrolled version and more than half are just comments. – Peter Cordes Aug 14 '23 at 02:41
  • @PeterCordes I’ve added it to the post. – Wajdy Elsobky Aug 14 '23 at 02:50
  • What MIPS simulator are you using where `hi:lo` hold a timestamp or cycle counter that updates on its own? Or is that code from ChatGPT, so that `mflo $t4` at the top of `main` is just ChatGPT hallucinating? The hi:lo registers are the outputs for `mult` and `div` instructions, and normally keep their values (except in some cases, [as Raymond Chen explains](https://devblogs.microsoft.com/oldnewthing/20180404-00/?p=98435)). – Peter Cordes Aug 14 '23 at 03:12
  • It seems weird to use multiple instructions inside the same loop to increment `$t9` multiple times. Just figure out which instructions you want to count and `add $t9, $t9, imm` by that amount, probably 5 since you have an `add` and two `move` pseudo-instructions for Fibonacci, plus loop overhead of a `bge` and another `add`. (And a wasted `j` because you didn't put the `b..` at the bottom of the loop, so actually 6 instructions if you weren't printing.) Of course you could just calculate how many iterations it will run from the input number, so `6 * n + startup overhead`. – Peter Cordes Aug 14 '23 at 03:15
  • Re: way too short to time: see [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) re: benchmarking in Java. A good benchmarking framework will hopefully use a repeat loop to run your code long enough that it can be meaningfully timed, and takes longer than timing overhead. – Peter Cordes Aug 14 '23 at 03:25
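Putting the comments together, a sketch of the static estimate using execution time = IC * CPI / clock frequency. The per-iteration count of 6 comes from counting the loop body above (the add, two moves, the counter increment, the bge, and the j); the startup count of 10 is a placeholder, CPI = 1 assumes a classic in-order 5-stage MIPS with no stalls, and the 2.3 GHz clock is hypothetical (it would model a 2.3 GHz MIPS, not my laptop):

```java
public class CycleEstimate {
    // Dynamic instruction count for the loop above:
    // 6 instructions per iteration, loop counter starts at 2.
    static long instructionCount(int n, long startup) {
        long perIteration = 6;
        long iterations = Math.max(0, n - 2);
        return perIteration * iterations + startup;
    }

    public static void main(String[] args) {
        int n = 10;
        long startup = 10;        // placeholder: instructions before/after the loop
        double cpi = 1.0;         // assumed: classic 5-stage MIPS, no stalls
        double clockHz = 2.3e9;   // hypothetical 2.3 GHz MIPS clock
        long ic = instructionCount(n, startup);
        double ns = ic * cpi / clockHz * 1e9;
        System.out.printf("IC = %d, estimated time = %.2f ns%n", ic, ns);
    }
}
```

The syscall-driven printing inside the loop is deliberately left out of the count, since its cost depends entirely on the simulator, not the modeled CPU.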
