MARS apparently has an assembler bug on `b` instructions when the label name is also a valid instruction mnemonic. It ends up as a branch-to-next instruction.
Changing the label name from `AND` to `AND_TOP` made it assemble correctly. Eraklon found that using a `j` instruction also worked. (The `:` instead of operands disambiguates the token as being a label instead of an instruction, so this is a bug in MARS, not a bug in your code. clang assembles your source code just fine. Not that you could run it outside of MARS; it depends on MARS system calls and no-delay-slot branches.)
I tested in MARS 4.5 under Java OpenJDK 1.8.0_232 on Arch GNU/Linux and reproduced @Eraklon's result. (But the reasoning in that answer is wrong.)
Both `b AND` instructions assemble to `0x043fFFFF` (with listed instruction `bgez $0, AND`), and MARS runs them as if the branch target were the instruction right after the branch (what would be the delay slot on real MIPS). How to Calculate Jump Target Address and Branch Target Address? shows that MIPS relative-branch instructions are I-type: the 16-bit immediate (the low 16 bits of the instruction word) is sign-extended and left-shifted by 2 (or look at it unshifted as an offset in words), then added to the address of the instruction after the branch, i.e. the delay slot. The low 16 bits here are `0xFFFF`, an offset of -1. Strictly by that rule, -1 encodes a branch back to the branch instruction itself (an offset of 0 is what reaches the next instruction), so the fall-through that MARS performs does not quite match the word it emitted.
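To sanity-check the offset arithmetic, here is a minimal Python sketch of the MIPS32 rule for I-type relative branches, where the sign-extended, shifted immediate is added to the address of the instruction following the branch. The function names and the `0x00400010` address are made up for illustration; they are not taken from MARS or from your program. Under that rule an offset of 0 reaches the next instruction and -1 lands on the branch itself, which is worth keeping in mind when reading the -1 in the mis-assembled word.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(branch_addr, instr_word):
    """Target address of a MIPS I-type relative branch located at branch_addr."""
    offset_words = sign_extend16(instr_word)  # low 16 bits, counted in words
    # The offset is relative to the instruction after the branch (branch_addr + 4).
    return (branch_addr + 4 + (offset_words << 2)) & 0xFFFFFFFF


pc = 0x00400010  # hypothetical address of the branch instruction

print(hex(branch_target(pc, 0x043FFFFF)))  # offset -1 -> 0x400010 (the branch itself)
print(hex(branch_target(pc, 0x0401FFFB)))  # offset -5 -> 0x400000
print(hex(branch_target(pc, 0x04010000)))  # offset  0 -> 0x400014 (next instruction)
```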
Also note that `0x043f` is not the right opcode and register encoding for `bgez`. It should be `0x0401` for `bgez $zero`. (http://www.mrc.uidaho.edu/mrc/people/jff/digital/MIPSir.html shows encodings in binary.) According to `llvm-objdump -d` (after assembling `.word 0x043fffff` with `clang -target mips`), that actually encodes `4 3f ff ff synci -1($1)`. I wonder if the MARS bug was ORing in a `-1` value that was wider than it should have been, overwriting some higher bits?
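The field split is easy to check by hand. The Python sketch below pulls the standard I-type fields (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate) out of both words; `decode_itype` is just an illustrative helper name. The last line shows that ORing a 22-bit all-ones value into a correct `bgez $0` word would produce the bad word, which fits the "too-wide -1" guess but does not prove that this is what MARS does internally.

```python
def decode_itype(word):
    """Split a 32-bit MIPS I-type word into (opcode, rs, rt, signed immediate)."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:          # sign-extend the 16-bit immediate
        imm -= 0x10000
    return opcode, rs, rt, imm


# Word MARS emitted for `b AND`: opcode 1 (REGIMM), rs=1, rt=31, imm=-1
print(decode_itype(0x043FFFFF))  # (1, 1, 31, -1)

# Word MARS emitted for `b AND_TOP`: opcode 1 (REGIMM), rs=0, rt=1 (bgez), imm=-5
print(decode_itype(0x0401FFFB))  # (1, 0, 1, -5)

# A 22-bit-wide all-ones value spills into the rt and rs fields:
print(hex(0x0401FFFF | 0x003FFFFF))  # 0x43fffff
```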
(`bgez $0` is one way to encode an unconditional relative branch in MIPS. Another way is `beq $0,$0, target`. There's no separate opcode; you just have to pick one of the I-type `b...` machine instructions with inputs that make its condition always true.)
MARS simulates this mis-assembled instruction as a branch-always to the next instruction: it effectively falls through the branch. Or with Settings->Delayed Branching enabled, the instruction after the branch executes twice: once as the delay slot, once as the branch target.
So apparently the simulator doesn't truly decode from machine code, even if you turn on settings->self modifying code. Or it's just the display that's broken? IDK, doesn't really matter, this is 100% a bug and instead of poking at the symptoms someone should just look at the MARS source and fix it.
Using a different label name gives correct results: the first `b AND_TOP` assembles to `0x0401fffb`, with MARS listing it as `bgez $0, 0xfffffffb`. (Note the numeric offset in the listing, instead of the `AND` label name.)
And it simulates correctly, with those branches going to the right place.
I didn't check your logic but it seems over-complicated. Note that `la $a0, ($t0)` is an insane way to write `move $a0, $t0`. Apparently MARS allows that. There was no reason to load from `counter` in the first place, though; you can zero a register with `addu $t0, $zero, $zero` or whatever else you want. Or write it as `move $t0, $zero`.
Also this is silly:
beqz $t0, display #If $t0 equals 0 send to display function
b AND_TOP #send back to AND function if not
display:
Just use `bnez $t0, AND_TOP` instead of conditionally jumping over an unconditional branch.
Also, neither comment adds anything to the understanding. If there's something to say about why you jump, or the semantic meaning (e.g. in terms of high-level variables, not register names), then put that in a comment. e.g. `bnez $t0, count_loop # more bits left to count?`
Of course, as @Eraklon points out, your whole branching logic is super over-complicated. Just isolate the low bit and add it to the count whether it's zero or one.
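The idea is easier to see in a high-level form. This is a minimal Python sketch, not a translation of your program; the function name is made up, and the comments name the MIPS instructions each step would roughly correspond to. The only branch left is the loop condition itself.

```python
def popcount_lowbit(x):
    """Count set bits in a 32-bit value by adding the low bit every iteration."""
    x &= 0xFFFFFFFF          # treat the input as a 32-bit register
    count = 0
    while x:                 # bnez: more bits left to count?
        count += x & 1       # andi to isolate the low bit, then addu (adds 0 or 1)
        x >>= 1              # srl by 1
    return count


print(popcount_lowbit(0b1011))      # 3
print(popcount_lowbit(0xFFFFFFFF))  # 32
```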
Or if you care about performance, mask the even and odd bits separately, right shift the odd ones by 1, and `addu`. Then you have 16x 2-bit accumulators packed into a register. Repeat with wider masks until you have bytes, then either keep going or use a multiply trick to get the bytes summed into the high byte. (See fast popcount Q&As here on Stack Overflow for bithack answers, or https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel. Your method is like https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive but more complicated. There are middle-ground options, e.g. clearing the lowest set bit and counting iterations to make it zero.)
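Here is a rough Python sketch of two of those options for a 32-bit value: the parallel mask-and-add version with the multiply trick, and the clear-the-lowest-set-bit loop. The function names are mine, and the explicit `& 0xFFFFFFFF` masks only exist because Python integers are unbounded; a 32-bit register truncates on its own.

```python
def popcount_parallel(x):
    """Bit-parallel popcount: pairs -> nibbles -> bytes, then sum the bytes."""
    x &= 0xFFFFFFFF
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555)  # 16 x 2-bit counts
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)  # 8 x 4-bit counts
    x = (x + (x >> 4)) & 0x0F0F0F0F                 # 4 x 8-bit counts
    # Multiply trick: the top byte of the low 32 bits ends up holding the sum
    # of all four byte counts.
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24


def popcount_clear_lowest(x):
    """Loop once per set bit by clearing the lowest set bit each iteration."""
    x &= 0xFFFFFFFF
    count = 0
    while x:
        x &= x - 1   # clears the lowest set bit
        count += 1
    return count


print(popcount_parallel(0xFFFFFFFF))      # 32
print(popcount_clear_lowest(0x80000001))  # 2
```

The loop version runs one iteration per set bit rather than one per bit position, so it is a reasonable middle ground when inputs tend to be sparse.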