Yes, a branch-delay slot offloads the responsibility of hiding branch latency to the compiler by making it architecturally visible, as you suspect. (See "Why does MIPS use one delay slot instead of two?": because first-gen MIPS managed to keep branch latency down to 1 cycle.)
That's the entire point, and it's also why a later CPU with branch prediction can't just get rid of the delay slot: existing binaries depend on it. So MIPS was saddled with it until MIPS32r6 broke backwards binary compatibility and reorganized the opcodes, introducing branches without delay slots.
As EOF mentioned in comments, a delay slot significantly complicates exception handling, because it's legal to put instructions that might fault into delay slots. To return from the exception, the CPU needs to know not only which instruction to re-run, but also the address to continue at after it, which might or might not be the next sequential instruction.
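To make the exception-return problem concrete, here is a minimal Python sketch of the bookkeeping classic MIPS uses as far as I know: when the faulting instruction sits in a delay slot, EPC points at the branch instead, and the BD bit in the Cause register is set, so returning re-runs the branch and then the slot. The function names `record_exception` and `faulting_address` are made up for illustration; this is a toy model, not a description of any particular core.

```python
def record_exception(fault_pc, in_delay_slot):
    """Return (epc, bd) for a fault at fault_pc (toy model of EPC / Cause.BD)."""
    if in_delay_slot:
        # Restarting at the slot alone would lose the pending branch,
        # so point EPC back at the branch and flag it with BD.
        return fault_pc - 4, True
    return fault_pc, False


def faulting_address(epc, bd):
    """What a handler does to recover the address of the instruction that faulted."""
    return epc + 4 if bd else epc
```

For a load at 0x104 in the delay slot of a branch at 0x100, `record_exception(0x104, True)` gives `(0x100, True)`, and the handler gets 0x104 back from `faulting_address(0x100, True)`.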
Why is the branch delay slot deprecated or obsolete?
The instruction in the delay slot does execute before the code at the branch target address, including its effects on memory or registers. If the delay-slot instruction reads $ra after jal writes it, then yes, the delay-slot instruction sees changes made by the branch itself.
But I think you're asking about the called function, which is a separate question; no, the return address will be the instruction after the delay slot, because the delay-slot instruction already ran right after the j or b itself, while the CPU was fetching the code from the target address; that's the whole point.
So when you're programming for a machine with a delay slot, you need to understand the order of instruction execution, and try to fill the delay slot with something more useful than a NOP. (Although if you only ever use NOP, then your code will run the same on a machine with or without a delay slot. e.g. in MARS if you click the checkbox to toggle between simulating a MIPS with a delay slot or a fake simplified MIPS without.)
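The execution order described above can be checked with a toy interpreter. This is only a sketch in Python, not real MIPS: the tuple encoding, the `run` and `make_program` helpers, and the three-register machine are all invented for illustration. The `delay_slot` flag plays the role of the MARS checkbox.

```python
def run(program, delay_slot=True, max_steps=100):
    """Execute a toy program; addresses are list index * 4. Returns (regs, trace)."""
    regs = {"ra": 0, "t0": 0, "t1": 0}
    trace = []          # addresses in the order they were executed
    pc = 0
    pending = None      # branch target waiting for the delay slot to finish
    for _ in range(max_steps):
        op, *args = program[pc // 4]
        trace.append(pc)
        # If the previous instruction was a branch, we are in its delay slot:
        # this instruction still executes, then control goes to the target.
        next_pc = pc + 4 if pending is None else pending
        pending = None
        if op == "halt":
            break
        elif op == "addi":
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
        elif op == "move":
            rd, rs = args
            regs[rd] = regs[rs]
        elif op in ("jal", "jr"):
            target = args[0] if op == "jal" else regs[args[0]]
            if op == "jal":
                # With a delay slot, the return address skips the slot.
                regs["ra"] = pc + (8 if delay_slot else 4)
            if delay_slot:
                pending = target
            else:
                next_pc = target
        # "nop" does nothing
        pc = next_pc
    return regs, trace


def make_program(slot):
    return [
        ("jal", 16),              # 0:  call the function at address 16
        slot,                     # 4:  delay slot of the jal
        ("halt",),                # 8:  return lands here on a delay-slot machine
        ("nop",),                 # 12: padding
        ("addi", "t0", "t1", 1),  # 16: callee: t0 = t1 + 1
        ("jr", "ra"),             # 20: return
        ("nop",),                 # 24: delay slot of the jr
    ]
```

With `("move", "t1", "ra")` in the slot and `delay_slot=True`, the trace is `[0, 4, 16, 20, 24, 8]`: the slot runs before the callee, it sees the `ra` value of 8 that the jal wrote, and the callee computes `t0 = 9`. With `("nop",)` in the slot, `t0` and `t1` end up the same with or without the delay slot; only the value left in `ra` differs (8 vs. 4).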
Classic MIPS assemblers apparently would try to fill delay slots for you, unless you used .set noreorder, as described in See MIPS Run. i.e. they would let your asm source look the way your question says you "expect" (or at least want), without branch delay slots, and would look for independent instructions that could be moved into the slot without affecting correctness.
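As a rough idea of what "look for independent instructions that can be moved" means, here is a minimal Python sketch of the simplest strategy: hoist an instruction from before the branch into the slot if it has no register dependency on anything it would be moved past. The `Instr` record, the `ins` helper and `fill_delay_slot` are hypothetical names, and the read/write sets are supplied by hand; I'm not claiming any real assembler works exactly like this. Memory and possible faults are ignored (you could model memory as one more pseudo-register).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset
    reads: frozenset
    is_branch: bool = False


def ins(text, writes=(), reads=(), is_branch=False):
    return Instr(text, frozenset(writes), frozenset(reads), is_branch)


NOP = ins("nop")


def independent(a, b):
    """True if reordering a and b can't change results (no RAW, WAR or WAW hazard)."""
    return not (a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def fill_delay_slot(block):
    """block = straight-line instructions, then a branch, then a nop in its slot."""
    *body, branch, slot = block
    if not branch.is_branch or slot != NOP:
        return list(block)
    # Scan backwards for an instruction that can sink past everything after it,
    # including the branch itself (e.g. it must not produce the branch condition).
    for i in range(len(body) - 1, -1, -1):
        cand = body[i]
        if all(independent(cand, later) for later in body[i + 1:] + [branch]):
            return body[:i] + body[i + 1:] + [branch, cand]
    return list(block)  # nothing safe to move: keep the nop


block = [
    ins("addu  $t0, $t1, $t2", writes={"t0"}, reads={"t1", "t2"}),
    ins("addiu $t3, $t3, 1", writes={"t3"}, reads={"t3"}),
    ins("slt   $t4, $t0, $t5", writes={"t4"}, reads={"t0", "t5"}),
    ins("bne   $t4, $zero, loop", reads={"t4"}, is_branch=True),
    NOP,
]
filled = fill_delay_slot(block)
```

Here the slt can't move because the bne reads $t4, but the addiu touches only $t3, so it ends up in the slot in place of the nop.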
(Compiler-generated code would use .set noreorder, so the compiler has to "expect" what's actually going to happen. Humans could use that, too, if they know to expect what the machine will actually do.)