
I'm looking at some compiler output for a MIPS platform and struggling to understand how a function returns and what is allowable.

Here's a simple example:

int two_x_squared(int x)
{
    return 2*x*x;
}

If I compile it with Compiler Explorer I see

two_x_squared(int):
        sll     $2,$4,1
        mult    $2,$4
        mflo    $2
        j       $31
        nop

OK, no big deal here, I'm guessing j $31 jumps to the return address, and the nop might be something required to protect against speculative execution in the pipeline.

But then I compile with XC32 under -O2 and I get

two_x_squared:
    mul $4,$4,$4
    j   $31
    sll $2,$4,1

So... the line after the j $31 gets executed after the jump?!

Jason S
  • [Why does this load instruction come after a jump?](https://stackoverflow.com/q/53715539/995714), [What is the point of delay slots?](https://stackoverflow.com/q/15375084/995714) – phuclv Dec 02 '19 at 06:10
  • Yes, `jr $ra` is how MIPS returns; indirect jump to the link register. GCC/clang just use the same `j` mnemonic for direct and indirect jumps, apparently, not `jr`. – Peter Cordes Dec 03 '19 at 05:29

1 Answer


This is called the branch delay slot. Yes, the instruction immediately after a branch or jump executes before control actually transfers, so the branch effectively takes effect one instruction later than you would expect. The compiler is supposed to fill the delay slot with something useful, either by moving an instruction that logically comes before the branch into the slot (as XC32 does with the `sll` here), or by moving in something that would happen after the branch when that is safe.
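To make the ordering concrete, here is a minimal sketch in C of a toy interpreter that runs the three-instruction XC32 listing from the question. Everything in it is hypothetical and only for illustration: `run`, `two_x_squared_sim`, the `struct insn` encoding, and the `delay_slot` flag are made-up names, and the model covers nothing of real MIPS beyond these three instructions and a one-instruction delay slot.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: three operations, no memory, no real instruction encoding. */
enum op { OP_MUL, OP_SLL, OP_JR };

struct insn {
    enum op op;
    int rd;   /* destination register */
    int rs;   /* first source register */
    int rt;   /* second source register (OP_MUL) or shift amount (OP_SLL) */
};

/* Executes code[0..n-1] with arg in $4 and returns whatever ends up in $2.
   delay_slot != 0: the instruction after a jump still executes (classic MIPS).
   delay_slot == 0: the jump takes effect immediately (hypothetical variant). */
int32_t run(const struct insn *code, int n, int32_t arg, int delay_slot)
{
    uint32_t reg[32] = {0};   /* unsigned so wraparound and shifts are well defined */
    int jump_pending = 0;

    assert(n > 0);
    reg[4] = (uint32_t)arg;
    for (int pc = 0; pc < n; pc++) {
        const struct insn *i = &code[pc];
        int in_slot = jump_pending;   /* true if the previous instruction was a jump */

        switch (i->op) {
        case OP_MUL: reg[i->rd] = reg[i->rs] * reg[i->rt]; break;
        case OP_SLL: reg[i->rd] = reg[i->rs] << i->rt;     break;
        case OP_JR:
            if (!delay_slot)
                return (int32_t)reg[2];   /* leave at once, slot never runs */
            jump_pending = 1;             /* leave after the next instruction */
            break;
        }
        if (in_slot)
            break;   /* delay-slot instruction has run; the jump now takes effect */
    }
    return (int32_t)reg[2];
}

/* The XC32 -O2 listing from the question, transcribed into the toy encoding. */
int32_t two_x_squared_sim(int32_t x, int delay_slot)
{
    static const struct insn xc32[] = {
        { OP_MUL, 4, 4, 4 },   /* mul $4,$4,$4               */
        { OP_JR,  0, 31, 0 },  /* j   $31                    */
        { OP_SLL, 2, 4, 1 },   /* sll $2,$4,1   (delay slot) */
    };
    return run(xc32, 3, x, delay_slot);
}
```

Under those assumptions, `two_x_squared_sim(3, 1)` gives 18, because the `sll` in the slot still runs before the return completes. `two_x_squared_sim(3, 0)` gives 0 in this toy model, because `$2` is never written. That is why code scheduled for delay slots cannot simply be pasted into a simulator that does not model them.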

This was introduced into the original MIPS architecture (as well as HP PA-RISC and others) to help with pipelined processors, which otherwise have to drain and refill the pipeline on taken branches, wasting instruction cycles.

The feature has been removed in later MIPS processors as well as in the follow-on open-source RISC-V architecture. More modern hardware uses other approaches to mitigate the wasted cycles associated with pipeline refill, including branch prediction, some out-of-order execution, speculation, and executing branches earlier in the pipeline.

Erik Eidt
  • OK, how come XC32 takes advantage of it, but gcc 5.4 for MIPS does not? – Jason S Dec 01 '19 at 22:27
  • Sorry, I don't know! – Erik Eidt Dec 01 '19 at 22:28
  • @JasonS: GCC does in general; it's just a missed optimization (I think) in that one case. e.g. https://godbolt.org/z/MjRD96 if you use `gcc -march=mips32` so it can use the `mul` instruction instead of legacy `mult`, it will put that in the delay slot. You can also use `-fno-delayed-branch` to tell it to always fill delay slots with NOP, e.g. so you can easily copy the code into MARS or something which by default simulates a MIPS variant without delay slots. – Peter Cordes Dec 02 '19 at 01:57
  • @JasonS: clang targeting `-march=mips3` will use the branch delay slot; it does the shift last. IDK if there's some MIPS rule that says you can't safely put `mflo` in a branch delay slot. Probably not; clang does so for `-target mips -march=mips2` https://godbolt.org/z/t5eRkJ. GCC still doesn't, maybe GCC just doesn't know how to split up the "canned sequence" of `mult` and `mflo`. (I had to use `mips2` because mips3 introduced 64-bit register width and `clang -target mips` doesn't know a 32-bit ABI or something? With `-target mips64` we get a shift by zero to redo sign-extension) – Peter Cordes Dec 02 '19 at 02:11
  • The Wikipedia Delay Slot article is not great; Raymond Chen's articles about it are better: [1](https://devblogs.microsoft.com/oldnewthing/20180411-00/?p=98485), [2](https://devblogs.microsoft.com/oldnewthing/20180412-00/?p=98495). More detail, MIPS examples instead of a DSP with 2 slots, and better discussion of the pipeline reasons. Probably more helpful for someone who hasn't heard of the concept. – Peter Cordes Oct 11 '21 at 01:27