
I would like to call an assembly function from C. It is part of a basic example for calling conventions.

The function is basic:

int mult(int A, int B){
    return A*B;
}

According to the Procedure Call Standard for the ARM® Architecture the parameters A and B should be in registers r0 and r1 respectively for the function call. The return value should be in r0.

Essentially then I would expect the function to be:

        EXPORT mult
mult    MUL r0, r0, r1
        BX lr
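
On the C side, a minimal sketch of the caller might look like the following. The prototype is all the compiler needs in order to place A in r0 and B in r1 and to read the result back from r0. The C definition of mult is only a stand-in so the sketch builds on its own (an assumption; in the real project the assembly routine would take its place), and call_mult is a hypothetical helper name.

```c
/* Prototype the C caller sees. Under the AAPCS, A is passed in r0,
   B in r1, and the result comes back in r0. */
int mult(int A, int B);

/* Stand-in definition so this sketch links on its own (assumption);
   in the real project this would be the assembly routine exported
   as "mult". */
int mult(int A, int B)
{
    return A * B;
}

/* Hypothetical caller. With a separately assembled mult, this becomes
   a "bl mult" with the arguments already loaded into r0 and r1. */
int call_mult(int x, int y)
{
    return mult(x, y);
}
```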

With GCC 7.2.1 (none) and -O1 -mcpu=cortex-m4 -mabi=aapcs on Compiler Explorer, I get the following:

mult:
    mul     r0, r1, r0
    bx      lr

That is what I expected. However, if I disable optimizations (-O0), I get the following nonsense:

mult:
    push    {r7}
    sub     sp, sp, #12
    add     r7, sp, #0
    str     r0, [r7, #4]
    str     r1, [r7]
    ldr     r3, [r7, #4]
    ldr     r2, [r7]
    mul     r3, r2, r3
    mov     r0, r3
    adds    r7, r7, #12
    mov     sp, r7
    pop     {r7}
    bx      lr

This suggests, I think, that GCC is using r7 as a frame pointer and passing all of the parameters and the return value via the stack, which does not conform to the AAPCS.

Is this a bug in Compiler Explorer or GCC, or have I missed something in the AAPCS? Why would -O0 use a fundamentally different calling convention from the one specified in the AAPCS document?

Flip
  • Please study that code more carefully. It does read from r0 and r1 (`str`). Inside the function, it copies them to the stack before reading them back, but that has no effect on the calling convention. – Marc Glisse Jul 21 '19 at 12:40
  • @MarcGlisse Thanks. I see what you mean. – Flip Jul 21 '19 at 12:46
  • This is as expected, not a bug. It is a duplicate of at least one prior question, but I can't find it offhand. There is no reason to expect any two compilers, or two versions of the same compiler, to produce the same output. Command-line options and build-time choices for compilers built from source can change the output as well. For something simple like this, a fully optimized output using the same calling convention should be as you found, but in general the output is what it is. (Newer GCC can be worse than older ones, by the way.) – old_timer Jul 21 '19 at 17:14

3 Answers


Don't bother analyzing machine code compiled in debug mode, because it follows some very obscure sequences that allow step-by-step execution with breakpoints while keeping all global and local variables visible.

It is not only pointless but also confusing if what you want is to learn assembly.

Go for -O2 or even -O3 all the time.

Jake 'Alquimista' LEE

In my opinion, this is not due to debugging. -O0 takes out the optimization passes. As a result, the compiler sees neither that everything fits in registers nor that you don't call other functions. Hence it will always make a stack frame, with r7 as the frame pointer in Thumb-2 (Cortex-M4).

If you write a much busier function, you will see a stack frame even at -O3. See why compiler writers try to get rid of them? Not only do you have trouble understanding the output, it is also a horrible amount of code. The optimizer goes even further and would see that,

  mov r0, xx  # our call site, might also have to save r0-r3.
  mov r1, yy  # because mult might trash those.
  bl  mult
...
mult:
    mul     r0, r1, r0
    bx      lr

Can be replaced by,

mul  xx,yy,xx   # one instruction!
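
In C terms, that replacement is what inlining does. A minimal sketch, assuming GCC or a similar compiler with optimization enabled; mult_inline and scale are hypothetical names, and whether the call actually disappears depends on the compiler and its flags:

```c
/* static inline makes the body visible at every call site in this
   translation unit, so the compiler is free to substitute it. */
static inline int mult_inline(int A, int B)
{
    return A * B;
}

/* Hypothetical caller. With optimization enabled, a compiler will
   typically emit a single multiply here instead of a call and return. */
int scale(int xx, int yy)
{
    return mult_inline(xx, yy);
}
```

Compiling this with and without optimization on Compiler Explorer is a quick way to see the call overhead appear and disappear.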

It is quite common for call overhead to be as much as the actual function body. Other features, such as a macro or an inline keyword or attribute, can achieve similar effects. Compilers are really good at allocating registers and getting rid of mov instructions. Your brain (or at least mine) is better at mapping high-level problems to specific machine instructions such as clz or adc. This is especially true if the higher-level language doesn't have a way to denote what you want to do (use a carry, etc.).
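
Where the language itself has no way to say it, compiler extensions sometimes fill the gap. The sketch below assumes GCC or Clang, whose `__builtin_clz` and `__builtin_add_overflow` built-ins give the compiler a chance to pick a count-leading-zeros or add-with-carry instruction; leading_zeros, u64_halves and add_halves are hypothetical names, and the instructions actually emitted depend on the target and flags.

```c
#include <stdint.h>

/* Count leading zero bits. The result is undefined for x == 0,
   so the caller must rule that case out. */
static inline int leading_zeros(unsigned int x)
{
    return __builtin_clz(x);
}

/* A 64-bit value held as two 32-bit halves (illustrative type). */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_halves;

/* Add the low halves, then feed the carry into the high halves. */
static inline u64_halves add_halves(u64_halves a, u64_halves b)
{
    u64_halves r;
    /* Returns nonzero when the 32-bit sum wrapped, i.e. the carry out. */
    uint32_t carry = __builtin_add_overflow(a.lo, b.lo, &r.lo);
    r.hi = a.hi + b.hi + carry;
    return r;
}
```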

artless noise
  • `-fomit-frame-pointer` is on by default at `-O1` and higher, but not `-O0`. I think that's the main reason GCC sets up a frame pointer to access the variables it spills/reloads. – Peter Cordes Nov 24 '22 at 03:32
  • @PeterCordes That may be part of it, but not completely. https://godbolt.org/z/TfsWfEK3d – artless noise Nov 24 '22 at 12:24
  • Your example shows exactly what I meant, that with `-fomit-frame-pointer` it references stack slots relative to SP instead of setting up R7 as a frame pointer for that, in a function that doesn't use VLAs / alloca. (It still spills/reloads everything because [`-O0` always does that](https://stackoverflow.com/questions/53366394/why-does-clang-produce-inefficient-asm-with-o0-for-this-simple-floating-point) for consistent debugging.) – Peter Cordes Nov 24 '22 at 12:31
  • Yes, but the OP wanted no use of the stack at all. You are pedantically correct. I used the word frame and pointed out the use of R7. `-O0` also does not allocate to registers and uses the stack. So, you are right that the stack **frame** is due to `-O0`, but as the godbolt example points out, even when creating a frame is not required, `-O0` still uses the stack. Debuggers have no issue with code using a stack (and frame). That is the major point I am trying to make. And I have conflated stack use with a frame, as they are usually the same thing. – artless noise Nov 24 '22 at 18:53
  • I was just replying about the one sentence "*Hence it will always make a stack frame which is r7 in thumb2 (Cortex-m4).*" Setting up a frame pointer or not doesn't depend on whether GCC has anything to spill, just on `-fno-omit-frame-pointer`. https://godbolt.org/z/9eGhYxc6P shows `void empty(){}` still setting up and tearing down an R7 frame pointer. Everything else in your answer is accurate. (Re: consistent debugging: part of the reason for spilling everything is to support continuing at a different line, jumping around in the C abstract machine. Agreed an R7 frame pointer is orthogonal) – Peter Cordes Nov 25 '22 at 03:29

Thanks to Marc Glisse for pointing out the obvious.

What is happening is that GCC is:

  1. storing r0 (A) and r1 (B) on the stack, then
  2. reading the variables back from the stack into r3 and r2, then
  3. performing the multiply and storing the result in r3, then
  4. moving the result from r3 into the return register r0.
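
Those steps can be modelled roughly in C. This is only a sketch: mult_spilled is a hypothetical name, and the volatile locals are my stand-in for the stack slots GCC creates at -O0. The instruction comments refer to the -O0 listing in the question. The arguments still arrive in r0 and r1 and the result still leaves in r0, so the calling convention is untouched.

```c
/* Rough C model of the -O0 output: every value lives in a stack slot.
   volatile forces the stores and reloads even when optimizing. */
int mult_spilled(int A, int B)
{
    volatile int a_slot = A;   /* str r0, [r7, #4] */
    volatile int b_slot = B;   /* str r1, [r7]     */

    int a = a_slot;            /* ldr r3, [r7, #4] */
    int b = b_slot;            /* ldr r2, [r7]     */

    int product = a * b;       /* mul r3, r2, r3   */
    return product;            /* mov r0, r3       */
}
```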

This seems like it is actively trying to make things slower...

But it is still AAPCS.

My bad.

Thanks Marc

Edit:

As Jake 'Alquimista' LEE mentions, this might make sense for debugging: all of the function's values are available to the debugger on the stack.

Flip