
As the title says, I'm having trouble understanding the calling convention for the ARM architecture. In particular, I still don't understand what happens to the LR register when you call a subroutine.

I think the most obvious and safest way to handle LR when you enter a subroutine is to store it on the stack, but that behaviour doesn't appear in the documentation, so I came up with the following example.

I'll write it in C because I think it's easier to explain that way. Imagine you have only two functions:

void function_1(void){
   //some code here
}

void function_2(void){
   //some code here
   function_1();
   //some code here
}

As I said before, inside function_1 I would store the value of LR on the stack. But if you look closer, function_1 doesn't call any other subroutine, so that would be unnecessary.

Is it possible that an ARM compiler would decide not to store LR on the stack?

I read about the calling standard on this Infocenter web page.

Peter Cordes
Torgon
  • Related: [ARM Link and Frame pointer](https://stackoverflow.com/questions/15752188/arm-link-register-and-frame-pointer) – artless noise Apr 02 '20 at 21:33

1 Answer


The calling convention only defines what registers are call-preserved vs. call-clobbered, and where to find stack args.

It's 100% up to the function how it goes about making sure its return address is available somewhere when it's ready to return. The most trivial and efficient way to handle that is to just leave it in LR the whole time, in a leaf function. (A function that doesn't call others: it's a leaf in the call graph / tree).

Compilers in practice will usually just leave it in LR in leaf functions, even with optimization disabled. GCC, for example, sets up a frame pointer with optimization disabled, but still doesn't store/reload LR unless the function needs so many scratch registers that it wants to use LR as one of them.
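One way to see this for yourself is to compile a tiny leaf function and look at the generated assembly. The sketch below is only an illustration: `leaf_return_address` is a made-up name, and `__builtin_return_address` is a GCC/Clang extension rather than standard C. On 32-bit ARM one would expect a leaf like this to read the return address straight from LR, since nothing in it overwrites LR.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the original example.
 * __builtin_return_address(0) is a GCC/Clang extension that yields the
 * address the current function will return to. */
__attribute__((noinline))
void *leaf_return_address(void)
{
    /* No calls are made here, so this is a leaf function: nothing
     * clobbers the return address between entry and return. */
    return __builtin_return_address(0);
}
```

Compiling it for an ARM target with something like `gcc -O2 -S` (toolchain name and flags depend on your setup) shows whether the compiler emitted any push/pop of LR.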

Otherwise, in non-leaf functions, compilers will normally store it to the stack. But if they wanted to, they could, for example, save R4 to the stack and mov r4, lr, then restore LR and reload R4 when they're ready to return.
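To make the reason concrete, here is a toy model in C of what `bl`, `bx lr`, `push {lr}` and `pop {pc}` do to the return address. It's a sketch under loud assumptions: the `ToyCpu` struct, the `toy_*` helpers and the addresses 100/200/300 are all invented for illustration, and it models nothing about real ARM beyond "bl overwrites LR".

```c
#include <assert.h>
#include <stddef.h>

enum { STACK_WORDS = 8 };

/* Toy machine state: illustrative only, not a real ARM model. */
typedef struct {
    unsigned pc;                  /* address of the current instruction */
    unsigned lr;                  /* link register, written by bl */
    unsigned stack[STACK_WORDS];  /* tiny stack, grows upward for simplicity */
    size_t depth;                 /* number of words currently pushed */
} ToyCpu;

/* bl: remember the address after the call in LR, then jump. */
static void toy_bl(ToyCpu *cpu, unsigned target)
{
    cpu->lr = cpu->pc + 1;
    cpu->pc = target;
}

/* bx lr: return to whatever LR holds right now. */
static void toy_bx_lr(ToyCpu *cpu)
{
    cpu->pc = cpu->lr;
}

/* push {lr}: save LR on the stack and move to the next instruction. */
static void toy_push_lr(ToyCpu *cpu)
{
    assert(cpu->depth < STACK_WORDS);
    cpu->stack[cpu->depth++] = cpu->lr;
    cpu->pc += 1;
}

/* pop {pc}: return to the address that was saved on the stack. */
static void toy_pop_pc(ToyCpu *cpu)
{
    assert(cpu->depth > 0);
    cpu->pc = cpu->stack[--cpu->depth];
}

/* Run caller -> nonleaf -> leaf and report where nonleaf "returns" to. */
unsigned simulate_nonleaf_return(int save_lr)
{
    ToyCpu cpu = { .pc = 100 };   /* caller is executing at address 100 */

    toy_bl(&cpu, 200);            /* caller:  bl nonleaf, LR = 101 */
    if (save_lr)
        toy_push_lr(&cpu);        /* nonleaf: push {lr} */
    toy_bl(&cpu, 300);            /* nonleaf: bl leaf, LR is overwritten */
    toy_bx_lr(&cpu);              /* leaf:    bx lr, back inside nonleaf */
    if (save_lr)
        toy_pop_pc(&cpu);         /* nonleaf: pop {pc}, the saved 101 */
    else
        toy_bx_lr(&cpu);          /* nonleaf: bx lr with a stale LR */
    return cpu.pc;
}
```

With `save_lr` set, the model returns to 101, the instruction after the caller's `bl`. Without it, the non-leaf function "returns" to 201, an address inside itself, because the nested `bl` replaced the only copy of its return address. The leaf never needs to save anything, which is the point above.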

A non-reentrant / non-thread-safe function could in theory save its return address in static storage, if it wanted to.
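Why that only works for non-reentrant code can be shown with a plain C analogy. In this sketch an `int` token stands in for the return address, and both function names are made up for illustration. A single static slot is shared by every activation of the function, so a nested activation overwrites it.

```c
/* Hypothetical illustration: a plain int stands in for the return address. */
static int saved_in_static;

/* Keeps its "return address" in static storage: one slot shared by all
 * activations of the function. */
int return_via_static(int token, int depth)
{
    saved_in_static = token;
    if (depth > 0)
        return_via_static(token + 1, depth - 1);  /* nested activation overwrites the slot */
    return saved_in_static;
}

/* Keeps it in a local (stack or register): one slot per activation. */
int return_via_local(int token, int depth)
{
    int saved = token;
    if (depth > 0)
        return_via_local(token + 1, depth - 1);
    return saved;
}
```

With `depth` 0 both versions hand back the token they were given. As soon as the function re-enters itself, the static version hands back the inner activation's token instead of its own, which is the same failure a real return address in static storage would suffer under recursion or concurrent calls.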

Source and GCC 8.2 -O2 -mapcs-frame output from Godbolt, forcing it to generate an APCS (ARM Procedure Call Standard) stack frame even when it's not needed. (It looks like it has a similar effect to -fno-omit-frame-pointer; frame pointers are otherwise omitted by default when optimization is enabled.)

void function_1(void){
   //some code here
}
function_1:
    bx      lr     @ with or without -mapcs-frame
void unknown_func(void);   // not visible to the compiler; can't inline
void function_2(void){
   function_1();   // inlined, or IPA optimized as pure and not needing to be called.
   unknown_func();
   unknown_func(); // tailcall
}
function_2:              @@ Without -mapcs-frame
    push    {r4, lr}         @ save LR like you expected
    bl      unknown_func
    pop     {r4, lr}         @ around a call
    b       unknown_func     @ but then tailcall for the 2nd call.

Or with APCS:

    mov     ip, sp
    push    {fp, ip, lr, pc}
    sub     fp, ip, #4
    bl      unknown_func
    sub     sp, fp, #12
    ldm     sp, {fp, sp, lr}
    b       unknown_func
int func3(void){
    unknown_func();
    return 1;               // prevent tailcall
}
func3:           @@ Without -mapcs-frame
    push    {r4, lr}
    bl      unknown_func
    mov     r0, #1
    pop     {r4, pc}

Or with APCS:

func3:
    mov     ip, sp
    push    {fp, ip, lr, pc}
    sub     fp, ip, #4
    bl      unknown_func
    mov     r0, #1
    ldmfd   sp, {fp, sp, pc}

Since thumb interworking isn't needed (with the default compile options), GCC will pop the saved-LR into PC instead of just back into LR for bx lr.

Pushing R4 along with LR keeps the stack aligned by 8, which IIRC is the default.
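The arithmetic behind that is small enough to write down. This is a sketch assuming 4-byte core registers, an 8-byte alignment requirement, and a stack pointer that was already aligned before the push; `push_keeps_alignment` is a hypothetical helper name.

```c
#include <stdbool.h>

/* Each core register pushed in ARM state occupies 4 bytes. */
enum { REG_BYTES = 4, STACK_ALIGN = 8 };

/* Assuming SP was 8-byte aligned before the push, report whether it is
 * still aligned after pushing reg_count registers. */
bool push_keeps_alignment(unsigned reg_count)
{
    return (reg_count * REG_BYTES) % STACK_ALIGN == 0;
}
```

So `push {lr}` alone (one register, 4 bytes) would leave the stack misaligned, while `push {r4, lr}` (8 bytes) and the APCS `push {fp, ip, lr, pc}` (16 bytes) both keep it aligned.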

Peter Cordes
  • Since ARMv5, `pop` and `ldm` are interworking branches, so unless you compile for prehistoric ARM, the compiler can use `ldmfd` regardless. – EOF Mar 30 '20 at 18:21
  • @EOF: Ah thanks. I did just use `-O2` without `-march=` or `-mcpu=`, so GCC was in fact compiling for a prehistoric baseline. At least I thought GCC's baseline was old, but, `-mthumb-interwork` didn't change the code-gen. – Peter Cordes Mar 31 '20 at 04:08
  • As best as I can tell, mips-gcc will by default target a processor with load-delay slots (which were the first misfeature of MIPS to be fixed). I'm not surprised gcc's default ARM target is ridiculous. – EOF Mar 31 '20 at 04:57
  • @EOF: yeah, it's normal that GCC assumes a minimal baseline, making code that will definitely work if copied to any computer of the same ISA. The only time this isn't the case is for 32-bit x86, where GCC is usually configured with an i686 baseline, or sometimes i686 + SSE2. OTOH it's a pretty unfortunate choice for ARM because its baseline is too old to inline `atomic_fetch_add` and things like that! – Peter Cordes Mar 31 '20 at 05:01
  • Sorry for my long delay. I just wanted to thank you, Peter; now I understand the concepts more clearly. – Torgon Apr 02 '20 at 08:19