The calling convention only defines which registers are call-preserved vs. call-clobbered, and where to find stack args.
It's 100% up to the function how it goes about making sure its return address is available somewhere when it's ready to return. The most trivial and efficient way to handle that is to just leave it in LR the whole time, in a leaf function. (A function that doesn't call others: it's a leaf in the call graph / tree).
Compilers in practice will usually just leave it in LR in leaf functions, even with optimization disabled. GCC for example sets up a frame pointer with optimization disabled, but still doesn't store/reload LR in a leaf function unless it needs so many scratch registers that it wants to use LR as one of them.
Otherwise, in non-leaf functions, normal compilers will typically store it to the stack, but if they wanted to they could for example save R4 to the stack and mov r4, lr, then restore LR from R4 and reload R4 from the stack when they're ready to return.
A non-reentrant / non-threadsafe function could in theory save its return address in static storage, if it wanted to.
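To see why the static-storage idea only works for non-reentrant code, here's a minimal C sketch. It assumes GCC or Clang (for `__builtin_return_address` and the `noinline` attribute), and the function names are made up for illustration. It only records the return address in a static slot; it doesn't actually return through it, which plain C can't express.

```c
#include <assert.h>
#include <stddef.h>

/* One static slot shared by every activation of the function. */
static void *saved_return_address = NULL;

/* Hypothetical non-reentrant function: it records where it was called
   from in static storage instead of in a per-call stack slot. */
__attribute__((noinline))
void *record_return_address(int depth)
{
    assert(depth >= 0);
    saved_return_address = __builtin_return_address(0);
    void *mine = saved_return_address;     /* copy taken before any nested call */

    if (depth > 0)
        record_return_address(depth - 1);  /* nested activation overwrites the slot */

    return mine;
}

/* Returns 1 if the static slot still holds the outer call's own return address. */
int slot_survived(int depth)
{
    void *mine = record_return_address(depth);
    return saved_return_address == mine;
}
```

With no nesting (`slot_survived(0)`) the slot still holds the right address afterwards. With one level of recursion (`slot_survived(1)`) the inner call has replaced it, so a function that really returned through that slot would go to the wrong place. The same thing happens if two threads run it at once.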
Source and GCC 8.2 -O2 -mapcs-frame output from Godbolt, forcing it to generate an APCS (ARM Procedure Call Standard) stack frame even when it's not needed. (It looks like it has a similar effect to -fno-omit-frame-pointer; the opposite setting, -fomit-frame-pointer, is what's on by default with optimization.)
void function_1(void){
//some code here
}
function_1:
bx lr @ with or without -mapcs-frame
void unknown_func(void); // not visible to the compiler; can't inline
void function_2(void){
function_1(); // inlined, or IPA optimized as pure and not needing to be called.
unknown_func(); // regular call (bl), so LR has to be saved around it
unknown_func(); // tailcall
}
function_2: @@ Without -mapcs-frame
push {r4, lr} @ save LR like you expected
bl unknown_func
pop {r4, lr} @ around a call
b unknown_func @ but then tailcall for the 2nd call.
Or with APCS:
mov ip, sp
push {fp, ip, lr, pc}
sub fp, ip, #4
bl unknown_func
sub sp, fp, #12
ldm sp, {fp, sp, lr}
b unknown_func
int func3(void){
unknown_func();
return 1; // prevent tailcall
}
func3: @@ Without -mapcs-frame
push {r4, lr}
bl unknown_func
mov r0, #1
pop {r4, pc}
Or with APCS:
func3:
mov ip, sp
push {fp, ip, lr, pc}
sub fp, ip, #4
bl unknown_func
mov r0, #1
ldmfd sp, {fp, sp, pc}
Since Thumb interworking isn't needed (with the default compile options), GCC will pop the saved LR straight into PC, instead of popping it back into LR and returning with bx lr.
Pushing R4 along with LR keeps the stack aligned by 8, which the AAPCS requires at public function boundaries. R4's value isn't otherwise needed here; it's just padding.
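The alignment arithmetic is simple enough to sketch in C. This is only a model of a full-descending stack with 4-byte registers; the helper names and the starting SP value are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of a full-descending stack: each pushed 32-bit register
   lowers SP by 4 bytes. Helper names are hypothetical. */
uint32_t sp_after_push(uint32_t sp, unsigned nregs)
{
    assert(nregs <= 16);   /* an ARM register list names at most 16 registers */
    return sp - 4u * nregs;
}

/* True when SP is a multiple of 8. */
bool is_aligned8(uint32_t sp)
{
    return (sp & 7u) == 0;
}
```

Starting from an 8-byte-aligned SP such as 0x20001000 (an arbitrary example value), `push {r4, lr}` moves SP down by 8 and keeps it aligned. Pushing LR alone would move it by only 4 and leave it misaligned for the callee.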