Pushing LR before BL instruction in ARM assembly

Question

I'm trying to get a better understanding of why you push LR before calling a BL instruction. I understand that the BL instruction will branch to another subroutine before restoring the PC to the instruction address following the BL call but why is the LR pushed before the BL is called? I've written the entire recursive code for factorial computation below to give context. a and b are both variables that are written in pseudo.

LDR   RO, a
PUSH  (LR)
BL    factorial
STR   R0, b
POP   (LR)

factorial: 
CMP   RO, #0
MOVEQ R0, #1
MOVEQ PC, LR
MOV   R3, R0
SUB   R0, R0, #1
PUSH  (R3, LR)
BL    factorial
MUL   R0, R3, R0
POP   (R3, LR)
MOV   PC, LR

I understand how this program is supposed to flow but I'm confused about what addresses are being stored in the stack. Clearly, you want the address of the "STR R0, b" instruction to be put on the stack after your first branch call but how is that being preserved on the stack if the LR is pushed prior to the BL call?

BL clobbers LR, so if you want to return to your own caller later, you need to save/restore LR. — Peter Cordes, Dec 05 '18 at 05:39
so you mean that the BL call would generate a new LR which would overwrite the previous example, right? I understand why they would push LR to the stack before the call in that case but the code I have edited in is stand-alone and does not refer to anything else. i hope this gives a little more context as to the source of my confusion — Philip Kwak, Dec 05 '18 at 05:55
Right. `BL` puts `PC` into `LR` and jumps, `MOV PC, LR` is the opposite. Does it make more sense to you if you remove the recursive part? — domen, Dec 05 '18 at 09:03
I'm fine with the recursive aspect of it but what I'm stuck on is the placement of the LR into the stack prior to the BL call. If the BL call puts the PC into the LR, would that change the value of the LR already in the stack? — Philip Kwak, Dec 05 '18 at 14:27
Imagine you're 100 levels deep in recursion. Sure, the next 99 `BX LR` returns will branch to the exact same place, and the pushes of `LR` to the stack seem pointless. But the 100th return has to leave your recursive function. This would not be possible if you allowed `LR` to be overwritten each time. — cooperised, Dec 05 '18 at 16:55
cooperised- I understand that now but wouldn't the LR being pushed before the BL call into the factorial not be the LR necessary to come back to the instruction right after the BL call into the recursive function? — Philip Kwak, Dec 07 '18 at 01:08

artless noise · Answer 1 · 2018-12-06T15:22:27.280

but why is the LR pushed before the BL is called?

Here you are seeing the cost of recursion. Recursion looks simple from a higher level coding stand point. State is stored by a compiler in a stack frame. There is only one LR register which is fine for leaf functions. However, if you have an extended call chain, "A calls B calls C calls D", then the return address of "A, B and C" must be stored when executing in "D" with the LR the return to "C". For recursion, "A, B, C, and D" are all the same.

See: ARM Link register and frame pointer for more.

I believe that it is instructive to see these extra instructions. Often a loop can be formed instead of recursion and the linear flow will execute much faster and with the same amount of variables and less code. The stack frames and manipulation are hidden to programmers in higher level languages.

It is also common that the frame is not needed due to 'tail recursion'. Really only the first call to factorial needs to save a return address and instead of bl a simple b will do.

score 4 · Answer 2 · answered Dec 05 '18 at 12:42

The link register LR is used to hold the address to which a function should return when it finishes executing. The BL instruction is essentially a 'call'; it calculates the address of the next instruction and inserts it into LR before branching. The corresponding BX LR (branch to the address held in the link register) is the 'return'.

If one function calls another, however, then before issuing a BL instruction it must save the existing value of LR somewhere, otherwise it will be overwritten and lost forever. Pushing it to the stack is the simplest way to do this.

Bear in mind that (almost) no code is actually 'stand-alone'. It's likely that any code you write is part of a function, even if it's main(), and so the link register must be preserved.

The most common pattern you'll see in compiled code is that the link register is pushed to the stack right at the top of the function, and only popped again right at the bottom. In addition it's often just popped straight into the program counter, which causes a branch without needing an explicit BX LR. So something like

.function
    ; Push working registers and LR
    PUSH {r4-r8,lr}
    ; (rest of the function goes here)
    ; Pop working registers and PC for an implicit return
    POP {r4-r8, pc}

would be typical.

score 1 · Answer 3 · edited Feb 27 '23 at 19:40

Since I have a simulator handy...

.thumb

.globl _start
_start:
.word 0x20001000
.word reset
.word hang
.word hang
.word hang
.word hang
.word hang
.word hang

.thumb_func
reset:
    mov r0,#5
    bl test
    b hang

.thumb_func
hang:
    swi 0xFF
    b hang

test:
    cmp r0,#0
    bne test1
    bx lr
test1:
    sub r0,#1
    push {r3,lr}
    bl test
    pop {r3,pc}

Build:

08000000 <_start>:
 8000000:   20001000    andcs   r1, r0, r0
 8000004:   08000021    stmdaeq r0, {r0, r5}
 8000008:   08000029    stmdaeq r0, {r0, r3, r5}
 800000c:   08000029    stmdaeq r0, {r0, r3, r5}
 8000010:   08000029    stmdaeq r0, {r0, r3, r5}
 8000014:   08000029    stmdaeq r0, {r0, r3, r5}
 8000018:   08000029    stmdaeq r0, {r0, r3, r5}
 800001c:   08000029    stmdaeq r0, {r0, r3, r5}

08000020 <reset>:
 8000020:   2005        movs    r0, #5
 8000022:   f000 f803   bl  800002c <test>
 8000026:   e7ff        b.n 8000028 <hang>

08000028 <hang>:
 8000028:   dfff        svc 255 ; 0xff
 800002a:   e7fd        b.n 8000028 <hang>

0800002c <test>:
 800002c:   2800        cmp r0, #0
 800002e:   d100        bne.n   8000032 <test1>
 8000030:   4770        bx  lr

08000032 <test1>:
 8000032:   3801        subs    r0, #1
 8000034:   b508        push    {r3, lr}
 8000036:   f7ff fff9   bl  800002c <test>
 800003a:   bd08        pop {r3, pc}

And run it showing disassembly in execution order and memory accesses:

--- 0x08000020: 0x2005 movs r0,#0x05
--- 0x08000022: 0xF000 
--- 0x08000024: 0xF803 bl 0x0800002B
--- 0x0800002C: 0x2800 cmp r0,#0x00
--- 0x0800002E: 0xD100 bne 0x08000031
--- 0x08000032: 0x3801 subs r0,#0x01
--- 0x08000034: 0xB508 push {r3,lr}
write16(0x20000FF8,0x0000)
write16(0x20000FFA,0x0000)
write16(0x20000FFC,0x0027)
write16(0x20000FFE,0x0800)
--- 0x08000036: 0xF7FF 
--- 0x08000038: 0xFFF9 bl 0x0800002B
--- 0x0800002C: 0x2800 cmp r0,#0x00
--- 0x0800002E: 0xD100 bne 0x08000031
--- 0x08000032: 0x3801 subs r0,#0x01
--- 0x08000034: 0xB508 push {r3,lr}
write16(0x20000FF0,0x0000)
write16(0x20000FF2,0x0000)
write16(0x20000FF4,0x003B)
write16(0x20000FF6,0x0800)
--- 0x08000036: 0xF7FF 
--- 0x08000038: 0xFFF9 bl 0x0800002B
--- 0x0800002C: 0x2800 cmp r0,#0x00
--- 0x0800002E: 0xD100 bne 0x08000031
--- 0x08000032: 0x3801 subs r0,#0x01
--- 0x08000034: 0xB508 push {r3,lr}
write16(0x20000FE8,0x0000)
write16(0x20000FEA,0x0000)
write16(0x20000FEC,0x003B)
write16(0x20000FEE,0x0800)
--- 0x08000036: 0xF7FF 
--- 0x08000038: 0xFFF9 bl 0x0800002B
--- 0x0800002C: 0x2800 cmp r0,#0x00
--- 0x0800002E: 0xD100 bne 0x08000031
--- 0x08000032: 0x3801 subs r0,#0x01
--- 0x08000034: 0xB508 push {r3,lr}
write16(0x20000FE0,0x0000)
write16(0x20000FE2,0x0000)
write16(0x20000FE4,0x003B)
write16(0x20000FE6,0x0800)
--- 0x08000036: 0xF7FF 
--- 0x08000038: 0xFFF9 bl 0x0800002B
--- 0x0800002C: 0x2800 cmp r0,#0x00
--- 0x0800002E: 0xD100 bne 0x08000031
--- 0x08000032: 0x3801 subs r0,#0x01
--- 0x08000034: 0xB508 push {r3,lr}
write16(0x20000FD8,0x0000)
write16(0x20000FDA,0x0000)
write16(0x20000FDC,0x003B)
write16(0x20000FDE,0x0800)
--- 0x08000036: 0xF7FF 
--- 0x08000038: 0xFFF9 bl 0x0800002B
--- 0x0800002C: 0x2800 cmp r0,#0x00
--- 0x0800002E: 0xD100 bne 0x08000031
--- 0x08000030: 0x4770 bx r14
--- 0x0800003A: 0xBD08 pop {r3,pc}
read16(0x20000FD8)=0x0000
read16(0x20000FDA)=0x0000
read16(0x20000FDC)=0x003B
read16(0x20000FDE)=0x0800
--- 0x0800003A: 0xBD08 pop {r3,pc}
read16(0x20000FE0)=0x0000
read16(0x20000FE2)=0x0000
read16(0x20000FE4)=0x003B
read16(0x20000FE6)=0x0800
--- 0x0800003A: 0xBD08 pop {r3,pc}
read16(0x20000FE8)=0x0000
read16(0x20000FEA)=0x0000
read16(0x20000FEC)=0x003B
read16(0x20000FEE)=0x0800
--- 0x0800003A: 0xBD08 pop {r3,pc}
read16(0x20000FF0)=0x0000
read16(0x20000FF2)=0x0000
read16(0x20000FF4)=0x003B
read16(0x20000FF6)=0x0800
--- 0x0800003A: 0xBD08 pop {r3,pc}
read16(0x20000FF8)=0x0000
read16(0x20000FFA)=0x0000
read16(0x20000FFC)=0x0027
read16(0x20000FFE)=0x0800
--- 0x08000026: 0xE7FF B 0x08000027
--- 0x08000028: 0xDFFF swi 0xFF

I can maybe see your confusion since the return address for all but the last one is the same address and we could perhaps craft an example. But recursion often has more than the return address but has some other local variables that are changing, in this case our local variable is in r0 if you will so doesn't need to be saved to the stack on each call.

First time our return back to the top bl after reset:

write16(0x20000FFC,0x0027)
write16(0x20000FFE,0x0800)

The remaining times it is the same return address but we need N number of them on the stack for the code to work as written.

write16(0x20000FF4,0x003B)
write16(0x20000FF6,0x0800)

write16(0x20000FEC,0x003B)
write16(0x20000FEE,0x0800)

write16(0x20000FE4,0x003B)
write16(0x20000FE6,0x0800)

write16(0x20000FDC,0x003B)
write16(0x20000FDE,0x0800)

So as we unwind this we have those five addresses on the stack now.

read16(0x20000FDC)=0x003B
read16(0x20000FDE)=0x0800

...

read16(0x20000FFC)=0x0027
read16(0x20000FFE)=0x0800

In general bl modifies lr and places the return address on the stack (the above is thumb code not arm code but covers your questions as they work the same in this respect). so if you are nesting calls one() calls two(), two() calls three() for two() to get back to one() lr needs to be saved in two() so that it can be used, if you don't save lr then the call to three() changes lr and we can't get back.

If your recursion wants to use bl (looks like compiled code) for purity, and you want a way for that function, factorial in your example test in mine, to be able to return to the original caller then those two facts combine to having to push lr on the stack. If you want to bl to the top of the recursive function, the same entry point that the outside caller used then every call will be adding lr to the stack, and every return needs to pull it back off.

If you want to do some hand assembly to modify it and it doesn't call the same entry point you can get rid of the bl and the stack stuff.

test:
    push {r3,lr}
test1:    
    cmp r0,#0
    beq test2
    sub r0,#1
    b test1
test2:    
    pop {r3,pc}

can even leave the bl in there

test:
    push {r3,lr}
test1:    
    cmp r0,#0
    beq test2
    sub r0,#1
    bl test1
test2:    
    pop {r3,pc}

but if you want to return each time then breaking the loop has to be done differently. I don't have a solution off hand that uses bl and returns but is able to break out of the loop at the right time.

when you use instruction sets that the call puts the return on the stack you are doing the same thing even though the return address is the same value for most of the iterations you need the right number of them to handle all the returns and then get the last return to go back to the original caller. — old_timer, Dec 05 '18 at 21:57
thank you for the simulation and the detailed explanation. I'm not at a level where I can understand this with a simple read through but I'll be combing over this with a review into what thumb is, etc — Philip Kwak, Dec 07 '18 at 01:12
Early ARM, the ARM company we know today basically. The 32 bit arm instruction set they made a 16 bit thumb instruction set a subset of the arm instructions, you can look at the armv5 through armv7-ar architectural reference manuals from their site. At the time of the ARM7TDMI the thumb instructions had a one to one relationship with full sized arm instructions, there are now thumb-2 extensions and you dont need to worry about those, for this example, these instructions have a compatible 32 bit arm instruction. — old_timer, Dec 07 '18 at 03:57
push and pop are pseudo instructions that came with thumb they are specific led and stm instructions. and I made this boot like an arm based microcontroller rather than a full sized arm, but for your question thats not relevant either...cmp, bx, bl, bne, push, pop, mov all should resemble or be familiar with the 32 bit arm instructions you have been working with. you/me any of us could write a simulator for the handful of instructions you are using too and basically see the same stack writes and reads (as far as saving the same lr n-1 times). — old_timer, Dec 07 '18 at 04:01
Thanks for the overview. Gives me a good perspective into the history and development of ARM over the years! — Philip Kwak, Dec 13 '18 at 02:26

Pushing LR before BL instruction in ARM assembly

3 Answers3