ret
is basically pop
into PC.
You jumped to doFirst
instead of calling it, so you should jump back with rjmp
to a label inside test
instead of ret
.
Or if doFirst
was a separate function, you made an optimized tail-call to it, so it returns to your caller. (Like how you (or a compiler) would implement int foo() { return bar(); }
)
You don't need to make doFirst
a separate function. You can inline it once, because you don't need it to return to 2 different places in your test
loop.
test:
cpi r16, 0
breq doFirst0;
cpi r16, 1
breq doFirst1;
donefirst:
cpi r16, 2
brne noSecond ; conditionally SKIP the inlined doSecond body
inc r17 ; inlined body of doSecond
noSecond:
rjmp test ; You don't need a slow 4-byte JMP, just a short relative jump
doFirst0:
inc r16
doFirst1: ; fall through instead of jumping back and finding that r16 == 1 now
inc r16
rjmp donefirst
Note: this optimization would change behaviour for r16 == 0
if we had both breq
instructions jump to the same place. We'd only increment once and then run the rest of the loop, instead of incrementing, then checking and finding r16 == 1
and incrementing again.
Really we might as well do if(r16 <=0) r16 = 2;
, with ldi r16, 2
If it was too big to duplicate, and you really did want a rcall
/ ret
to a helper function, you can conditionally jump over a rcall
instruction.
Using 2 CPI/BREQ pairs is a very inefficient way to test if r16 <= 1
(unsigned). AVR has a BRLO (BRanch if LOwer (unsigned)).
Since BR instructions are faster if not taken, we'll leave that code out of line, instead of inlining it (like I did for doSecond) and jumping over it with BRSH (Same or Higher).
test:
cpi r16, 1
brlo doFirst ;
donefirst: ; r16 will become 2, so we can't put this label later and skip the next check
cpi r16, 2
brne noSecond ; conditionally SKIP the inlined doSecond body
inc r17
noSecond: ; this label might as well be at test: directly
rjmp test
doFirst: ; rare case, only runs the first 2 iters, put it out of line
ldi r16, 2 ; 1 or 2 inc instructions make r16 = 2
rjmp donefirst
This whole loop doesn't make much sense unless an interrupt handler can modify or use r16 or r17, though. If that's not the case, you'd really want to just peel the first iteration and then fall into an infinite loop that does nothing.
When we enter your loop (if your ret
s went where intended) with r16 <= 2
, we end up with r16 = 2
and r17
incremented once on that first iteration.
test:
cpi r16, 2
brhi above2
ldi r16, 2 ; result of 1 or 2 inc instructions
; flags may differ from your version.
infloop_r16_eq_2: ; loop for the r16==2 case
inc r17
rjmp infloop_r16_eq_2
above2:
infloop: ; loop for the r16!=2 case
rjmp infloop
But if the registers are global variables that could be modified asynchronously, we can't just check r16
once and then continue forever.
I was curious what gcc would do, so I used register volatile unsigned char a asm("r16");
and "r17"
. Surprisingly this partially works, although it does warn that optimization may eliminate reads and/or writes to register variables. That seems to have happened for the inc r17
with AVR gcc4.6.4 -O3, but not at -O1
. gcc7.3 for x86-64 actually preserves it at -O3. Have a look on the Godbolt compiler explorer, and see also How to remove "noise" from GCC/clang assembly output?.
// x86-64 and AVR both have r14 and r15, but not r16/r17
register volatile unsigned char a asm("r14");
register volatile unsigned char b asm("r15");
void test() {
while(1) {
if (a <= 1)
a++; // or a = 2;
if (a == 2)
b++;
}
}
AVR gcc -O1 output:
test:
.L7:
cpi r16,lo8(2)
brsh .L2
subi r16,lo8(-(1)) ; inc r16
.L2:
cpi r16,lo8(2)
brne .L7 ; back to the top
; else fall through
subi r17,lo8(-(1)) ; inc r17
rjmp .L7 ; then back to the top
This looks pretty reasonable.
lo8(2)
is just 2, I don't know why gcc emits assembly like that. It's probably useful for symbol addresses or something, like lo8(test)
to get the low byte of a label address.
This is sort of a loop tail duplication optimization, but one of the tails is empty. So instead of skipping over an inc r17
, we just jump straight to the top of the loop.
The brne .L7
could jump to the brsh
instruction, because flags are still set from cpi r16, 2
. Gcc doesn't do that because we told it the registers were volatile, so it doesn't optimize away a 2nd read of the register.