Your loop branch should be cmp / bne loop, so you fall through to end the loop. That means one fewer branch instruction inside the loop. See Why are loops always compiled into "do...while" style (tail jump)?.
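A rough sketch of that shape (the body and register choices here are placeholders, not your code):

        @ r3 = dst, r2 = endp, set up before the loop
loop:                           @ do {
        str   r0, [r3], #4      @ loop body: store and post-increment dst
        cmp   r3, r2            @ compare against the end pointer
        bne   loop              @ } while (dst != endp) -- falls through when done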
Also, take advantage of setting flags with instructions you already need, instead of using separate TST or CMP instructions.
If you're going to use a counter separate from the output pointer, count it down towards zero so you can subs r4, r4, #1 / bne.
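A minimal sketch of that counted loop (r4 for the counter and r5 for the pointer are arbitrary picks, not registers from your code):

        mov   r4, #10           @ counter = number of elements
loop:
        str   r0, [r5], #4      @ store and post-increment the output pointer
        lsl   r0, r0, #1        @ (whatever the loop body does with the value)
        subs  r4, r4, #1        @ decrement and set flags: no separate cmp needed
        bne   loop              @ loop while counter != 0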
There are a lot of missed optimizations in your code, especially your insane way of creating constants in registers. ARM has a real mov instruction; use it instead of ANDing or ADDing with zero.
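For example (values picked arbitrarily):

        mov   r1, #10           @ small constant straight into a register
        mvn   r2, #0            @ r2 = ~0 = 0xFFFFFFFF, the bitwise NOT of an immediate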
Have a look at what a good C compiler will do: compiler output is often a good starting point for optimization, or a way to learn tricks for the machine you're targeting. (See also How to remove "noise" from GCC/clang assembly output? and Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”.)
Your version stores the first element without checking its high bit, so if the input had only the high bit set, you'd store another 9 zeros. IDK if that's what you wanted, or if that's a case you don't have to handle. (i.e. maybe your inputs are guaranteed to be non-negative signed numbers).
// uint32_t would be more portable, but ARM has 32-bit unsigned int
void store_sequence(unsigned val)
{
    unsigned *dst = (unsigned *)700;
    unsigned *endp = dst + 10;
    // val = 1;  // use the function arg so it's not a compile-time constant
    for (; dst < endp; dst++) {
        *dst = val;            // first store without checking the value
        val <<= 1;
        if (val >= (1UL << 31))
            break;
    }
}
Checking val right after shifting it gives nice asm: otherwise the compiler doesn't always take advantage of using the shift to set flags. Note that even though it's a for() loop, the compiler can prove that the condition is true the first time, and doesn't add an extra check / branch at the top to see if the loop should run zero times.
I put this code on the Godbolt compiler explorer to get gcc and clang output for ARM.
gcc7.2.1 -O3 fully unrolls the loop. With a larger count, it eventually decides to make a loop, but the unrolled loop is interesting: no pointer increment is required if you fully unroll. Using a different shift-count to re-shift the original also creates instruction-level parallelism (a CPU could run multiple shift instructions in parallel because there's no dependency on the result of the previous one).
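To show what that independence means (a sketch, not from the compiler output below):

        @ independent: all three shift the original r0, so they can overlap
        lsls  r1, r0, #1
        lsls  r2, r0, #2
        lsls  r3, r0, #3
        @ vs. a serial dependency chain: each shift must wait for the previous result
        lsls  r1, r0, #1
        lsls  r2, r1, #1
        lsls  r3, r2, #1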
Notice that lsls sets flags from the shift, and ARM's flags include an N flag that's set if the high bit of the result is set. The MInus condition is true if N==1. The name comes from 2's complement negative numbers, but everything is just bits and you can use it to branch on the high bit. (The PLus condition is strangely named: it's true for non-negative results including zero, i.e. it only checks that N==0. https://community.arm.com/processors/b/blog/posts/condition-codes-1-condition-flags-and-codes)
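In isolation, that idiom is just (the label name is made up):

        lsls  r0, r0, #1        @ val <<= 1, setting N/Z/C from the result
        bmi   done              @ taken when N==1, i.e. bit 31 of the result is set
        @ ... more work here when the high bit wasn't set ...
done:
        bx    lr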
Instead of an actual bmi (branch if minus), the compiler decided to use a predicated bx lr. i.e. return if MInus, otherwise it runs as a NOP. (Using -mcpu=cortex-a57 results in a bmi to the bottom of the loop, with a bx lr there. Apparently the tuning options for that microarchitecture make gcc avoid predicated bx instructions.)
@ On function entry, val is in r0. Use mov r0, #1 if you want
@ from gcc7.2.1 -O3
store_sequence:
        mov   r3, #0            @ this is the most efficient way to zero a reg
        lsls  r2, r0, #1        @ instruction scheduling: create r2 early
        str   r0, [r3, #700]    @ gcc just uses offsets from a zeroed reg
        bxmi  lr                @ if(val<<1 has its high bit set) return;
        lsls  r1, r0, #2
        str   r2, [r3, #704]    @ 2nd store, using val<<1 after checking it
        bxmi  lr
        lsls  r2, r0, #3        @ alternating r1 and r2 for software pipelining
        str   r1, [r3, #708]    @ 3rd store, using val<<2 after checking it
        bxmi  lr
        ...
To get a rolled-up loop, you can increase the loop count, or compile with -Os (optimize for code-size).
With endp = dst+100 and gcc -O3 -mcpu=cortex-a57 (to avoid bxmi lr), we get an interesting loop that's entered by jumping into the middle, so it can fall through at the bottom. (In this case it probably would have been more efficient just to let the cmp / beq run on the first iteration, or to put cmp/bne at the bottom. -Os does the latter.)
@ gcc -O3 -mcpu=cortex-a57 with loop count = 100 so it doesn't unroll.
store_sequence:
        mov   r3, #700
        movw  r2, #1100         @ Cortex-A57 has movw. add would work, too.
        b     .L3
.L6:                            @ do {
        cmp   r3, r2
        beq   .L1               @ if(p==endp) break;
.L3:                            @ first iteration loop entry point
        str   r0, [r3]
        lsls  r0, r0, #1        @ val <<= 1
        add   r3, r3, #4        @ add without clobbering flags
        bpl   .L6               @ } while(val's high bit is clear)
.L1:
        bx    lr
With -Os, we get a better-looking loop. The only downside is that bmi (or bxmi lr) reads flags right away in the next instruction after lsls sets them. You could schedule the add between them, though. (Or in Thumb mode you'd want to keep it this way, because adds has a shorter encoding than add, and adds would clobber the flags lsls just set if you moved it in between.)
@ gcc7.2.1 -Os -mcpu=cortex-a57
store_sequence:
        mov   r3, #700          @ dst = 700
.L3:                            @ do{
        str   r0, [r3]
        lsls  r0, r0, #1        @ set flags from  val <<= 1
        bxmi  lr                @ bmi to the end of the loop would work
        add   r3, r3, #4        @ dst++
        cmp   r3, #740
        bne   .L3               @ } while(p != endp)
        @ FIM:
        bx    lr
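The rescheduling mentioned above would look something like this (my rearrangement, not compiler output); it works because plain add doesn't write the flags, so bxmi still sees the flags from lsls:

@ hand-scheduled variant of the -Os loop, not gcc output
store_sequence:
        mov   r3, #700          @ dst = 700
.L3:                            @ do {
        str   r0, [r3]
        lsls  r0, r0, #1        @ val <<= 1, setting flags
        add   r3, r3, #4        @ dst++: add (not adds) leaves the flags alone
        bxmi  lr                @ still reads the N flag that lsls set
        cmp   r3, #740
        bne   .L3               @ } while(p != endp)
        bx    lr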
With a larger endp that won't fit in an immediate operand for cmp, gcc calculates it in a reg outside the loop. It always uses mov, or loads it from a literal pool in memory, instead of using add r2, r3, #8192 or something. I'm not sure I constructed a case where an immediate for add would work but an immediate for movw wouldn't.
Anyway, regular mov works for small immediates, but movw is a newer encoding that's not baseline, so gcc only uses movw when you compile with -mcpu= something that has it.
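To illustrate the difference (values chosen just to show the encodings, including the 1100 from the loop above): a classic ARM mov immediate has to be an 8-bit value rotated right by an even amount, while movw (ARMv6T2 / ARMv7 and later) takes any 16-bit value.

        mov   r2, #740          @ ok: 0x2E4 fits the rotated-8-bit immediate pattern
        mov   r2, #0xFF00       @ ok: 0xFF rotated
        @ mov r2, #1100         @ not encodable: 0x44C doesn't fit that pattern
        movw  r2, #1100         @ movw takes any 16-bit immediate, hence gcc's choice above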