I'm trying to understand some assembler generated for the stm32f103 chipset by arm-none-eabi-gcc, which seems to be running exactly half the speed I expect. I'm not that familiar with assembler but since everyone always says read the asm if you want to understand what your compiler is doing I am seeing how far I get. Its a simple function:
void delay(volatile uint32_t num) {
volatile uint32_t index = 0;
for(index = (6000 * num); index != 0; index--) {}
}
The clock speed is 72MHz and the above function gives me a 1ms delay, but I expect 0.5ms (since (6000*6)/72000000 = 0.0005).
The assembler is this:
delay:
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #16 stack pointer = stack pointer - 16
movs r3, #0 move 0 into r3 and update condition flags
str r0, [sp, #4] store r0 at location stack pointer+4
str r3, [sp, #12] store r3 at location stack pointer+12
ldr r3, [sp, #4] load r3 with data at location stack pointer+4
movw r2, #6000 move 6000 into r2 (make r2 6000)
mul r3, r2, r3 r3 = r2 * r3
str r3, [sp, #12] store r3 at stack pointer+12
ldr r3, [sp, #12] load r3 with data at stack pointer+12
cbz r3, .L1 Compare and Branch on Zero
.L4:
ldr r3, [sp, #12] 2 load r3 with data at location stack pointer+12
subs r3, r3, #1 1 subtract 1 from r3 with 'set APSR flag' if any conditions met
str r3, [sp, #12] 2 store r3 at location sp+12
ldr r3, [sp, #12] 2 load r3 with data at location sp+12
cmp r3, #0 1 status = 0 - r3 (if r3 is 0, set status flag)
bne .L4 1 branch to .L4 if not equal
.L1:
add sp, sp, #16 add 16 back to the stack pointer
@ sp needed
bx lr
.size delay, .-delay
.align 2
.global blink
.thumb
.thumb_func
.type blink, %function
I've commented what I believe each instruction means from looking it up. So I believe the .L4 section is the loop of the delay function, which is 6 instructions long. I do realise that clock cycles are not always the same as instructions but since theres such a large difference, and since this is a loop which I imagine is predicted and pipelined efficiently, I am wondering if theres a solid reason that I am seeing 2 clock cycles per instruction.
Background: In the project I am working on I need to use 5 output pins to control a linear ccd, and the timing requirements are said to be fairly tight. Absolute frequency will not be maxed out (I will clock the pins slower than the cpu is capable of) but pin timings relative to each other are important. So rather than use interupts which are at the limit of my ability and might complicate relative timings I am thinking use loops to provide the short delays (around 100 ns) between pin voltage change events, or even code the whole section in unrolled assembler since I have plenty of program storage space. There is a period when the pins are not changing during which I can run the ADC to sample the signal.
Although the odd behaviour I am asking about is not a show stopper I would rather understand it before proceeding.
Edit: From comment, the arm tech ref gives instruction timings. I have added them to the assembly. But its still only a total of 9 cycles rather than the 12 I expect. Is the jump a cycle itself?
TIA, Pete
Think I have to give this one to ElderBug although Dwelch raised some points which might also be very relevant so thanks to all. Going from this I will try using unrolled assembly to toggle the pins which are 20ns apart in their changes and then return back to C for the longer waits, and ADC conversion, then back to assembly to repeat the process, keeping an eye on the assembly output from gcc to get a rough idea of whether my timings look OK. BTW Elder the modified wait_cycles function does work as expected as you said. Thanks again.