
I'm trying to understand some assembler generated for the stm32f103 chipset by arm-none-eabi-gcc, which seems to be running at exactly half the speed I expect. I'm not that familiar with assembler, but since everyone always says to read the asm if you want to understand what your compiler is doing, I am seeing how far I get. It's a simple function:

void delay(volatile uint32_t num) { 
    volatile uint32_t index = 0; 
    for(index = (6000 * num); index != 0; index--) {} 
}

The clock speed is 72 MHz and the above function gives me a 1 ms delay, but I expect 0.5 ms (since, with num = 6, (6000*6)/72000000 = 0.0005 s).

The assembler is this:

delay:
        @ args = 0, pretend = 0, frame = 16
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #16         stack pointer = stack pointer - 16
        movs    r3, #0              move 0 into r3 and update condition flags
        str     r0, [sp, #4]        store r0 at location stack pointer+4
        str     r3, [sp, #12]       store r3 at location stack pointer+12 
        ldr     r3, [sp, #4]        load r3 with data at location stack pointer+4 
        movw    r2, #6000           move 6000 into r2 (make r2 6000)
        mul     r3, r2, r3          r3 = r2 * r3
        str     r3, [sp, #12]       store r3 at stack pointer+12
        ldr     r3, [sp, #12]       load r3 with data at stack pointer+12
        cbz     r3, .L1             compare r3 with zero, branch to .L1 if it is zero
.L4:
        ldr     r3, [sp, #12]   2   load r3 with data at location stack pointer+12
        subs    r3, r3, #1      1   subtract 1 from r3 and update the APSR condition flags
        str     r3, [sp, #12]   2   store r3 at location sp+12 
        ldr     r3, [sp, #12]   2   load r3 with data at location sp+12
        cmp     r3, #0          1   compare r3 with 0 (computes r3 - 0 and sets the flags)
        bne     .L4             1   branch to .L4 if not equal
.L1:
        add     sp, sp, #16         add 16 back to the stack pointer
        @ sp needed
        bx      lr
        .size   delay, .-delay
        .align  2
        .global blink
        .thumb
        .thumb_func
        .type   blink, %function

I've commented what I believe each instruction means from looking it up. I believe the .L4 section is the loop of the delay function, which is 6 instructions long. I do realise that clock cycles are not always the same as instructions, but since there's such a large difference, and since this is a loop which I imagine is predicted and pipelined efficiently, I am wondering if there's a solid reason I am seeing 2 clock cycles per instruction.

Background: In the project I am working on I need to use 5 output pins to control a linear CCD, and the timing requirements are said to be fairly tight. The absolute frequency will not be maxed out (I will clock the pins slower than the CPU is capable of), but pin timings relative to each other are important. So rather than use interrupts, which are at the limit of my ability and might complicate the relative timings, I am thinking of using loops to provide the short delays (around 100 ns) between pin voltage change events, or even coding the whole section in unrolled assembler since I have plenty of program storage space. There is a period when the pins are not changing during which I can run the ADC to sample the signal.

Although the odd behaviour I am asking about is not a show stopper I would rather understand it before proceeding.

Edit: From a comment, the ARM tech ref gives instruction timings. I have added them to the assembly. But it's still only a total of 9 cycles rather than the 12 I expect. Is the jump itself a cycle?

TIA, Pete

Think I have to give this one to ElderBug, although Dwelch raised some points which might also be very relevant, so thanks to all. Going from this I will try using unrolled assembly to toggle the pins, which are 20 ns apart in their changes, then return to C for the longer waits and the ADC conversion, then back to assembly to repeat the process, keeping an eye on the assembly output from gcc to get a rough idea of whether my timings look OK. BTW Elder, the modified wait_cycles function does work as expected, as you said. Thanks again.

Pete
    Remember that when you execute a load, you're still going to have to wait however long it takes for that request to go out to memory and the value to come back (hint: >0 cycles) before the next instruction can use it, regardless of [how well the core tries to pipeline everything else](http://infocenter.arm.com/help/topic/com.arm.doc.100165_0201_00_en/ric1417175925887.html). – Notlikethat Sep 22 '15 at 14:53
  • Is this any use? http://stackoverflow.com/questions/18220928/processor-instruction-cycle-execution-time – Weather Vane Sep 22 '15 at 14:56
  • Both interesting. The ARM reference page I hadn't looked at; I'll update the question with what seem to be the timings, if I understand right. That partly explains it. – Pete Sep 22 '15 at 15:10

2 Answers


First, doing a spin-wait loop in C is a bad idea. Here I can see that you compiled with -O0 (no optimizations), and your wait will be much shorter if you enable optimizations (EDIT: actually, the unoptimized-looking code you posted may just result from the volatile, but it doesn't really matter). C wait loops are not reliable. I maintained a program that relied on a function like that, and each time we had to change a compiler flag the timings were thrown off (fortunately, there was a buzzer that went out of tune as a result, reminding us to fix the wait loop).

About why you don't see 1 instruction per cycle: some instructions take more than 1 cycle. For example, bne takes additional cycles if the branch is taken. On top of that there are less deterministic factors, like bus usage. Accessing the RAM means using the bus, which may be busy fetching data from ROM or in use by a DMA. This means instructions like STR and LDR may be delayed. In your example you have a STR followed by a LDR on the same location (typical of -O0); if the MCU doesn't have store-to-load forwarding, that adds a delay.


What I do for timings is use a hardware timer for delays above 1 µs, and a hard-coded assembly loop for the really short delays.

For the hardware timer, you just have to set up a timer at a fixed frequency (with period < 1 µs if you want delays accurate to 1 µs), and use some simple code like this:

void wait_us( uint32_t us ) {
    uint32_t mark = GET_TIMER();
    us *= TIMER_FREQ/1000000;
    while( us > GET_TIMER() - mark );
}

You can even take mark as an extra parameter, setting it before some task and using a variant of the function to wait for the remaining time afterwards. Example:

uint32_t mark = GET_TIMER();
some_task();
wait_us( mark, 200 );

For the assembly wait, I use this one on an ARM Cortex-M4 (close to yours):

#define CYCLES_PER_LOOP 3
inline void wait_cycles( uint32_t n ) {
    uint32_t l = n/CYCLES_PER_LOOP;
    /* one SUBS (1 cycle) + one taken BNE (2 cycles) per iteration */
    asm volatile( "0:" "SUBS %[count], 1;" "BNE 0b;" :[count]"+r"(l) );
}

This is very short, precise, and won't be affected by compiler flags or bus load. You may have to tune CYCLES_PER_LOOP, but I think it will be the same value for your MCU (here it is 1+2 for SUBS+BNE).

ElderBug
  • Great answer, I had to read up on asm syntax a bit to understand that macro. But I can't get it to work; do I need to pass something to the compiler to make it work correctly? I have included exactly the 2 lines relating to assembly and called it with various values of n from 1 to 250 (not sure if this function takes more than 8-bit numbers), but it's still giving me only a few cycles of delay per call. I'm sure I am missing some compiler flag or something. – Pete Sep 22 '15 at 17:08
  • @Pete with last-generation processors you might have "got away with" dead reckoning from a software delay loop (ignoring the few interrupts there might be), but those days have gone. The only sensible way to delay (even if you have the time to spare) is with a timer, either directly, or better by monitoring a counter variable maintained by a regular timer interrupt, at the granularity required by the upper levels of the program. If you have to sit and wait, with nothing better to do, the program was badly structured. – Weather Vane Sep 22 '15 at 18:08
    I completely understand what you mean WV, for most uses. But I need to separate pin transitions here by as little as 20 ns and I don't think I am going to manage that with timer techniques; the overheads are just too high, I think. If you know that's wrong, please do say how to do it. To illustrate, I need to provide 3 pins with transitions 20 ns apart (i.e. one pin goes high, 20 ns later the next does, and 20 ns later the next does). It's probably better to do this in dedicated hardware, but if I can, I prefer to avoid that. – Pete Sep 22 '15 at 18:16
  • @WeatherVane Using a timer isn't always a good idea when you have to wait just a few cycles. Personally I use the assembly loop when some peripheral requires just a few ns. In the OP's case, maybe he can find a better design, but I think it's still sensible. – ElderBug Sep 22 '15 at 18:16
  • @Pete About the assembly, it shouldn't require any flag. To be sure, you use it like this: `WAIT_CYCLES(10);`? What is the assembly actually generated? Also, you can replace `BGT` by `BNE`, and it should allow up to `n = 2**32 * CYCLES_PER_LOOP`, so this isn't the problem. – ElderBug Sep 22 '15 at 18:19
  • BTW Elder, I found my problem. This was not working: `for(index = num; index != 0; index--) { WAIT_CYCLES(1100000); }` This does work: `volatile uint32_t waits = 1100000; for(index = num; index != 0; index--) { WAIT_CYCLES(waits); }` And the advice to check the asm is so true. I saw that the register holding the 1100000 value was not being updated when the outer (C) loop jumped back to the start. Making the value volatile moved the label (to jump back to) up one line, sorting the issue. – Pete Sep 22 '15 at 18:22
  • @Pete perhaps I was misled by the question stating a 1 ms delay, which you now say is ns. – Weather Vane Sep 22 '15 at 18:22
    WV sorry, yes, probably misleading; I was testing this with 1 ms (1000x, so I can time the LED blink and check the speed). I would normally use a timer / interrupts for this kind of thing, exactly as you say. But I just don't think they will operate at these speeds, as there must be a few cycles of overhead and I am only playing with 1 or 2 cycles for those particular transitions I mention. I also need some longer delays and I'm not sure how to mix timer-based delays with these asm ones, but that's for tomorrow... I might yet have to resort to a mixed hardware / software solution. – Pete Sep 22 '15 at 18:27
  • Weirdly though, I am still seeing that asm loop taking 6.5 cycles. That's probably not a huge problem, as I can see that I will have to use assembler to toggle those 3 difficult pins, but I am interested in why. I am calling the macro with a value of 1100000, 10 times, for a delay of 1 second. There's only an extra 7 lines of assembly in the 'outer loop' my asm macro is nested in, which can't be the problem. I wonder if my hardware is actually running at the speed I think it is. It's all default and the board is the STM32F103C8T6. I thought the default speed was 72 MHz but maybe not. – Pete Sep 22 '15 at 18:34
  • @Pete I had a look earlier to see if the instruction rate is a fraction of the clock speed (as with some processors) but didn't turn anything up. If the timing is so short, can you solve this empirically? – Weather Vane Sep 22 '15 at 18:35
    WV well, the more I think about it, the more I wonder if hardware is the way, at least to get those really short transitions done. The pattern repeats itself thousands of times as I sample and convert the analog output of the device (a TCD1201D, if you want to look at the timing diagrams). Perhaps I need a little bit of logic (chips) to create the staggered timing for the control pins and simply clock that (slower) with the ARM chip. I will need to see how reading and writing the GPIOs affects things, as alluded to by Dwelch in the answer below. – Pete Sep 22 '15 at 18:40
  • @Pete is this the crux? Now you have stated your purpose, why not check device status instead of a delay? – Weather Vane Sep 22 '15 at 18:46
    No, the 5 outputs are required to clock data through the CCD. It is a tricky beast and requires all these carefully timed transitions to move (as I understand it) analogue voltages through its internal 'shift register' (although of course it's not a shift register, but some kind of analogue delay line in silicon). As you usher the charge along the slice of silicon, you sample the output voltage, before shifting the charge a little more, sampling again, etc. Take a look at the datasheet; it's a nightmare of a beast to cater for... – Pete Sep 22 '15 at 18:56
  • @Pete I just corrected the assembly loop; now it should always work as expected, even in another loop, and shouldn't need any volatile (even in the outer loop). About the speed of your MCU: maybe you didn't enable the PLL? The timing you mentioned looks like you are running at 12 MHz, which is your crystal frequency, and thus the default frequency for your MCU. The PLL must be enabled in software, so if you didn't put any code in for that, that must be it. – ElderBug Sep 23 '15 at 07:28
  • I was thinking that, but on checking, in system_stm32f10x.c I have `#define SYSCLK_FREQ_72MHz 72000000` uncommented, so I don't have any idea what's going on. The xtal is 8 MHz BTW, so it can't be running at that (each instruction would be < 1 cycle). – Pete Sep 23 '15 at 11:08
  • @Pete The `define` alone is not enough. You have to check if it is used to set the clock. Usually these defines are just to know the frequency, not necessarily to set it. Also, the internal oscillator is indeed 8 MHz, but are you using an external crystal? Maybe it's also 8 MHz, but you should check if you haven't already. If you are sure about the 8 MHz then I don't know. Maybe you should use a timer and a LED to check the actual frequency. – ElderBug Sep 23 '15 at 11:30
  • Well, there's a function lower down in that system_stm file with `ifdef SYSCLK_FREQ_72MHz`, and the comments claim it sets the PLL, so I am reasonably sure it's doing it (I only looked today). Perhaps it's just related to the stuff Dwelch mentioned, like cache or pipelining, although it seems to me that since I am only using r1 it should be running at full speed. Is the value to subtract (1) stored in flash and retrieved each iteration, perhaps? I have not explicitly turned on a cache (don't know how) and I'm not setting any optimisation when compiling main, as you said. – Pete Sep 23 '15 at 11:44
  • @Pete I really don't know. I just re-tested with my Cortex-M4, which should be the same architecture, and `wait_cycles(1000000)` with `CYCLES_PER_LOOP=3` gives me exactly 1000000 cycles. The functions in system_stm32f10x.c could be expected to work, so everything should be configured properly. Are you properly calling `SystemInit()` and `SystemCoreClockUpdate()`? If you are, then I have no idea. – ElderBug Sep 23 '15 at 12:40
  • Just tried putting 100 SUBS per iteration of the asm loop and it came out to fractionally over 1 second with 72000000 SUBSs executed in total. So I guess either the BGT and/or the jump back to 0: is taking the other 5.5 cycles. I don't know much about pipelining, prediction and such, but I wonder if this is a facet of those? BTW I am setting CYCLES_PER_LOOP to 1 for this testing, so this is only off by a factor of 2 rather than 6.5. The reason is that I will probably not be using a loop like this but discrete ops between GPIO toggling, so I want to know the time per SUB, which indeed seems to be 1 cycle now :-) – Pete Sep 23 '15 at 13:10

This is a Cortex-M3, so you are likely running out of flash? Did you try running from RAM, and/or adjusting the flash speed, or the clocks vs. flash speed (slowing the main clock), so you can get the flash as close to a single cycle per access as you can?

You are also doing a memory access for half of those instructions, which is a cycle or more for the fetch (one if you are in SRAM running on the same clock) and another clock for the RAM access (due to using volatile). That could account for some percentage of the difference between one clock per instruction and two. The branch might cost more than one clock as well; on an M3 I am not sure if you can turn branch prediction on or off, and branch prediction is a bit funny in the way it works anyway: if the branch is too close to the beginning of a fetch block it won't work, so where the branch sits in memory can affect the performance.

Where any of this code sits can affect the performance. You can do experiments by adding nops anywhere in front of the code to change the alignment of the loop; alignment affects caches (which you likely don't have here) and can also affect other things based on how big the instructions are and where they lie in a fetch (some ARMs fetch 8 instructions at a time, for example).

Not only do you need to know assembly to understand what you are trying to do, but also how to manipulate that assembly and other things like alignment and re-arranging the instruction mix; sometimes more instructions are faster than fewer, and so on. Pipelines and caches are difficult at best to predict, if at all, and can easily throw off assumptions and experiments with hand-optimized code.

Even if you overcome the slow flash, the lack of a cache (whose performance you could not rely on anyway), and other things, the logic between the core and the I/O, and the speed of the I/O itself, might be another performance hit for bit-banging. There is no reason to expect the I/O to be a small number of cycles per access; it might even be a double-digit number of clocks. Very early in this research you need to time GPIO read-only loops, write-only loops, and read/write loops. If you are relying on the GPIO logic to touch only one bit in a port rather than the whole port, that might have a cycle cost, so you need to performance-tune that as well.

You might want to look into using a CPLD if you are even close to the margin on timing and have to be hard real-time, as one extra line of code or a new rev of the compiler can completely throw off the timing of the project.

old_timer
  • Thanks, that's a whole new load of information and I am going to have to take some time to properly understand and read about it, so I probably won't get to do that today. It raises a load of new questions, though. I am wondering more and more if I really will have to implement this in hardware... hope not. – Pete Sep 22 '15 at 18:37
  • The simplest thing to do, if you have a debugger, is to compile/link for, load, and run the program from RAM. If you can't do that, then have the program in ROM copy the real program over to RAM and branch to it; get the flash out of the loop. Just running these microcontrollers faster does not automatically improve performance, as often the number of wait states for the flash increases; depending on the brand and family there may be a flash register that has to be set when upping the clocks, and that section of the manual describes the ratio at a high level. – old_timer Sep 22 '15 at 19:20
  • You likely don't have a data or instruction cache, so you can't just flip that on and see if it helps. My guess is that you have a combination of two data cycles per instruction for many of the instructions (the fetch itself and a data access), so you would not be at one cycle per instruction but 1.5 or 2, even if you could eliminate all the other cycle stealers... – old_timer Sep 22 '15 at 19:21
  • Another comment: I think ARM claims the Cortex-M, or at least the M3, is a Harvard architecture (I and D are separate), but you can use data operations to write to RAM and then execute that, so it is a bit of a fail on that claim. And I am willing to bet, especially for something meant to be a microcontroller, that it isn't actually two busses but one with different commands. – old_timer Sep 23 '15 at 04:35
  • The reason for saying this: if it really is two busses, then your instruction fetch (flash in this case) and data operation can theoretically happen on the same clock cycle, so you don't burn two clocks minimum for those data operations (if instructions are on the I bus from flash, and data is on the D bus from SRAM). If you run from RAM, though, the instruction fetches and the data operations likely have to be serialized, costing you more clocks... unless they have more than one data bus on this core... – old_timer Sep 23 '15 at 04:38
  • See the above comment (at the end); it looks like my BGT and jump are taking 5.5 cycles. So I guess this looks like pipeline misses, although I don't know much at all about that. Re flash access time: with Elder's asm above I am just executing 'subs r3, 1; bgt 0b;', which doesn't look like it's accessing flash, but I'm not sure; what do you think? I'm not sure where the '1' (SUBS r3, 1) is stored; is it RAM, or a register, etc.? If it were a register I guess it would be rx rather than 1. Regardless, when I unrolled the loop 100x I am now seeing the right speed, so the SUBS is indeed taking the expected 1 cycle. – Pete Sep 23 '15 at 13:30