Best way to add delay/do nothing for n cpu cycles

Question

I need to add a delay of n CPU cycles (~30) to my code. My current solution is the one below, which works but isn't very elegant.

Also, the delay has to be known at compile time. I can work with this, but it would be ideal if I could change the delay at runtime. (Some overhead is OK, but I need 1-cycle resolution.)

I do not have any peripheral timers left that I could use, so it needs to be a software solution.

do_something();
#define NUMBER_OF_NOPS   (SOME_DELAY + 3)
#include "nops.h"
#undef NUMBER_OF_NOPS
do_the_next_thing();

nops.h:

#if NUMBER_OF_NOPS > 0
    __ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 1
    __ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 2
    __ASM volatile ("nop");
#endif
...

"*I do not have any peripheral timers left*" -- well you *can* multiplex one timer in software. But that's probably not good enough if you really need cycle-exact delays .... — , Jun 20 '17 at 12:06
Software delays are horrible so I propose a horrible solution: 30 consecutive `nop` instructions with a computed jump into them. — Weather Vane, Jun 20 '17 at 12:12
I'd be interested in *why* you need cycle-exact delays? Maybe there's some better solution working with less precision? If not, @WeatherVane's suggestion seems a good idea. — , Jun 20 '17 at 12:23
The only reason why you'd run out of peripheral timers is pretty much if they are all locked up with hardware resources, such as PWM or input capture etc. If this is not the case, then something is very wrong in your program design. But if you do need all timers for hardware, then you probably have a separate RTC which you can use to create a general-purpose timer driver. — Lundin, Jun 20 '17 at 12:28
"I do not have any peripheral timers left, that I could use, so it needs to be a software solution." - You should redesign your architecture then. The STM32 families have enough timers. Not sure, but doesn't the CM0 have a SysTick timer like the CM3/4/7? — too honest for this site, Jun 20 '17 at 13:03
'It is OK if there is some overhead, but I need the 1 cycle resolution' - so, you don't use interrupts at all then? — ThingyWotsit, Jun 20 '17 at 13:14
Btw it seems rather unlikely that you actually need nanosecond accuracy on a mainstream STM32. If you do, you picked the completely wrong CPU for the task. You would likely need to use some specialized DSP instead. This sounds like a typical [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). — Lundin, Jun 20 '17 at 13:18
For accuracy your solution is good. The question of why you would need this still remains. — Gerhard, Jun 20 '17 at 13:21
@ThingyWotsit The code is executed in an ISR with the highest priority. — gugelhüpf, Jun 20 '17 at 14:32
@Olaf It does have a SysTick Timer and we use it, but only with 1ms resolution. — gugelhüpf, Jun 20 '17 at 14:35
@Gerhard Ok. I thought it was best to generalize the problem for the question, but maybe I am really too focused on this one way of solving it. What I want to do is put some delay between disabling and enabling the break function of TIM1 in an STM32F051 to achieve some sort of blanking. See page 388 in [Reference Manual](http://www.st.com/content/ccc/resource/technical/document/reference_manual/c2/f8/8a/f2/18/e6/43/96/DM00031936.pdf/files/DM00031936.pdf/jcr:content/translations/en.DM00031936.pdf) — gugelhüpf, Jun 20 '17 at 14:36
@megle: You cannot program the SysTick for 1ms resolution! It is a simple counter with either CPU_CLK/1 or /8. **Think about that**! — too honest for this site, Jun 20 '17 at 14:36
"Yes all the timers are used for hardware control." - IIRC there is at least one TIM (6 and/or 7 IIRC) which has no external connection. Anyway, it seems your system architecture is a bit messed up if you forgot about such delays. Whatever it is, CPU loops are definitely a very bad approach on such systems. What if an interrupt occurs? — too honest for this site, Jun 20 '17 at 14:38
@Olaf I meant the interrupt for the SysTick is configured to occur every ms. I am not sure about clock speed of the SysTick, I have to check that. — gugelhüpf, Jun 20 '17 at 15:08
As I assumed: TIM6 and 7 are available. So why not use those? Or read SysTick and compare (if you don't use /1 divisor, change this!). But that is still a bad approach. — too honest for this site, Jun 20 '17 at 15:10
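
The computed jump into a run of `nop` instructions suggested by Weather Vane in the comments above could be sketched in C as a fall-through `switch`, which lets the NOP count be chosen at runtime. This is only a sketch under assumptions: `delay_nops`, `DELAY_NOP` and `MAX_NOPS` are hypothetical names, the inline-assembly syntax assumes GCC or Clang, and the ladder is shortened to 8 entries. The dispatch overhead depends on how the compiler lowers the `switch`, so the real cycle count would have to be measured on the target.

```c
#include <stdint.h>

#define MAX_NOPS 8u  /* illustrative length; extend the ladder for longer delays */

/* One NOP instruction (GCC/Clang inline-assembly syntax, an assumption). */
#define DELAY_NOP() __asm__ volatile ("nop")

/* Executes min(n, MAX_NOPS) NOPs by jumping into a fall-through ladder.
 * Returns the number of NOPs actually executed. */
static inline uint32_t delay_nops(uint32_t n)
{
    if (n > MAX_NOPS) {
        n = MAX_NOPS;           /* clamp requests that exceed the ladder */
    }
    switch (n) {
    case 8: DELAY_NOP();        /* fall through */
    case 7: DELAY_NOP();        /* fall through */
    case 6: DELAY_NOP();        /* fall through */
    case 5: DELAY_NOP();        /* fall through */
    case 4: DELAY_NOP();        /* fall through */
    case 3: DELAY_NOP();        /* fall through */
    case 2: DELAY_NOP();        /* fall through */
    case 1: DELAY_NOP();        /* fall through */
    default: break;             /* n == 0: no NOPs */
    }
    return n;
}
```

Calling `delay_nops(5)` enters the ladder at `case 5` and runs the last five NOPs; values above `MAX_NOPS` are clamped.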

0___________ · Answer 1 · 2017-06-20T19:20:08.993

1

On Cortex devices, NOP literally means nothing: there is no guarantee that a NOP will consume any time. NOPs are intended for padding only. If you have several consecutive NOPs, they can simply be removed from the pipeline before they are executed.

For more information refer to the Cortex-M0 documentation. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CHDJJGFB.html

Software delays are quite tricky on Cortex devices, and you should use other instructions, possibly together with barrier instructions, instead.

Use ISB instructions: each takes 4 clock cycles plus the flash access time, which depends on the speed the core is running at. For very precise delays, place this part of the code in SRAM.

edited Jun 20 '17 at 19:20

answered Jun 20 '17 at 15:59


It must be fetched from flash, that would take some time, so `NOP` is just fine. `M0` has no cache. – followed Monica to Codidact Jun 20 '17 at 19:15
But it has a pipeline and if you execute time consuming instructions they will be just fetched & discarded, so when you enter the delay routine - it can be 0, or completely unpredictable time - making them useless. ISB - 4 cycles + flash access – 0___________ Jun 20 '17 at 19:18
It won't take less time than is needed to read that many bytes from flash. And it won't take more, since (as you are saying) they will be discarded from the pipeline. They must enter the pipeline somehow first, to be discarded afterwards, mustn't they? – followed Monica to Codidact Jun 20 '17 at 19:34
@berendi I'm afraid you should read more about the Cortex core before holding such strong opinions. The best source is the ARM website. Even the M0 core implements a multistage pipeline, and from the programmer's point of view you can't predict how many NOPs were discarded unless you use the ISB instruction. If you know better than ARM, of course you can, but I prefer to trust the official documentation instead. Cortex cores are not like AVR or '51 uCs – 0___________ Jun 20 '17 at 22:34
The official docs state that it is implementation dependent, but I have never seen an implementation where NOP takes zero cycles. You may, however, encounter pre-fetch delays and other complications so any NOP routine should be measured before actual use. – Tony K Jun 21 '17 at 08:33
I still do not understand why you would not use instructions with predictable timings. The timing of an inline NOP routine depends on the code that executed earlier. – 0___________ Jun 21 '17 at 08:36
In that case you can MOV a register to itself like https://stackoverflow.com/questions/27510198/how-to-delay-an-arm-cortex-m0-for-n-cycles-without-a-timer – Tony K Jun 21 '17 at 08:39
Incidentally ISB has different cycle counts on different cores, for example on an M0+ it is 3 cycles due to the shorter pipeline. – Tony K Jun 21 '17 at 08:42
Of course they are implementation dependent, but this kind of the tick accurate delay routines are too. – 0___________ Jun 21 '17 at 08:56
So the execution time of `ISB` is no more predictable than that of a `NOP`. – followed Monica to Codidact Jun 21 '17 at 09:26
It is as you execute it on a particular micro, not the virtual one. Always the critical parts have to be written for the actual core. – 0___________ Jun 21 '17 at 09:30

Tony K · Answer 2 · 2017-06-22T09:53:32.067

1

Edit: There is a better answer from another SO Q&A here. However, it is in assembly. AFAIK, using a counter like SysTick is the only way to guarantee any semblance of cycle accuracy.

Edit 2: To avoid a counter overflow, which would result in a very, very long delay, clear the SysTick counter before use, i.e. SysTick->VAL = 0;

Original:

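Cortex-Ms have a built-in timer called SysTick which can be used for cycle-accurate timing purposes.

First set the reload value and enable the timer, clocked from the processor clock:

SysTick->LOAD  = 0x00FFFFFF;                  /* maximum 24-bit reload value */
SysTick->CTRL  = SysTick_CTRL_CLKSOURCE_Msk |
                 SysTick_CTRL_ENABLE_Msk;

Then you can read the current count from the VAL register. SysTick counts down, so the elapsed cycle count is the start value minus the current value. You can implement a cycle-accurate delay this way:

uint32_t start = SysTick->VAL;
/* The 24-bit masked difference handles the wrap from 0 back to LOAD (assumes LOAD = 0x00FFFFFF). */
while (((start - SysTick->VAL) & 0x00FFFFFFUL) < 30U) { }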

Note that this will introduce some overhead because of the load, compare and branch in the loop so the final cycle count will be a little off, no more than a few ticks in my estimation.
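
SysTick's VAL register is 24 bits wide and counts down, so the elapsed-cycle arithmetic can be checked on a host without hardware. The following is a minimal sketch, assuming the reload value is 0x00FFFFFF (so the counter period is 2^24 cycles); `systick_elapsed` and `SYSTICK_MASK` are hypothetical names, not part of CMSIS.

```c
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu  /* SysTick VAL is 24 bits wide */

/* Cycles elapsed between two reads of a 24-bit down-counter that
 * reloads to 0x00FFFFFF (assumption). The result is only meaningful
 * while the true elapsed time is below 2^24 cycles. */
static inline uint32_t systick_elapsed(uint32_t start, uint32_t now)
{
    /* Down-counter: earlier reads are larger, so subtract now from start.
     * Masking to 24 bits makes the wrap from 0 to 0x00FFFFFF transparent. */
    return (start - now) & SYSTICK_MASK;
}
```

On the target, the busy-wait would then read something like `while (systick_elapsed(start, SysTick->VAL) < 30) { }`, with the loop overhead still adding a few cycles on top.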

edited Jun 22 '17 at 09:53

answered Jun 21 '17 at 08:30


Original question is about CPU-cycle-accurate delays, SysTick is useless here. Handling interrupt is already close to required delay :) – Code Painters Jun 21 '17 at 08:41
The code does not require interrupt, it is simply a loop reading the count over and over. – Tony K Jun 21 '17 at 08:49
Yeah, I know - sorry for being unclear, I was only giving IRQ as a reference of sort. Still my point holds, I believe. – Code Painters Jun 21 '17 at 08:59
IMO it is the best you can get in "pure C", since otherwise you cannot guarantee the cycle timing of any given routine. Even instruction counting can be fraught with danger (pipeline, memory wait states, cache indeterminacy etc) as other answers and comments illustrate. – Tony K Jun 21 '17 at 09:07
As I stated in my answer, for very accurate results such a critical block has to be placed in SRAM. On newer cores, in the CCMRAM or ICCMRAM – 0___________ Jun 21 '17 at 09:28
**Beware of counter overflow. The code above will hang forever, if it's called at the "right" time.** – followed Monica to Codidact Jun 21 '17 at 09:38
@berendi thanks for the catch, I have added instruction to clear the counter before use which should avoid an overflow in all but the longest delays (2^23 cycles) – Tony K Jun 22 '17 at 09:57

score -1 · Answer 3 · answered Jul 08 '17 at 17:52

You can use a free-running up-counter as follows:

uint32_t t = <periph>.count;
while ((<periph>.count - t) < delay);

As long as delay is less than half the period of the counter, this is unaffected by wrapping of the counter value - the unsigned arithmetic produces the correct time delta.

Note that since you don't need to control the counter's value in any way, you can use any such counter in the system - even if it's being used for another purpose (as long, of course, as it really is running continuously and freely, and at a rate that gives you the timing resolution that you require).
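
The wrap-around behaviour described above can be checked on a host with plain unsigned arithmetic. This is a minimal sketch, assuming a free-running 32-bit up-counter; `ticks_elapsed` is a hypothetical helper name, and a narrower counter would additionally need the difference masked to its width.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true once at least `delay` ticks have passed since `start`
 * on a free-running 32-bit up-counter (assumption). Unsigned
 * subtraction wraps modulo 2^32, so the delta stays correct even if
 * the counter overflowed between the two reads. */
static inline bool ticks_elapsed(uint32_t start, uint32_t now, uint32_t delay)
{
    return (uint32_t)(now - start) >= delay;
}
```

For example, with `start = 0xFFFFFFF0` and `now = 0x0000000E` the counter has wrapped, yet `now - start` evaluates to 30, the true number of ticks.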

I think you did not understand the problem. Accessing counters and comparing them takes more time than the required delay, and it is not **"clock tick accurate"** — 0___________, Jul 08 '17 at 19:35
