I am porting some code from an M3 to an M4 which uses 3 NOPs to provide a very short delay between serial output clock changes. The M3 instruction set defines the time for a NOP as 1 cycle. I notice that NOPs in the M4 do not necessarily delay any time at all. I am aware that I will need to disable compiler optimisation but I'm looking for a low level command that will give me reliable, repeatable times. In practice in this particular case the serial is used very occasionally and could be very slow but I'd still like to know the best way to obtain cycle level delays.
-
Are you unable to use a UART or peripheral timer? – rjp May 12 '14 at 16:42
-
No, I have no timers available that could be set up in time or spared for free running. – Ant May 13 '14 at 08:46
-
the uart has its own clock divisor. – old_timer May 14 '14 at 03:33
-
I am unable to use a UART or peripheral timer to generate a 24ns delay. – Ant May 14 '14 at 10:22
-
According to the [ARM Cortex-M3 Devices Generic User Guide](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/CHDJJGFB.html), the NOP instruction will not necessarily consume any time on a Cortex-M3 either. – Blue Jun 26 '18 at 10:27
4 Answers
If you need such very short but deterministic "at least" delays, consider using instructions other than `nop` which have a deterministic, nonzero latency. The Cortex-M4 NOP, as documented, is not necessarily time-consuming. You could replace it with, say, `and reg, reg`, or something coarsely equivalent to a `nop` in context. Alternatively, when toggling GPIO, you could repeat the I/O instructions themselves to enforce a minimum state duration (for example, if your GPIO write instruction takes at least 5ns, repeat it five times to get at least 25ns). This can even work well within C (just repeat the writes to the port; if it's `volatile`, as it should be, the compiler won't remove the repeated accesses).
Of course this only applies to very short delays. For longer delays, as mentioned by others, busy loops waiting on some timing source work much better (they take at least the clocks required to sample the timing source, set up the target, and go through the wait loop once).
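As a concrete illustration of the repeated-write idea above, here is a hedged sketch in C. The register name `GPIO_ODR` and the pin bit are hypothetical placeholders; on real hardware `GPIO_ODR` would be the memory-mapped output data register of your port, while here a plain `volatile` variable stands in for it so the snippet compiles anywhere:

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped GPIO output register;
 * on a real part this would be something like
 * #define GPIO_ODR (*(volatile uint32_t *)0x40020014) */
static volatile uint32_t GPIO_ODR;

#define CLK_PIN (1u << 5) /* hypothetical serial-clock pin */

static void clock_high_stretched(void)
{
    /* Because GPIO_ODR is volatile, the compiler must emit every
     * store; each one costs at least one bus write, so three writes
     * guarantee roughly three times the single-write duration. */
    GPIO_ODR |= CLK_PIN;
    GPIO_ODR |= CLK_PIN;
    GPIO_ODR |= CLK_PIN;
}

static void clock_low(void)
{
    GPIO_ODR &= ~CLK_PIN;
}
```

The minimum stretch you get this way depends on the bus and GPIO timing of the specific part, so it still has to be verified on a scope.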

-
Many thanks, as I said below I'm using MOV R0,#1. It's been in use on many production units since shortly after I wrote the question in 2014 and so far it's worked perfectly. – Ant Jun 01 '17 at 08:18
Use the cycle-counting register (DWT_CYCCNT) to get high-precision timing.
Note: I have also tested this using digital pins and an oscilloscope, and it is extremely accurate.
See `stopwatch_delay(ticks)` and the supporting code below, which uses the STM32's DWT_CYCCNT register, specifically designed to count actual clock ticks, located at address 0xE0001004.
See `main` for an example that uses `STOPWATCH_START`/`STOPWATCH_STOP` to measure how long `stopwatch_delay(ticks)` actually took, using `CalcNanosecondsFromStopwatch(m_nStart, m_nStop)`. Modify the `ticks` input to make adjustments.
#include <stdint.h>
#include <stdio.h>

uint32_t m_nStart; // DEBUG Stopwatch start cycle counter value
uint32_t m_nStop;  // DEBUG Stopwatch stop cycle counter value

/* Core Debug registers */
#define DEMCR_TRCENA    0x01000000
#define DEMCR           (*((volatile uint32_t *)0xE000EDFC))
#define DWT_CTRL        (*(volatile uint32_t *)0xE0001000)
#define CYCCNTENA       (1 << 0)
#define DWT_CYCCNT      ((volatile uint32_t *)0xE0001004)
#define CPU_CYCLES      *DWT_CYCCNT
#define CLK_SPEED       168000000 // EXAMPLE for Cortex-M4, EDIT as needed

#define STOPWATCH_START { m_nStart = *((volatile uint32_t *)0xE0001004); }
#define STOPWATCH_STOP  { m_nStop  = *((volatile uint32_t *)0xE0001004); }

static inline void stopwatch_reset(void)
{
    /* Enable DWT */
    DEMCR |= DEMCR_TRCENA;
    *DWT_CYCCNT = 0;
    /* Enable CPU cycle counter */
    DWT_CTRL |= CYCCNTENA;
}

static inline uint32_t stopwatch_getticks(void)
{
    return CPU_CYCLES;
}

static inline void stopwatch_delay(uint32_t ticks)
{
    uint32_t end_ticks = ticks + stopwatch_getticks();
    while (1)
    {
        if (stopwatch_getticks() >= end_ticks)
            break;
    }
}

uint32_t CalcNanosecondsFromStopwatch(uint32_t nStart, uint32_t nStop)
{
    uint32_t nDiffTicks;
    uint32_t nSystemCoreTicksPerMicrosec;

    // Convert (clk speed per sec) to (clk speed per microsec)
    nSystemCoreTicksPerMicrosec = CLK_SPEED / 1000000;

    // Elapsed ticks
    nDiffTicks = nStop - nStart;

    // Elapsed nanosec = 1000 * (ticks-elapsed / clock-ticks in a microsec)
    return 1000 * nDiffTicks / nSystemCoreTicksPerMicrosec;
}

void main(void)
{
    int timeDiff = 0;
    stopwatch_reset();

    // =============================================
    // Example: use a delay, and measure how long it took
    STOPWATCH_START;
    stopwatch_delay(168000); // 168k ticks is 1ms for a 168MHz core
    STOPWATCH_STOP;
    timeDiff = CalcNanosecondsFromStopwatch(m_nStart, m_nStop);
    printf("My delay measured to be %d nanoseconds\n", timeDiff);

    // =============================================
    // Example: measure function duration in nanosec
    STOPWATCH_START;
    // run_my_function() => do something here
    STOPWATCH_STOP;
    timeDiff = CalcNanosecondsFromStopwatch(m_nStart, m_nStop);
    printf("My function took %d nanoseconds\n", timeDiff);
}

-
You can also verify this behavior with an oscilloscope, and digital pins. – bunkerdive Apr 21 '17 at 04:53
-
@Ant, you can set the delay in ticks as needed; How short were you hoping for? – bunkerdive Oct 29 '19 at 03:54
-
Same comments as in [this answer](https://stackoverflow.com/a/19124472/69809). On a 168MHz processor, `DWT_CYCCNT` overflows after 25 seconds, but when you do `1000 * nDiffTicks`, you will overflow it after 25ms, which is unnecessary. `stopwatch_reset()` is also usually not needed, although if you remove it then `stopwatch_getticks() >= end_ticks` won't work. I would suggest a simpler (and correct) implementation like [the `delayUS_DWT` function posted near the end of this article](https://www.carminenoviello.com/2015/09/04/precisely-measure-microseconds-stm32/#crayon-5de3265384f2c364446364). – vgru Dec 02 '19 at 13:37
-
Surely the best way to measure cycles. Can you give documentation where the DWT registers are presented and explain how they work? The ARM and ST docs I could find only present the register fields and directly relate them to ETM, ITM and so on – Welgriv Apr 22 '20 at 07:57
For any reliable timing, I always suggest using a general-purpose timer. Your part may have a timer capable of clocking fast enough to give you the timing you need. For serial, is there a reason you can't use a corresponding serial peripheral? Most of the Cortex M3/M4s that I'm aware of offer USARTs, I2C, and SPI, with many also offering SDIO, which should cover most needs.
If that is not possible, this stackoverflow question/answer details using the cycle counter, if available, on a Cortex M3/M4. You could grab the cycle counter, add a few to it, and poll it, but I don't think you would achieve anything much below ~8 cycles of minimum delay with this method.
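To make cycle-counter polling robust, the comparison should use unsigned subtraction so it survives counter wrap-around (a 32-bit counter at 168MHz wraps every ~25 seconds). A hedged sketch of the idea, with a host-side stub standing in for the real cycle-counter read:

```c
#include <stdint.h>

/* Host-side stub: on a Cortex-M3/M4 this would read the real
 * cycle counter, e.g. return *(volatile uint32_t *)0xE0001004; */
static uint32_t fake_cyccnt = 0xFFFFFFF8u; /* start near wrap for demo */

static uint32_t read_cycles(void)
{
    return fake_cyccnt++; /* each read advances the "clock" by one tick */
}

/* Wrap-safe busy-wait: (now - start) is computed modulo 2^32,
 * so the loop terminates correctly even across an overflow. */
static void delay_cycles(uint32_t ticks)
{
    uint32_t start = read_cycles();
    while ((uint32_t)(read_cycles() - start) < ticks)
        ; /* spin */
}
```

Note the contrast with a naive `getticks() >= end_ticks` test, which deadlocks if `ticks + start` wraps past zero.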
-
This is not standard serial, for SPI and I2C I am happily using peripherals. This needs to be GPIO driven with a few cycles delay. I agree also that the cycle counter wouldn't work. – Ant May 13 '14 at 08:49
Well, first you have to run from RAM, not flash, as the flash timing is going to be slow; one nop can take many cycles. The GPIO accesses should take a few clocks at least as well, so you probably won't need/want nops, just pound on the GPIO. The branch at the end of the loop will be noticeable as well. You should write a few instructions to RAM, branch to them, and see how fast you can wiggle the GPIO.
The bottom line, though, is that if you are on such a tight budget that your serial clock is that close to your processor clock in speed, it is very likely you are not going to get this to work with this processor. Upping the PLL in the processor won't change the flash speed; it can make it worse (relative to the processor clock). The SRAM should scale, though, so if you have headroom left on your processor clock and the power budget to support it, then repeat the experiment in SRAM with a faster processor clock speed.

-
In practice 3 NOPs gives me just the time I want but I don't think that is good enough as the documentation states that they may be removed by the pipeline. I could imagine shipping product with a next version processor that has better optimisation and suddenly nothing works as previously. I'm looking for a reliable method of inserting a few nanosecond delay. I'm currently using MOV R0,#1 after turning off compiler optimisation, as I have found no comment about these being removed. – Ant May 14 '14 at 10:26
-
I would think about that statement: what would cause them to decide to remove them from the pipeline, what internal or outside forces? If your code is not changing and the system is tightly controlled, the core would not have any new inputs or fetch variations, etc., that would cause the pipe to not do the same thing it has always been doing. Now on the other hand, sure, from one rev of chip to another that might change, but you can look at the rev of the cores that are available and the rev that the chip vendor is using (I suspect they don't just pop out a cortex-m4 and replace it with another – old_timer May 14 '14 at 15:01
-
The bottom line is the same: if the best you can do is three nops to get your timing, and this is not a PIC, you are too tight. You need some other chip; your processor speed to signal speed does not have enough margin. – old_timer May 14 '14 at 15:03
-
What would cause them to decide to remove them from the pipeline? Because they are implementing what they documented - the documentation says they may be removed. Need some other chip - this product is in production; it's not a bedroom hobby. – Ant May 15 '14 at 12:01
-
Well, that is part of the game with documents: it may be that all implementations of that core or family of cores do it some of the time. It may be that some versions of one of the cores do it and the others don't. Once you get past those questions, then if there is a core that "sometimes" does it, the question is what determines the times it does and doesn't, and I certainly don't know the answers to any of those questions. I still contend that with this processor, even if the nops execute EVERY time, you are too tight on your processor speed to signal ratio. – old_timer May 15 '14 at 18:28