
Changing the pull-up/down resistors for GPIO pins on a Raspberry Pi requires waiting 150 cycles after enabling and after disabling the clock signal, according to the specs. Waiting a bit longer doesn't hurt, but a timer-based wait is longer by orders of magnitude, so I don't want to do that. So I have this simple busy loop:

for (int i = 0; i < 150; ++i) { asm volatile (""); }

which compiles to:

0:   e3a03096        mov     r3, #150        ; 0x96
4:   e2533001        subs    r3, r3, #1
8:   1afffffd        bne     4 <foo+0x4>

That loops 150 times, executing 300 instructions. Without instruction caching and without branch prediction that is certainly more than 150 cycles. But once those are turned on, the loop runs way faster, faster than 150 cycles I think.

So how do I wait close to 150 cycles with or without instruction caches and branch prediction enabled? Note: in the worst case it could be two functions, delay_no_cache() and delay_cache().

This is not a duplicate of How to delay an ARM Cortex M0+ for n cycles, without a timer?, since the instruction cache and branch prediction completely throw off the timing. Timing also differs between the Raspberry Pi (ARMv6) and the Raspberry Pi 2 (ARMv7).

Does anyone know the execution timings (with and without cache) if one were to insert a DMB, DSB (I guess those would be NOPs, since no RAM is accessed) or ISB instruction into the loop? Would that prevent the runaway effect when caches are enabled?

Goswin von Brederlow
  • possible duplicate of [How to delay an ARM Cortex M0+ for n cycles, without a timer?](http://stackoverflow.com/questions/27510198/how-to-delay-an-arm-cortex-m0-for-n-cycles-without-a-timer) – samgak Apr 07 '15 at 08:32
  • Not a duplicate. Idealy I'm looking for a loop that has (nearly) the same instruction timing with and without instruction caching and branch prediction. I know the simple loop above runs about 500 times faster with instruction caching and branch prediction. – Goswin von Brederlow Apr 07 '15 at 12:16

2 Answers


I did a few measurements on my Raspberry Pi 2, running delay() for 1-100000000 loops in factors of 10 and computing the cycle count from the time that passed. This shows that without caches, just a single pass through an empty loop is more than enough for a 150-cycle delay (that's just sub + bcs). A single NOP (in a sequence of 150) takes 32 cycles (5030 total) without caches and 1.5 cycles (226.5 total) with. The orr;add;and;mov;orr;add;and;mov; loop also shows how pipelined and superscalar the CPU is, taking just 0.15 cycles per opcode. Not so good for getting a well-timed loop.

In conclusion, I have to give up and just use a timer-based delay. That is actually faster without caches than a loop that takes 150 cycles with caches.

void delay(uint32_t count) {
    uint32_t a = 0, b = 0, c = 0, d = 0, e = 0, f = 0, g = 0, h = 0;
    while(count--) {
// branch icache dcache cycles/loop
// no     no     no     ~507
// no     no     yes      43.005
// no     yes    no        1.005
// no     yes    yes       1.005
// yes    no     no     ~507
// yes    no     yes      43.005
// yes    yes    no        1.005
// yes    yes    yes       1.005
// asm ("");

// branch icache dcache cycles/loop
// no     no     no     ~750
// no     no     yes      67.500
// no     yes    no       16.500
// no     yes    yes      16.500
// yes    no     no     ~750
// yes    no     yes      67.500
// yes    yes    no       16.500
// yes    yes    yes      16.500
// asm ("nop");
// asm ("nop");
// asm ("nop");
// asm ("nop");
// asm ("nop");
// asm ("nop");
// asm ("nop");
// asm ("nop");
// asm ("nop");
// asm ("nop");

// branch icache dcache cycles/loop
// no     no     no     ~505
// no     no     yes      43.500
// no     yes    no        1.500
// no     yes    yes       1.500
// yes    no     no     ~505
// yes    no     yes      43.500
// yes    yes    no        1.500
// yes    yes    yes       1.500
 asm ("orr %0, %0, %0" : "=r" (a) : "r" (a));
 asm ("add %0, %0, %0" : "=r" (b) : "r" (b));
 asm ("and %0, %0, %0" : "=r" (c) : "r" (c));
 asm ("mov %0, %0" : "=r" (d) : "r" (d));
 asm ("orr %0, %0, %0" : "=r" (e) : "r" (e));
 asm ("add %0, %0, %0" : "=r" (f) : "r" (f));
 asm ("and %0, %0, %0" : "=r" (g) : "r" (g));
 asm ("mov %0, %0" : "=r" (h) : "r" (h));

// branch icache dcache cycles/loop
// no     no     no     ~1010
// no     no     yes       85.005
// no     yes    no        18.000
// no     yes    yes       18.000
// yes    no     no     ~1010
// yes    no     yes       85.005
// yes    yes    no        18.000
// yes    yes    yes       18.000
// isb();

// branch icache dcache cycles/loop
// no     no     no     ~5075
// no     no     yes      481.501
// no     yes    no       141.000
// no     yes    yes      141.000
// yes    no     no     ~5075
// yes    no     yes      481.501
// yes    yes    no       141.000
// yes    yes    yes      141.000
// isb();
// isb();
// isb();
// isb();
// isb();
// isb();
// isb();
// isb();
// isb();
// isb();
    }
}
Goswin von Brederlow
  • `asm ("orr %0, %0, %0" : "=r" (a) : "r" (a));` doesn't use `%1`, so it might be using whatever garbage was in the output operand as input if the compiler happens to pick different input/output registers. You already have enough instruction-level parallelism that the loop-carried dependency chains wouldn't bottleneck on latency, though, so it doesn't really matter. (I'm skeptical of 0.15 cycles per instruction. Even AMD Ryzen can only manage 5 instructions per clock (or 6 uops if there are multi-uop instructions), and Intel x86 CPUs are 4-wide superscalar. 6.66 wide is not plausible. – Peter Cordes Sep 01 '17 at 19:49
  • Oh, I think I know what happened. you didn't use `asm volatile`, and your function doesn't use `a,b,c,d,e,f,g,h` after the loop, so gcc optimized away all your `asm` statements. (non-`volatile` `asm` with at least one output operand is considered a pure function of its inputs with no side effects. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile). Yup, https://godbolt.org/g/Qg7r8N confirms that's exactly what happened if you compiled with `-O1` or higher. Your function even compiles ok on x86, because the asm optimizes away :P Always check the compiler asm output. – Peter Cordes Sep 01 '17 at 19:52

You may need to use a repeat macro to do your delay. With a run-time loop there will always be optimization, and the loop itself costs time too. You could instead expand a NOP macro 150 times: no optimization, no redundant cycles.

Here's a repeated-macro template:

#define MACRO_CMB( A , B)           A##B
#define M_RPT(__N, __macro)         MACRO_CMB(M_RPT, __N)(__macro)

#define M_RPT0(__macro)
#define M_RPT1(__macro)             M_RPT0(__macro)   __macro(0)
#define M_RPT2(__macro)             M_RPT1(__macro)   __macro(1)
#define M_RPT3(__macro)             M_RPT2(__macro)   __macro(2)
...
#define M_RPT256(__macro)           M_RPT255(__macro) __macro(255)

You can define your NOP instruction like this:

#define MY_NOP(__N)                 __asm ("nop");    // or something like "MOV R0,R0"

Then you can repeat the instruction 150 times by just calling this:

M_RPT(150, MY_NOP);

It will really be executed 150 times.

Hope this helped.

kevin wu
  • That doesn't help at all. The problem isn't optimization, since the asm() prevents that. Nor is it the cost of the loop itself; that cost can be factored into the delay or minimized by unrolling. The problem is that 150 NOPs take 150 cycles without cache but only 15 cycles with (or something like it). – Goswin von Brederlow Apr 09 '15 at 11:05