Changing the pull up/down resistors for GPIO pins on a Raspberry Pi requires waiting for 150 cycles after enabling and after disabling the clock signal, according to the specs. Waiting a bit longer doesn't hurt, but using a timer to wait is slower by orders of magnitude, so I don't want to do that. So I have this simple busy loop:
for (int i = 0; i < 150; ++i) { asm volatile (""); }
0: e3a03096 mov r3, #150 ; 0x96
4: e2533001 subs r3, r3, #1
8: 1afffffd bne 4 <foo+0x4>
That loops 150 times, executing 300 instructions. Without instruction caching and without branch prediction that certainly takes more than 150 cycles. But once those are turned on the loop runs much faster, in fewer than 150 cycles I think.
So how do I wait close to 150 cycles, whether or not instruction caches and branch prediction are enabled? Note: in the worst case it could be two functions, delay_no_cache() and delay_cache().
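For context, this is roughly where the delay sits in the pull up/down sequence. It is only a sketch: the register offsets and the order of the steps follow the BCM2835 peripherals datasheet, while set_pull(), delay_150() and the gpio pointer (assumed to point at the already-mapped GPIO register block) are names made up for illustration. The delay is passed in as a function pointer so that either of the two variants mentioned above could be plugged in.

```c
#include <stdint.h>

/* Word offsets into the GPIO register block (BCM2835 peripherals datasheet). */
enum {
    GPPUD     = 0x94 / 4,  /* pull up/down control */
    GPPUDCLK0 = 0x98 / 4   /* pull up/down clock for pins 0..31 */
};

/* Pull modes written to GPPUD. */
enum { PULL_OFF = 0, PULL_DOWN = 1, PULL_UP = 2 };

/* The busy loop from the question: 150 iterations, NOT a guaranteed 150 cycles. */
static void delay_150(void)
{
    for (int i = 0; i < 150; ++i) { __asm__ volatile (""); }
}

/* Hypothetical helper: gpio is the mapped GPIO register block, pin is 0..31,
 * delay is whichever wait routine turns out to be long enough. */
static void set_pull(volatile uint32_t *gpio, unsigned pin, uint32_t mode,
                     void (*delay)(void))
{
    gpio[GPPUD] = mode;            /* 1. select the pull mode */
    delay();                       /* 2. wait 150 cycles (set-up time) */
    gpio[GPPUDCLK0] = 1u << pin;   /* 3. clock the mode into the chosen pin */
    delay();                       /* 4. wait 150 cycles (hold time) */
    gpio[GPPUD] = PULL_OFF;        /* 5. remove the control signal */
    gpio[GPPUDCLK0] = 0;           /* 6. remove the clock */
}
```

A call would look like set_pull(gpio, 4, PULL_UP, delay_150); the open question is what to pass as the last argument.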
This is not a duplicate of "How to delay an ARM Cortex M0+ for n cycles, without a timer?" since the instruction cache and branch prediction completely throw off the timing. Timing also differs between Raspberry Pi (ARMv6) and Raspberry Pi 2 (ARMv7).
Does anyone know the execution timings (with and without cache) if one inserted a DMB, DSB (I guess those would be NOPs since no RAM is accessed) or ISB instruction into the loop? Would that prevent the run-away effect when caches are enabled?