
Since AFAIK cycle timings are not published, I've decided to try to measure cycle counts using the DWT cycle counter on an STM32H750-DK; as a first example, I'm measuring a simple delay loop.

It seems the Cortex-M7 can execute two instructions per cycle. I could understand this if they were encoded as 16-bit instructions, but the results are the same even if I use registers R8 and above, which forces 32-bit encodings.

Is branch prediction really the main player here? On the first run I get more cycles, but on subsequent repetitions a constant overhead of 6 cycles is observed regardless of N.

Is there any more information somewhere about the Cortex-M7 pipeline that would help explain the results I got? I'm not even sure the results make sense. Am I interpreting them correctly?
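For reference, this is roughly how the DWT cycle counter is typically set up on an ARMv7-M core: a minimal C sketch, using the standard ARMv7-M debug register addresses (verify them against the STM32H750 reference manual). The helper names are invented; the wraparound-safe subtraction is the part worth noting, since `DWT_CYCCNT` is a free-running 32-bit counter.

```c
#include <assert.h>
#include <stdint.h>

/* Standard ARMv7-M debug/DWT register addresses
 * (check against the chip's reference manual). */
#define DEMCR       (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004u)

/* Enable the cycle counter once at startup. */
static inline void dwt_init(void)
{
    DEMCR     |= (1u << 24);  /* TRCENA: enable the DWT/ITM blocks */
    DWT_CYCCNT = 0;
    DWT_CTRL  |= 1u;          /* CYCCNTENA: start counting */
}

/* Unsigned subtraction handles a single 32-bit wraparound correctly,
 * so start/end snapshots work without resetting the counter. */
static inline uint32_t dwt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

The measurement itself is then just two reads of `DWT_CYCCNT` around the code under test; remember that the reads themselves contribute a few cycles of constant overhead.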

//-------------- not measured --------------------------
//      ldr r5,=N
// ------------- code under cycle measurement ------
// tloop:  subs r5,r5,#1
//         bne  tloop
// ------------- end of measured code --------------
/*
// Timings - usually in second or more repetitions
// (on first one cycles are higher in brackets)
╔═══════╤════════════════╗
║ N     │ DWT_CYCCNT(1st)║
╠═══════╪════════════════╣
║ 50    │ 56     (78)    ║
╟───────┼────────────────╢
║ 100   │ 106    (128)   ║
╟───────┼────────────────╢
║ 200   │ 206            ║
╟───────┼────────────────╢
║ 500   │ 506            ║
╟───────┼────────────────╢
║ 1000  │ 1006           ║
╟───────┼────────────────╢
║ 64000 │ 64006 (64028)  ║
╚═══════╧════════════════╝
Comment: R5 instructions are 16-bit, R8 instructions are 32-bit,
         but both give the same timing.
         If a nop is added, for N=64000 the results are 96030 (first run) and 96006 thereafter.
Conclusion: it seems that branch prediction is the main influence here.
*/
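For what it's worth, the warmed-up numbers in the table fit a simple dual-issue model: total instructions executed, divided by an issue width of two, plus a constant overhead of about 6 cycles. This is a speculative fit to the measurements above, not a published specification:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model fitted to the warmed-up measurements:
 * cycles = ceil(total_instructions / 2) + 6.
 * Two-wide issue with a correctly predicted loop branch gives
 * ~1 cycle per 2-instruction (subs+bne) iteration; adding a nop
 * (3 instructions) gives ~1.5 cycles per iteration. */
static uint32_t model_cycles(uint32_t n, uint32_t insns_per_iter)
{
    uint64_t total_insns = (uint64_t)n * insns_per_iter;
    return (uint32_t)((total_insns + 1u) / 2u) + 6u;
}
```

This reproduces both the N+6 figures for the plain loop and the 96006 figure for the loop with a nop at N=64000, which is at least consistent with the "superscalar plus branch prediction" interpretation.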
user2064070
  • At least for some of the Cortex-A cores, ARM publishes a Software Optimization Guide that gives cycle counts, pipeline details, etc. Unfortunately they do not seem to have one for the M7. – Nate Eldredge Nov 23 '22 at 18:00
  • They do for the M0/M3/M4 (in the core TRM); I don't know why not for the M7, whether they forgot or think it is more commercially sensitive. – Tom V Nov 23 '22 at 20:37
  • To my understanding, the M7 features a superscalar pipeline, capable of executing more than one instruction per cycle, which makes it impossible to provide a simple instruction timing table. Maybe someday ARM will publish official guidelines, but for now there are only discussions and articles like [this](https://www.quinapalus.com/cm7cycles.html). – Flexz Nov 24 '22 at 06:00
  • 32-bit instruction width isn't a problem; https://en.wikipedia.org/wiki/ARM_Cortex-M#Cortex-M7 says the instruction bus is 64 bits wide. Also, M7 can optionally have data and/or instruction cache. (Wikipedia also agrees that M7 has a superscalar pipeline, so yes with correct branch prediction, it could well execute this at 1 cycle per iteration, 2 IPC, once the branch predictor warms up or whatever other effects need to happen for it to get into the optimal throughput state.) – Peter Cordes Nov 24 '22 at 07:14
  • https://stackoverflow.com/questions/73690767/profiling-memcpy-performance-on-cortex-m7-stm32f7/73694265 here is one; I think I turn off the branch prediction there, etc. It definitely demonstrates alignment, which is a big problem on the M7 (for trying to benchmark/profile). The other Cortex-Ms may have an optional word fetch, which has alignment problems, but others have a halfword fetch and at least that problem is avoided. Not that it is a bad problem; it is just that you have to understand performance and manage expectations. – old_timer Nov 24 '22 at 13:48
  • No need to use the DWT_ timer on these cores for performance; systick is easier and gives the same results (yes, true, some run systick at half rate, but it is still more than adequate for profiling and more often available). – old_timer Nov 24 '22 at 14:06
  • Slightly related question (and answer), although about the Cortex-M4: [What is conditional assembly branch instruction duration for different situations in ARM Cortex-M4?](https://stackoverflow.com/questions/70153316/) – wovano Nov 27 '22 at 09:02

1 Answer


You are on an STM32, so there is a flash cache and prefetcher. If you are running from flash then that will affect your results.

That particular chip also requires flash wait states depending on clock frequency and voltage, further affecting your fetch rates.

The Cortex-M7 has a good-sized fetch line, and where small loops land relative to it, alignment can have a dramatic effect on overall performance: tens of percent, up to doubling the execution time for the same machine code.

The Cortex-M7 has a branch predictor (ARM may not use that exact term, but it is there), and if I remember right it is enabled by default.

This is not a PIC. We do not look at instructions and count clocks; we write applications and then profile them if needed. Particularly on architectures/cores like these, adding or removing a single line of high-level-language code can cause double-digit percent performance changes in either direction.

Folks have argued with me that these cores are in fact predictable, and they are, in the sense that the same code sequence, absent other non-deterministic effects, will run in the same number of clocks; I have demonstrated that many times. But add a NOP that changes the alignment of that code and the clock count can change, sometimes by a dramatic amount, settling on a different but again consistent number. These are pipelined processors, even if not very deep (for the Cortex-M0 and such), and that means they are not predictable by inspecting instructions and counting cycles like in the good old days.
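To make the alignment sensitivity controllable rather than accidental, one option is to pin the measured routine to a known boundary so edits elsewhere cannot silently move it across a fetch line. A hypothetical GCC/Clang sketch (`bench_loop` is an invented name, and the attribute only guarantees a minimum alignment):

```c
#include <assert.h>
#include <stdint.h>

/* Force the routine under test onto a 16-byte boundary so that adding
 * or removing code elsewhere cannot move it across a fetch line.
 * (GCC/Clang extension; 'bench_loop' is an invented example name.) */
__attribute__((aligned(16)))
static uint32_t bench_loop(uint32_t n)
{
    uint32_t x = 0;
    while (n--)        /* stands in for the subs/bne loop under test */
        x += 1;
    return x;
}
```

Comparing runs at different forced alignments (16, 8, 4 plus a padding NOP, etc.) separates alignment effects from the effect you actually meant to measure.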

You also have systemic effects. ARM sells processor cores as IP, not chips. The chip vendor plays a huge role in execution performance (the same goes for x86; we have not been processor-bound for a long time): how the buses are handled, the flash and SRAM IP they buy, arbitration, and so on. So, as stated above, ST does things differently from TI and NXP with respect to their Cortex-M products, and all of them are going to have flash performance side effects; even "zero wait states" typically means the flash runs at half the processor clock speed. With the flash acceleration side effects disabled (you have to use a TI or maybe NXP part; you cannot do this on ST) and zero wait states on the flash, performance from flash is half that of SRAM for the same machine code at the same alignment. At least I have seen that on a number of products; with ST you can play some games to flush the cache and take a single run at the code.

If your goal is to see whether the Cortex-M7 is superscalar, fill the SRAM with hundreds or thousands of instructions, then loop over them: one big massive loop that is 99.99...% the instruction under test. Turn off branch prediction and any caching (at that point the few clocks of branch handling should really be in the wash) and see what you see. I read the databook and datasheet for this question, but I did not go back to see what the SRAM performance is. High-performance cores like ARM's are going to have sensitivities to the system: fetching, loads, and stores. MCUs make it worse with clock domains, and peripherals are a whole other deal (sampling a GPIO pin in a loop is not as fast as most people expect).
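The "mostly the instruction under test" idea above can be sketched with a repetition macro, so loop overhead and the branch become a vanishing fraction of what is measured. A hedged illustration in C (`x += 1` stands in for the real instruction under test; on the target you would time `run_unrolled` with the cycle counter):

```c
#include <assert.h>
#include <stdint.h>

/* Unroll the operation under test 256 times per loop iteration so the
 * loop's subs/bne overhead is "in the wash". Real tests would use far
 * more copies, placed in SRAM, with prediction/caches disabled. */
#define REP4(x)   x x x x
#define REP16(x)  REP4(REP4(x))
#define REP256(x) REP16(REP16(x))

static uint32_t run_unrolled(uint32_t iters)
{
    uint32_t x = 0;
    while (iters--) {
        REP256(x += 1;)   /* 256 copies of the operation under test */
    }
    return x;
}
```

With 256 copies per iteration, the two loop-maintenance instructions are under 1% of the executed instructions, so the measured cycles-per-instruction reflects the instruction itself rather than the branch.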

The compilers do not know the system either. To get a difficult constant (0x12345678) into a register they will do a PC-relative load from a literal pool instead of using the Thumb-2 MOVW/MOVT pair (load the low half, then the top half). The pair is 64 bits of instructions, but it fetches linearly, rather than stopping to do a separate load cycle from slow flash, which costs extra clocks. Programmers do not realize this either when they try to count clocks to increase performance, if that is your ultimate goal here.
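The MOVW/MOVT trade-off mentioned above can be modeled in plain C: MOVW writes the low 16 bits of the register and MOVT the high 16 bits, so a 32-bit constant is materialized in two linearly fetched instructions instead of one literal-pool load. A small sketch (the function name is invented):

```c
#include <assert.h>
#include <stdint.h>

/* Materialize a 32-bit constant the way a MOVW/MOVT pair does:
 *   movw r0, #0x5678   ; write the low 16 bits, clear the rest
 *   movt r0, #0x1234   ; write the high 16 bits
 * No data-side memory access is needed, unlike "ldr r0, =const",
 * which loads the value from a literal pool (possibly in slow flash). */
static uint32_t movw_movt(uint16_t low, uint16_t high)
{
    uint32_t r = low;               /* movw: low half, upper half zeroed */
    r |= (uint32_t)high << 16;      /* movt: high half, low half kept   */
    return r;
}
```

Whether the pair or the literal load is faster depends on the system (flash wait states, caches, fetch width), which is exactly the answer's point.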

The bottom line is that you are not processor-bound. You cannot reason from the pipeline and the instruction sequence alone unless you are running the core in simulation with a perfect simulated memory, where the read data bus responds to the read address bus on the first available clock cycle. Even then, with this core, you would still see branch prediction and fetch-line alignment effects. On a real MCU you always have flash issues, sometimes SRAM issues, and sometimes general chip glue/implementation issues.

old_timer
  • "_I did not go back and see what the SRAM performance is_" There are several SRAM areas, but the "big" one (the AXI SRAM) has a 64-bit bus and runs at half the MCU core speed (so half of 480 MHz for the fastest MCU). If I remember correctly, I once benchmarked a memcpy at roughly 1 GB/s on this core. – wovano Nov 27 '22 at 08:54
  • Just to add one more thing: interrupts can of course also affect timing. – wovano Nov 27 '22 at 08:55
  • I generally agree with the comments. But I'm still missing more information about how this pipeline works or how to optimize assembly code to run on this core... There are "developer guides" for other architectures (particularly older ones), but none (AFAIK) for the M7... – user2064070 Dec 01 '22 at 12:50
  • @user2064070, note that "_how to explain [these] results_", "_how this pipeline works_" and "_how to optimize assembly code_" are 3 different questions. And possibly what you actually want to achieve is another question. Make sure to clarify/focus your question to get the answer you're looking for. Also note that searching for "ARM Cortex-M7 pipeline" will lead you to several documents with information about the pipeline. Regarding the optimization of assembly code: the point of the above answer is "_We do not look at instructions and count clocks, we write applications and then profile them._" – wovano Dec 01 '22 at 15:20