I'm writing some assembly code for the Cortex-M4, specifically the STM32F407VG found in the STM32F4DISCOVERY kit.
The code is extremely performance-sensitive, so I'm looking to squeeze every last cycle out of it. I have benchmarked it (using the DWT cycle counter available in the Cortex-M4) and, for a certain input size, it runs in 1494 cycles. The code runs from flash, and the CPU is downclocked to 24 MHz to ensure true zero-wait-state flash accesses (ART Accelerator disabled). Benchmarking two back-to-back reads of the DWT cycle counter yields a single cycle, so that's the sole benchmarking-related overhead.
The code reads only 5 constant 32-bit words from flash (which might cause bus-matrix contention, since both instructions and data are then fetched from flash); all other data memory accesses go to/from RAM. I've ensured all branch targets are 32-bit aligned, and manually added `.W` suffixes to certain instructions so that, with two exceptions, no 32-bit instruction is 16- but not 32-bit aligned. One of the exceptions doesn't even execute for this input size, and the other is the final `POP` of the function, which obviously doesn't run in a loop. Note that the majority of instructions use the 32-bit encoding: indeed, the average instruction length is 3.74 bytes.
I also made a spreadsheet accounting for every single instruction in my code: how many times each runs (if inside a loop), and even whether each branch was taken or not taken, since that affects how many cycles it costs. I took the per-instruction cycle counts from the Cortex-M4 Technical Reference Manual (TRM), always using the most conservative estimate: where an instruction's cost depends on a pipeline flush, I assumed the maximum 3 cycles, and I assumed the worst case for all loads and stores, despite the many special cases discussed in section 3.3.2 of the TRM that might actually reduce these counts. The spreadsheet covers the cost of every instruction between the two reads of the DWT cycle counter.
Thus, I was very surprised to learn that my spreadsheet predicts the code should run in 1268 cycles (recall the actual performance is 1494 cycles). I am at a loss to explain why the code runs 18% slower than the supposedly worst case according to instruction timings. Even fully unrolling the main loop of the code (which should be responsible for ~3/4 of the execution time) only brings it down to 1429 cycles -- and quickly adjusting the spreadsheet indicates that this unrolled version should run in 1186 cycles.
What's interesting is that a fully unrolled, carefully tuned C version of the same algorithm runs in 1309 cycles. It has 1013 instructions in total, whereas the fully unrolled version of my assembly code has 930 instructions. In both cases there is some code that handles a case not exercised by the particular input used for benchmarking, but there should be no significant differences between the C and assembly versions with regard to this unused code. Finally, the average instruction length of the C code is not significantly smaller: 3.59 bytes.
So: what could be causing this non-trivial discrepancy between predicted and actual performance in my assembly code? And what could the C version possibly be doing to run faster, despite having more instructions and a broadly similar (slightly smaller, but not by much) mix of 16- and 32-bit encodings?
Minimal reproducible example
As requested, here is a suitably anonymized minimal reproducible example. Because I isolated a single section of code, the error between prediction and actual measurements decreased to 12.5% for the non-unrolled version (and even less for the unrolled version: 7.6%), but I still consider this a bit high, especially the non-unrolled version, given the simplicity of the core and the use of worst-case timings.
First, the main assembly function:
// #define UNROLL

        .cpu cortex-m4
        .arch armv7e-m
        .fpu softvfp
        .syntax unified
        .thumb

        .macro MACRO r_0, r_1, r_2, d
        ldr lr, [r0, #\d]
        and \r_0, \r_0, \r_1, ror #11
        and \r_0, \r_0, \r_1, ror #11
        and lr, \r_0, lr, ror #11
        and lr, \r_0, lr, ror #11
        and \r_2, \r_2, lr, ror #11
        and \r_2, \r_2, lr, ror #11
        and \r_1, \r_2, \r_1, ror #11
        and \r_1, \r_2, \r_1, ror #11
        str lr, [r0, #\d]
        .endm

        .text
        .p2align 2
        .global f
f:
        push {r4-r11,lr}
        ldmia r0, {r1-r12}
        .p2align 2
#ifndef UNROLL
        mov lr, #25
        push.w {lr}
loop:
#else
        .rept 25
#endif
        MACRO r1, r2, r3, 48
        MACRO r4, r5, r6, 52
        MACRO r7, r8, r9, 56
        MACRO r10, r11, r12, 60
#ifndef UNROLL
        ldr lr, [sp]
        subs lr, lr, #1
        str lr, [sp]
        bne loop
        add.w sp, sp, #4
#else
        .endr
#endif
        stmia r0, {r1-r12}
        pop {r4-r11,pc}
This is the main code (requires the STM32F4 HAL; it outputs data via SWO, which can be read using the ST-Link Utility or the `st-trace` utility from here, with the command line `st-trace -c24`):
#include "stm32f4xx_hal.h"

void SysTick_Handler(void) {
    HAL_IncTick();
}

void SystemClock_Config(void) {
    RCC_OscInitTypeDef RCC_OscInitStruct;
    RCC_ClkInitTypeDef RCC_ClkInitStruct;

    // Enable Power Control clock
    __HAL_RCC_PWR_CLK_ENABLE();

    // The voltage scaling allows optimizing the power consumption when the device is
    // clocked below the maximum system frequency. To update the voltage scaling value
    // regarding system frequency, refer to the product datasheet.
    __HAL_PWR_VOLTAGESCALING_CONFIG(PWR_REGULATOR_VOLTAGE_SCALE2);

    // Enable HSE Oscillator and activate PLL with HSE as source
    RCC_OscInitStruct.OscillatorType = RCC_OSCILLATORTYPE_HSE;
    RCC_OscInitStruct.HSEState = RCC_HSE_ON;     // External 8 MHz xtal on OSC_IN/OSC_OUT
    RCC_OscInitStruct.PLL.PLLState = RCC_PLL_ON; // 8 MHz / 8 * 192 / 8 = 24 MHz
    RCC_OscInitStruct.PLL.PLLSource = RCC_PLLSOURCE_HSE;
    RCC_OscInitStruct.PLL.PLLM = 8;              // VCO input clock = 8 MHz / PLLM = 1 MHz
    RCC_OscInitStruct.PLL.PLLN = 192;            // VCO output clock = VCO input clock * PLLN = 192 MHz
    RCC_OscInitStruct.PLL.PLLP = RCC_PLLP_DIV8;  // PLLCLK = VCO output clock / PLLP = 24 MHz
    RCC_OscInitStruct.PLL.PLLQ = 4;              // USB clock = VCO output clock / PLLQ = 48 MHz
    if (HAL_RCC_OscConfig(&RCC_OscInitStruct) != HAL_OK) {
        while (1)
            ;
    }

    // Select PLL as system clock source and configure the HCLK, PCLK1 and PCLK2 clock dividers
    RCC_ClkInitStruct.ClockType = RCC_CLOCKTYPE_SYSCLK | RCC_CLOCKTYPE_HCLK | RCC_CLOCKTYPE_PCLK1 | RCC_CLOCKTYPE_PCLK2;
    RCC_ClkInitStruct.SYSCLKSource = RCC_SYSCLKSOURCE_PLLCLK; // 24 MHz
    RCC_ClkInitStruct.AHBCLKDivider = RCC_SYSCLK_DIV1;        // 24 MHz
    RCC_ClkInitStruct.APB1CLKDivider = RCC_HCLK_DIV1;         // 24 MHz
    RCC_ClkInitStruct.APB2CLKDivider = RCC_HCLK_DIV1;         // 24 MHz
    if (HAL_RCC_ClockConfig(&RCC_ClkInitStruct, FLASH_LATENCY_0) != HAL_OK) {
        while (1)
            ;
    }
}

// Print a cycle count over SWO as exactly 4 decimal digits
void print_cycles(uint32_t cycles) {
    uint32_t q = 1000, t;
    for (int i = 0; i < 4; i++) {
        t = (cycles / q) % 10;
        ITM_SendChar('0' + t);
        q /= 10;
    }
    ITM_SendChar('\n');
}

void f(uint32_t *);

int main(void) {
    uint32_t x[16];
    SystemClock_Config();
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // enable trace, required for DWT
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            // enable the cycle counter
    uint32_t before, after;
    while (1) {
        __disable_irq();
        before = DWT->CYCCNT;
        f(x);
        after = DWT->CYCCNT;
        __enable_irq();
        print_cycles(after - before);
        HAL_Delay(1000);
    }
}
I believe this is enough to drop into a project containing the STM32F4 HAL and run. The project needs a global `#define` of `HSE_VALUE=8000000`, since the HAL assumes a 25 MHz crystal rather than the 8 MHz crystal actually fitted to the board.

You can choose between the unrolled and non-unrolled versions by uncommenting or commenting `#define UNROLL` at the start of the assembly file.

Running `arm-none-eabi-objdump` on the `main()` function and looking at the call site:
80009da: 4668 mov r0, sp
before = DWT->CYCCNT;
80009dc: 6865 ldr r5, [r4, #4]
f(x);
80009de: f7ff fbd3 bl 8000188 <f>
after = DWT->CYCCNT;
80009e2: 6860 ldr r0, [r4, #4]
Thus, the only instruction between the two reads of the DWT cycle counter is the `bl` that branches into the `f()` assembly function.
The non-unrolled version runs in 1536 cycles, whereas the unrolled version runs in 1356 cycles.
Here is my spreadsheet for the non-unrolled version (not accounting for the already measured 1-cycle overhead of reading the DWT cycle counter):
| Instruction | Loop iters | Macro repeats | Count | Cycle count | Total cycles |
|---|---|---|---|---|---|
| `bl` (from `main`) | 1 | 1 | 1 | 4 | 4 |
| `push` (12 regs) | 1 | 1 | 1 | 13 | 13 |
| `ldmia` (12 regs) | 1 | 1 | 1 | 13 | 13 |
| `mov` | 1 | 1 | 1 | 1 | 1 |
| `push` (1 reg) | 1 | 1 | 1 | 2 | 2 |
| `ldr` | 25 | 4 | 1 | 2 | 200 |
| `and` | 25 | 4 | 8 | 1 | 800 |
| `str` | 25 | 4 | 1 | 2 | 200 |
| `ldr` | 1 | 1 | 1 | 2 | 2 |
| `subs` | 1 | 1 | 1 | 1 | 1 |
| `str` | 1 | 1 | 1 | 2 | 2 |
| `bne` (taken) | 24 | 1 | 1 | 4 | 96 |
| `bne` (not taken) | 1 | 1 | 1 | 1 | 1 |
| `stmia` (12 regs) | 1 | 1 | 1 | 13 | 13 |
| `pop` (11 regs + pc) | 1 | 1 | 1 | 16 | 16 |
| **Total** | | | | | **1364** |
The last column is just the product of the 2nd through 5th columns of the table, and the last row is a sum of all values in the "Total" column. This is the predicted execution time.
Thus, for the non-unrolled version: 1536/(1364 + 1) - 1 = 12.5% error (the + 1 term is to account for the DWT cycle counter overhead).
As for the unrolled version, a few rows must be removed from the table above: the loop setup (`mov` and `push` (1 reg)) and the loop counter update and branch (`ldr`, `subs`, `str`, and `bne`, both taken and not taken). These account for 105 cycles, so the predicted execution time would be 1259 cycles.
For the unrolled version, we have 1356/(1259 + 1) - 1 = 7.6% error.