I'm writing some assembly code for the Cortex-M4, specifically the STM32F407VG found in the STM32F4DISCOVERY kit.
The code is extremely performance-sensitive, so I'm looking to squeeze every last cycle out of it. I have benchmarked it (using the DWT cycle counter available in the Cortex-M4) and, for a certain input size, it runs in 1494 cycles. The code runs from flash, and the CPU is downclocked to 24 MHz to ensure true zero-wait-state flash accesses (ART Accelerator disabled). Benchmarking two back-to-back reads of the DWT cycle counter yields a single cycle, so that's the sole benchmarking-related overhead.
The code reads only 5 constant 32-bit words from flash (which might cause bus-matrix contention, since both instructions and data are then fetched from flash); all other data memory accesses go to/from RAM. I've ensured all branch targets are 32-bit aligned, and manually added `.W` suffixes to certain instructions so that, with two exceptions, no 32-bit instruction is 16- but not 32-bit aligned. One of the exceptions doesn't even execute for this input size, and the other is the final `POP` of the function, which obviously doesn't run in a loop. Note that the majority of instructions use the 32-bit encoding: indeed, the average instruction length is 3.74 bytes.
I also made a spreadsheet accounting for every single instruction in my code: how many times each runs (if inside a loop), and even whether each branch was taken or not taken, since that affects how many cycles it costs. I took the per-instruction cycle counts from the Cortex-M4 Technical Reference Manual (TRM), always using the most conservative estimate: where an instruction's cost depends on a pipeline flush, I assumed the maximum 3 cycles, and I assumed the worst case for all loads and stores, despite the many special cases discussed in section 3.3.2 of the TRM that might actually reduce these counts. The spreadsheet covers the cost of every instruction between the two reads of the DWT cycle counter.
Thus, I was very surprised to learn that my spreadsheet predicts the code should run in 1268 cycles (recall the actual performance is 1494 cycles). I am at a loss to explain why the code runs 18% slower than the supposedly worst case according to instruction timings. Even fully unrolling the main loop of the code (which should be responsible for ~3/4 of the execution time) only brings it down to 1429 cycles -- and quickly adjusting the spreadsheet indicates that this unrolled version should run in 1186 cycles.
What's interesting is that a fully unrolled, carefully tuned C version of the same algorithm runs in 1309 cycles. It has 1013 instructions in total, whereas the fully unrolled version of my assembly code has 930 instructions. In both cases there is some code that handles a case not exercised by the particular input used for benchmarking, but there should be no significant differences between the C and assembly versions with regard to this unused code. Finally, the average instruction length of the C code is not significantly smaller: 3.59 bytes.
So: what could be causing this non-trivial discrepancy between predicted and actual performance in my assembly code? And what could the C version possibly be doing to run faster, despite having more instructions and a broadly similar (slightly smaller, but not by much) mix of 16- and 32-bit encodings?
Minimal reproducible example
As requested, here is a suitably anonymized minimal reproducible example. Because I isolated a single section of code, the error between prediction and actual measurements decreased to 12.5% for the non-unrolled version (and even less for the unrolled version: 7.6%), but I still consider this a bit high, especially the non-unrolled version, given the simplicity of the core and the use of worst-case timings.
First, the main assembly function:
// #define UNROLL

        .cpu cortex-m4
        .arch armv7e-m
        .fpu softvfp
        .syntax unified
        .thumb

        .macro MACRO r_0, r_1, r_2, d
        ldr lr, [r0, #\d]
        and \r_0, \r_0, \r_1, ror #11
        and \r_0, \r_0, \r_1, ror #11
        and lr, \r_0, lr, ror #11
        and lr, \r_0, lr, ror #11
        and \r_2, \r_2, lr, ror #11
        and \r_2, \r_2, lr, ror #11
        and \r_1, \r_2, \r_1, ror #11
        and \r_1, \r_2, \r_1, ror #11
        str lr, [r0, #\d]
        .endm

        .text
        .p2align 2
        .global f
f:
        push {r4-r11,lr}
        ldmia r0, {r1-r12}
        .p2align 2
#ifndef UNROLL
        mov lr, #25
        push.w {lr}
loop:
#else
        .rept 25
#endif
        MACRO r1, r2, r3, 48
        MACRO r4, r5, r6, 52
        MACRO r7, r8, r9, 56
        MACRO r10, r11, r12, 60
#ifndef UNROLL
        ldr lr, [sp]
        subs lr, lr, #1
        str lr, [sp]
        bne loop
        add.w sp, sp, #4
#else
        .endr
#endif
        stmia r0, {r1-r12}
        pop {r4-r11,pc}
This is the main code (requires the STM32F4 HAL; it outputs data via SWO, which can be read using the ST-Link Utility or the `st-trace` utility from here, with the command line `st-trace -c24`):
#include "stm32f4xx_hal.h"

void SysTick_Handler(void) {
    HAL_IncTick();
}

void SystemClock_Config(void) {
    RCC_OscInitTypeDef RCC_OscInitStruct;
    RCC_ClkInitTypeDef RCC_ClkInitStruct;

    // Enable Power Control clock
    __HAL_RCC_PWR_CLK_ENABLE();

    // The voltage scaling allows optimizing the power consumption when the device is
    // clocked below the maximum system frequency. To update the voltage scaling value
    // regarding system frequency, refer to the product datasheet.
    __HAL_PWR_VOLTAGESCALING_CONFIG(PWR_REGULATOR_VOLTAGE_SCALE2);

    // Enable HSE Oscillator and activate PLL with HSE as source
    RCC_OscInitStruct.OscillatorType = RCC_OSCILLATORTYPE_HSE;
    RCC_OscInitStruct.HSEState = RCC_HSE_ON;     // External 8 MHz xtal on OSC_IN/OSC_OUT
    RCC_OscInitStruct.PLL.PLLState = RCC_PLL_ON; // 8 MHz / 8 * 192 / 8 = 24 MHz
    RCC_OscInitStruct.PLL.PLLSource = RCC_PLLSOURCE_HSE;
    RCC_OscInitStruct.PLL.PLLM = 8;              // VCO input clock = 8 MHz / PLLM = 1 MHz
    RCC_OscInitStruct.PLL.PLLN = 192;            // VCO output clock = VCO input clock * PLLN = 192 MHz
    RCC_OscInitStruct.PLL.PLLP = RCC_PLLP_DIV8;  // PLLCLK = VCO output clock / PLLP = 24 MHz
    RCC_OscInitStruct.PLL.PLLQ = 4;              // USB clock = VCO output clock / PLLQ = 48 MHz
    if (HAL_RCC_OscConfig(&RCC_OscInitStruct) != HAL_OK) {
        while (1)
            ;
    }

    // Select PLL as system clock source and configure the HCLK, PCLK1 and PCLK2 clock dividers
    RCC_ClkInitStruct.ClockType = RCC_CLOCKTYPE_SYSCLK | RCC_CLOCKTYPE_HCLK | RCC_CLOCKTYPE_PCLK1 | RCC_CLOCKTYPE_PCLK2;
    RCC_ClkInitStruct.SYSCLKSource = RCC_SYSCLKSOURCE_PLLCLK; // 24 MHz
    RCC_ClkInitStruct.AHBCLKDivider = RCC_SYSCLK_DIV1;        // 24 MHz
    RCC_ClkInitStruct.APB1CLKDivider = RCC_HCLK_DIV1;         // 24 MHz
    RCC_ClkInitStruct.APB2CLKDivider = RCC_HCLK_DIV1;         // 24 MHz
    if (HAL_RCC_ClockConfig(&RCC_ClkInitStruct, FLASH_LATENCY_0) != HAL_OK) {
        while (1)
            ;
    }
}

// Print a cycle count over SWO as exactly 4 decimal digits
void print_cycles(uint32_t cycles) {
    uint32_t q = 1000, t;
    for (int i = 0; i < 4; i++) {
        t = (cycles / q) % 10;
        ITM_SendChar('0' + t);
        q /= 10;
    }
    ITM_SendChar('\n');
}

void f(uint32_t *);

int main(void) {
    uint32_t x[16];
    SystemClock_Config();
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // enable trace, required for DWT
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            // enable the cycle counter
    uint32_t before, after;
    while (1) {
        __disable_irq();
        before = DWT->CYCCNT;
        f(x);
        after = DWT->CYCCNT;
        __enable_irq();
        print_cycles(after - before);
        HAL_Delay(1000);
    }
}
I believe this is enough to drop into a project containing the STM32F4 HAL and run. The project needs a global `#define` of `HSE_VALUE=8000000`, since the HAL assumes a 25 MHz crystal rather than the 8 MHz crystal actually fitted to the board.

You can choose between the unrolled and non-unrolled versions by uncommenting or commenting `#define UNROLL` at the start of the assembly file.

Running `arm-none-eabi-objdump` on the `main()` function and looking at the call site:
80009da: 4668 mov r0, sp
before = DWT->CYCCNT;
80009dc: 6865 ldr r5, [r4, #4]
f(x);
80009de: f7ff fbd3 bl 8000188 <f>
after = DWT->CYCCNT;
80009e2: 6860 ldr r0, [r4, #4]
Thus, the only instruction between the two reads of the DWT cycle counter is the `bl` that branches into the `f()` assembly function.
The non-unrolled version runs in 1536 cycles, whereas the unrolled version runs in 1356 cycles.
Here is my spreadsheet for the non-unrolled version (not accounting for the already measured 1-cycle overhead of reading the DWT cycle counter):
| Instruction | Loop iters | Macro repeats | Count | Cycle count | Total cycles |
|---|---|---|---|---|---|
| `bl` (from `main`) | 1 | 1 | 1 | 4 | 4 |
| `push` (12 regs) | 1 | 1 | 1 | 13 | 13 |
| `ldmia` (12 regs) | 1 | 1 | 1 | 13 | 13 |
| `mov` | 1 | 1 | 1 | 1 | 1 |
| `push` (1 reg) | 1 | 1 | 1 | 2 | 2 |
| `ldr` | 25 | 4 | 1 | 2 | 200 |
| `and` | 25 | 4 | 8 | 1 | 800 |
| `str` | 25 | 4 | 1 | 2 | 200 |
| `ldr` | 1 | 1 | 1 | 2 | 2 |
| `subs` | 1 | 1 | 1 | 1 | 1 |
| `str` | 1 | 1 | 1 | 2 | 2 |
| `bne` (taken) | 24 | 1 | 1 | 4 | 96 |
| `bne` (not taken) | 1 | 1 | 1 | 1 | 1 |
| `stmia` (12 regs) | 1 | 1 | 1 | 13 | 13 |
| `pop` (11 regs + pc) | 1 | 1 | 1 | 16 | 16 |
| **Total** | | | | | **1364** |
The last column is just the product of the 2nd through 5th columns of the table, and the last row is a sum of all values in the "Total" column. This is the predicted execution time.
Thus, for the non-unrolled version: 1536/(1364 + 1) - 1 = 12.5% error (the + 1 term is to account for the DWT cycle counter overhead).
As for the unrolled version, a few rows must be removed from the table above: the loop setup (`mov` and `push` (1 reg)) and the loop counter update and branch (`ldr`, `subs`, `str`, and `bne`, both taken and not taken). These account for 105 cycles, so the predicted execution time would be 1259 cycles.
For the unrolled version, we have 1356/(1259 + 1) - 1 = 7.6% error.