Consider this simple C++ function to calculate the prefix sum of an array:
#include <cstddef>
#include <cstdint>

void prefix_sum(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    for (size_t i = 0; i < size; i++) {
        total += input[i];
        output[i] = total;
    }
}
The loop compiles to the following assembly on gcc 5.5:
.L5:
    add     ecx, DWORD PTR [rdi+rax*4]
    mov     DWORD PTR [rsi+rax*4], ecx
    add     rax, 1
    cmp     rdx, rax
    jne     .L5
I don't see anything that would prevent this from running at 1 cycle per iteration, yet I consistently measure it at 1.32 (+/- 0.01) cycles/iteration on my Skylake i7-6700HQ, when running it against 8 KiB input/output arrays.
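For concreteness, something like the following harness reproduces the measurement setup (a sketch only: the repetition count is arbitrary and __rdtsc counts reference cycles, so treat it as a stand-in for a proper core-cycle measurement):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>   // __rdtsc

// prefix_sum as defined above
void prefix_sum(const uint32_t* input, uint32_t* output, size_t size);

int main() {
    const size_t size = 8 * 1024 / sizeof(uint32_t);   // 8 KiB per array
    std::vector<uint32_t> in(size, 1), out(size);
    const int reps = 100000;

    uint64_t start = __rdtsc();
    for (int r = 0; r < reps; r++)
        prefix_sum(in.data(), out.data(), size);
    uint64_t tsc = __rdtsc() - start;

    // print a result so the calls can't be optimized away
    printf("last = %u\n", out[size - 1]);
    printf("%.3f ref-cycles/element\n", (double)tsc / ((double)reps * size));
}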
The loop is served out of the uop cache, doesn't cross any uop cache boundary, and the performance counters don't indicate any front-end bottleneck. It's 4 fused uops¹, and this CPU can sustain 4 fused uops/cycle.
There are loop-carried dependency chains through ecx and rax, each only 1 cycle long, but these add uops can go to any of the 4 ALU ports, so they seem unlikely to conflict. The macro-fused cmp/jne needs to go to p6, which is more of a concern, but I measure only 1.1 uops/iteration going to p6. That would explain 1.1 cycles per iteration, but not the 1.32 I measure. If I unroll the loop by 2x, port pressure is much lower: less than 0.7 uops to each of p0156, yet performance is still unexpectedly slow at 1.3 cycles per iteration.
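For reference, the 2x unroll is along these lines (a sketch: the name prefix_sum_unroll2 is just for illustration, it assumes size is even, and tail handling is omitted):

void prefix_sum_unroll2(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    // process two elements per loop iteration (assumes size is even)
    for (size_t i = 0; i < size; i += 2) {
        total += input[i];
        output[i] = total;
        total += input[i + 1];
        output[i + 1] = total;
    }
}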
There is one store per iteration, but we can do one store per cycle.
There is one load per iteration, but we can do two of those per cycle.
There are two complex address generations (AGU uops) per iteration, but we can do two of those per cycle.
What's the bottleneck here?
Interestingly, I tried the Ithemal performance predictor and it gets it almost exactly right: estimating 1.314 cycles versus my measurement of 1.32.
¹ I confirmed macro- and micro-fusion via the uops_issued.any counter, which counts in the fused domain and reads 4.0 fused uops per iteration for this loop.