First of all, sorry for the long text, but I find this topic quite interesting and I want to understand the details.
Disclaimer: I'm not that familiar with reading assembler code, so there may be simple things that are obvious to somebody with long experience reading assembly that I'm missing.
I have read this text about performance speed limits (PSL), and I wanted to test the same for some simple loops to get an idea about the realistic upper bounds.
I'm investigating the performance of the AVX and FMA instruction sets individually.
Testing environment:
- Intel Broadwell architecture and newer
- The actual benchmark platform is Coffee Lake
- Intel Turbo boost disabled
- macOS 11.2
- Xcode 12.4 - clang 12.0.0
- Compiled with -O3 -ffast-math (-mavx or -mfma)
- Aligned memory on 32 bytes boundaries
- Multiple iterations with all memory in L1-cache
AVX performance for simple C(:) = A(:) + B(:) loop
For the following loop:
for (int n = 0; n < N; ++n) {
    C[n] = A[n] + B[n];
}
I get the following assembler code:
vmovups -96(%r12,%rdx,4), %ymm0
vmovups -64(%r12,%rdx,4), %ymm1
vmovups -32(%r12,%rdx,4), %ymm2
vmovups (%r12,%rdx,4), %ymm3
vaddps -96(%r15,%rdx,4), %ymm0, %ymm0
vaddps -64(%r15,%rdx,4), %ymm1, %ymm1
vaddps -32(%r15,%rdx,4), %ymm2, %ymm2
vaddps (%r15,%rdx,4), %ymm3, %ymm3
vmovups %ymm0, -96(%r13,%rdx,4)
vmovups %ymm1, -64(%r13,%rdx,4)
vmovups %ymm2, -32(%r13,%rdx,4)
vmovups %ymm3, (%r13,%rdx,4)
addq $32, %rdx
cmpq $1016, %rdx
jne
If I count the instructions I find that the loop is doing:
- 8 loads
- 4 add
- 4 stores
- 1 add to increase the index
- 1 cmp+jmp (these macro-fuse, so I count them as one)
If the max number of instructions per cycle (4) were the limit, one would expect this loop to execute with a throughput of:
(8+4+4+2)/4/8/4 = 0.14 cycles/iteration
where the divisors 4/8/4 are max_instructions_per_cycle / simd_width / loop_unroll_factor. Here, "iteration" refers to one iteration of the original for loop (before unrolling and vectorization).
The PSL article states that we are restricted here by the fact that Intel CPUs can't do 2 simultaneous loads together with 1 store if the store is not using a simple addressing scheme. So we have to count half of the loads twice, and the same for the store operations.
So we get:
(8+4+4+2+(4+4))/4/8/4 = 0.2 cycles/iteration
where (4+4) is the extra penalty for conflicting loads and stores.
And when benchmarking I get:
0.20 cycles/iteration
That is great!
Next, I tried to test the "port-7 hack" for doing two loads and one store in a single cycle.
The code (from PSL) looks like this:
// magic store-on-port-7 hack
float* c = C;
const float* a = A;
const float* b = B;
const float* end = c + N;
ptrdiff_t a_offset = (a - c);
ptrdiff_t b_offset = (b - c);
for (; c < end; c++) {
    *c = *(c + a_offset) + *(c + b_offset);
}
This yields the following assembly code:
vmovups -96(%rbx,%rdx,4), %ymm0
vmovups -64(%rbx,%rdx,4), %ymm1
vmovups -32(%rbx,%rdx,4), %ymm2
vmovups (%rbx,%rdx,4), %ymm3
vaddps -96(%rbx,%rcx,4), %ymm0, %ymm0
vaddps -64(%rbx,%rcx,4), %ymm1, %ymm1
vaddps -32(%rbx,%rcx,4), %ymm2, %ymm2
vaddps (%rbx,%rcx,4), %ymm3, %ymm3
vmovups %ymm0, -96(%rbx)
vmovups %ymm1, -64(%rbx)
vmovups %ymm2, -32(%rbx)
vmovups %ymm3, (%rbx)
subq $-128, %rbx
addq $-32, %rdi
jne
We now see that the store operations are using simple addressing.
If we count the number of instructions we get an expected:
(8+4+4+3)/4/8/4 = 0.148 cycles/iteration.
However, when measuring this I get:
0.156 cycles/iteration
It looks like I misestimated by 1 instruction, since (8+4+4+4)/4/8/4 = 0.156.
So there is some limitation in play here that I don't understand.
The Intel intrinsics guide doesn't list information about Coffee Lake, but Agner Fog's instruction tables list vaddps with a latency of 4 and a reciprocal throughput of 0.5 (the same as for Ice Lake and Skylake).
Does this mean we should have unrolled the loop with a factor of 8, to have 8 add instructions in flight per iteration, or is the CPU smart enough to figure this out?
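One way to request the wider unroll is clang's loop pragmas (a sketch; whether interleave_count or unroll_count is the pragma that multiplies the number of vector adds in flight can depend on the compiler version):

```c
#include <stddef.h>

/* Port-7-hack loop with a requested 8x interleave, aiming for eight
   independent vaddps in flight per trip of the vectorized loop. */
void add_interleave8(float *c, float *end,
                     ptrdiff_t a_offset, ptrdiff_t b_offset) {
#pragma clang loop interleave_count(8)
    for (; c < end; c++) {
        *c = *(c + a_offset) + *(c + b_offset);
    }
}
```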
When forcing an 8-times unroll for the "port-7-hack", I get the following assembly code:
vmovups -224(%rbx,%rdx,4), %ymm0
vmovups -192(%rbx,%rdx,4), %ymm1
vmovups -160(%rbx,%rdx,4), %ymm2
vmovups -128(%rbx,%rdx,4), %ymm3
vmovups -96(%rbx,%rdx,4), %ymm4
vmovups -64(%rbx,%rdx,4), %ymm5
vmovups -32(%rbx,%rdx,4), %ymm6
vmovups (%rbx,%rdx,4), %ymm7
vaddps -224(%rbx,%rcx,4), %ymm0, %ymm0
vaddps -192(%rbx,%rcx,4), %ymm1, %ymm1
vaddps -160(%rbx,%rcx,4), %ymm2, %ymm2
vaddps -128(%rbx,%rcx,4), %ymm3, %ymm3
vaddps -96(%rbx,%rcx,4), %ymm4, %ymm4
vaddps -64(%rbx,%rcx,4), %ymm5, %ymm5
vaddps -32(%rbx,%rcx,4), %ymm6, %ymm6
vaddps (%rbx,%rcx,4), %ymm7, %ymm7
vmovups %ymm0, -224(%rbx)
vmovups %ymm1, -192(%rbx)
vmovups %ymm2, -160(%rbx)
vmovups %ymm3, -128(%rbx)
vmovups %ymm4, -96(%rbx)
vmovups %ymm5, -64(%rbx)
vmovups %ymm6, -32(%rbx)
vmovups %ymm7, (%rbx)
addq $256, %rbx
addq $-64, %rdi
jne
In my head, this should have a throughput of
(16+8+8+3)/4/8/8 = 0.137 cycles/iteration.
However, when measuring, the performance drops to:
0.194 cycles/iteration
So it is worse with an 8-factor unroll. How come?
The PSL article comments that:
Note that even if stores have non-complex addressing, it may not be possible to sustain 2 loads/1 store, because the store may sometimes choose one of the port 2 or port 3 AGUs instead, starving a load that cycle.
Is this what I'm seeing? In theory, it should have been possible to unroll the loop with a large factor to get really close to the magic 0.125 cycles/iteration limit.
Simple C(:) = A(:) + B(:)*f loop
I'm also interested in the practical upper bounds for another simple for loop:
float f = <some scalar number>;
for (int n = 0; n < N; ++n) {
    C[n] = A[n] + B[n]*f;
}
I test this with both the AVX and FMA instruction sets.
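For reference, the FMA version of this kernel written directly with intrinsics looks like the following (a sketch; assumes 32-byte aligned buffers and N a multiple of 8, and the target attribute is only there so the example compiles without a global -mfma flag):

```c
#include <immintrin.h>

/* C[n] = A[n] + B[n]*f, eight floats per step; the multiply and the
   add are fused into a single vfmadd instruction. */
__attribute__((target("avx,fma")))
static void fma_kernel(float *C, const float *A, const float *B,
                       float f, int N) {
    const __m256 vf = _mm256_set1_ps(f);
    for (int n = 0; n < N; n += 8) {
        __m256 va = _mm256_load_ps(A + n);
        __m256 vb = _mm256_load_ps(B + n);
        _mm256_store_ps(C + n, _mm256_fmadd_ps(vb, vf, va)); /* vb*vf + va */
    }
}
```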
AVX
For AVX we get one more arithmetic instruction per iteration here (the assembly looks the same, just with 4 extra vmulps instructions per unrolled iteration), so we are probably no longer limited by the simultaneous load/store restriction(?), but I'm not exactly sure.
With perfect throughput, we should get (8+4+4+4+2)/4/8/4 = 0.17 cycles/iteration, but when measuring I get a throughput of 0.20 cycles/iteration. So there are 4 instruction slots of penalty per unrolled-loop iteration. Should we count the 4 extra vmulps instructions with a double penalty?
Is this the limiting factor here?
FMA
With FMA (compiled with -mfma) the extra multiplication is fused with the addition into a single instruction, so we should be back at 0.2 cycles/iteration, hence limited by the port-7 issue. But when measuring I get 0.21, so there is 1 extra instruction of penalty somewhere that I don't understand.
With the "port-7 hack" I think it should be possible to get back to a performance equal to the C(:) = A(:) + B(:) loop, i.e. a throughput of (8+4+4+3)/4/8/4 = 0.148 cycles/iteration. The assembly code looks the same, just with the vaddps replaced by a vfmadd213ps instruction.
However, when measuring I get a throughput of:
0.162 cycles/iteration
So I'm 2 instructions above my upper-bound estimate of 8+4+4+3 instructions per unrolled-loop iteration.
Again, when trying to increase the unroll factor, the performance drops.
Is it outrageous to expect close-to-0.125 cycles/iteration throughput for this loop?
If yes, what is the limiting factor(s)?
If you reached this far, thank you!
Summary
- What is the practical upper bound performance for a C(:) = A(:) + B(:) loop using AVX?
- What is the practical upper bound performance for a C(:) = A(:) + B(:)*f loop using AVX?
- What is the practical upper bound performance for a C(:) = A(:) + B(:)*f loop using FMA?