First of all, sorry for the long text, but I find this topic quite interesting and I want to understand the details.
Disclaimer: I'm not that familiar with reading assembler code, so there may be simple things that are obvious to somebody with long experience reading assembly that I'm missing.
I have read this text about performance speed limits (PSL), and I wanted to test the same for some simple loops to get an idea about the realistic upper bounds.
I'm investigating the performance of the AVX and FMA instruction sets individually.
Testing environment:
- Intel Broadwell architecture and newer
- The actual benchmark platform is Coffee Lake
- Intel Turbo boost disabled
- macOS 11.2
- Xcode 12.4 - clang 12.0.0
- Compiled with -O3 -ffast-math (-mavx or -mfma)
- Aligned memory on 32 bytes boundaries
- Multiple iterations with all memory in L1-cache
AVX performance for simple C(:) = A(:) + B(:) loop
For the following loop:
for (int n = 0; n < N; ++n) {
    C[n] = A[n] + B[n];
}
I get the following assembler code:
vmovups -96(%r12,%rdx,4), %ymm0
vmovups -64(%r12,%rdx,4), %ymm1
vmovups -32(%r12,%rdx,4), %ymm2
vmovups (%r12,%rdx,4), %ymm3
vaddps -96(%r15,%rdx,4), %ymm0, %ymm0
vaddps -64(%r15,%rdx,4), %ymm1, %ymm1
vaddps -32(%r15,%rdx,4), %ymm2, %ymm2
vaddps (%r15,%rdx,4), %ymm3, %ymm3
vmovups %ymm0, -96(%r13,%rdx,4)
vmovups %ymm1, -64(%r13,%rdx,4)
vmovups %ymm2, -32(%r13,%rdx,4)
vmovups %ymm3, (%r13,%rdx,4)
addq $32, %rdx
cmpq $1016, %rdx
jne
If I count the instructions I find that the loop is doing:
- 8 loads
- 4 add
- 4 stores
- 1 add to increase the index
- 1 cmp+jmp (these macro-fuse, so I count them as one)
If the max number of instructions per cycle (4) were the limit, one would expect this loop to execute with a throughput of:
(8+4+4+2)/4/8/4 = 0.14 cycles/iteration
where the divisors 4/8/4 are max_instructions_per_cycle / simd_width / loop_unroll_factor. Here, "iteration" refers to one iteration of the original for loop (before unrolling and vectorization).
The PSL article states that we are restricted here by the fact that Intel CPUs can't do 2 simultaneous loads together with 1 store if the store is not using a simple addressing scheme. So we have to count half of the loads twice, and the same for the store operations.
So we get:
(8+4+4+2+(4+4))/4/8/4 = 0.2 cycles/iteration
where (4+4) is the extra penalty for conflicting loads and stores.
And when benchmarking I get:
0.20 cycles/iteration
That is great!
Next, I tried to test the "port-7 hack" for doing two loads and one store in a single cycle.
The code (from PSL) looks like this:
// magic store-on-port-7 hack
float* c = C;
const float* a = A;
const float* b = B;
const float* end = c + N;
ptrdiff_t a_offset = (a - c);
ptrdiff_t b_offset = (b - c);
for (; c < end; c++) {
    *c = *(c + a_offset) + *(c + b_offset);
}
This yields the following assembly code:
vmovups -96(%rbx,%rdx,4), %ymm0
vmovups -64(%rbx,%rdx,4), %ymm1
vmovups -32(%rbx,%rdx,4), %ymm2
vmovups (%rbx,%rdx,4), %ymm3
vaddps -96(%rbx,%rcx,4), %ymm0, %ymm0
vaddps -64(%rbx,%rcx,4), %ymm1, %ymm1
vaddps -32(%rbx,%rcx,4), %ymm2, %ymm2
vaddps (%rbx,%rcx,4), %ymm3, %ymm3
vmovups %ymm0, -96(%rbx)
vmovups %ymm1, -64(%rbx)
vmovups %ymm2, -32(%rbx)
vmovups %ymm3, (%rbx)
subq $-128, %rbx
addq $-32, %rdi
jne
We now see that the store operations are using simple addressing.
If we count the number of instructions we get an expected:
(8+4+4+3)/4/8/4 = 0.148 cycles/iteration.
However, when measuring this I get:
0.156 cycles/iteration
It looks like I misestimated by 1 instruction, since (8+4+4+4)/4/8/4 = 0.156.
So there is some limitation in play here that I don't understand.
The Intel intrinsics guide doesn't list information about Coffee Lake, but Agner Fog's instruction tables list vaddps with a latency of 4 and a reciprocal throughput of 0.5 (the same as for Ice Lake and Skylake).
Does this mean we should have unrolled the loop with a factor of 8, to have 8 add instructions in flight per iteration, or is the CPU smart enough to figure this out?
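One way to request the wider unroll is clang's loop pragmas (a sketch; whether interleave_count or unroll_count is the pragma that multiplies the number of vector adds in flight can depend on the compiler version):

```c
#include <stddef.h>

/* Port-7-hack loop with a requested 8x interleave, aiming for eight
   independent vaddps in flight per trip of the vectorized loop. */
void add_interleave8(float *c, float *end,
                     ptrdiff_t a_offset, ptrdiff_t b_offset) {
#pragma clang loop interleave_count(8)
    for (; c < end; c++) {
        *c = *(c + a_offset) + *(c + b_offset);
    }
}
```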
When forcing an 8-times unroll for the "port-7-hack", I get the following assembly code:
vmovups -224(%rbx,%rdx,4), %ymm0
vmovups -192(%rbx,%rdx,4), %ymm1
vmovups -160(%rbx,%rdx,4), %ymm2
vmovups -128(%rbx,%rdx,4), %ymm3
vmovups -96(%rbx,%rdx,4), %ymm4
vmovups -64(%rbx,%rdx,4), %ymm5
vmovups -32(%rbx,%rdx,4), %ymm6
vmovups (%rbx,%rdx,4), %ymm7
vaddps -224(%rbx,%rcx,4), %ymm0, %ymm0
vaddps -192(%rbx,%rcx,4), %ymm1, %ymm1
vaddps -160(%rbx,%rcx,4), %ymm2, %ymm2
vaddps -128(%rbx,%rcx,4), %ymm3, %ymm3
vaddps -96(%rbx,%rcx,4), %ymm4, %ymm4
vaddps -64(%rbx,%rcx,4), %ymm5, %ymm5
vaddps -32(%rbx,%rcx,4), %ymm6, %ymm6
vaddps (%rbx,%rcx,4), %ymm7, %ymm7
vmovups %ymm0, -224(%rbx)
vmovups %ymm1, -192(%rbx)
vmovups %ymm2, -160(%rbx)
vmovups %ymm3, -128(%rbx)
vmovups %ymm4, -96(%rbx)
vmovups %ymm5, -64(%rbx)
vmovups %ymm6, -32(%rbx)
vmovups %ymm7, (%rbx)
addq $256, %rbx
addq $-64, %rdi
jne
In my head, this should have a throughput of
(16+8+8+3)/4/8/8 = 0.137 cycles/iteration.
However, when measuring, the performance drops to:
0.194 cycles/iteration
So it is worse with an 8-factor unroll. How come?
The PSL article comments that:
Note that even if stores have non-complex addressing, it may not be possible to sustain 2 loads/1 store, because the store may sometimes choose one of the port 2 or port 3 AGUs instead, starving a load that cycle.
Is this what I'm seeing? In theory, it should have been possible to unroll the loop with a large factor to get really close to the magic 0.125 cycles/iteration limit.
Simple C(:) = A(:) + B(:)*f loop
I'm also interested in the practical upper bounds for another simple for loop:
float f = <some scalar number>;
for (int n = 0; n < N; ++n) {
    C[n] = A[n] + B[n]*f;
}
I test this with both the AVX and FMA instruction sets.
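For reference, the FMA version of this kernel written directly with intrinsics looks like the following (a sketch; assumes 32-byte aligned buffers and N a multiple of 8, and the target attribute is only there so the example compiles without a global -mfma flag):

```c
#include <immintrin.h>

/* C[n] = A[n] + B[n]*f, eight floats per step; the multiply and the
   add are fused into a single vfmadd instruction. */
__attribute__((target("avx,fma")))
static void fma_kernel(float *C, const float *A, const float *B,
                       float f, int N) {
    const __m256 vf = _mm256_set1_ps(f);
    for (int n = 0; n < N; n += 8) {
        __m256 va = _mm256_load_ps(A + n);
        __m256 vb = _mm256_load_ps(B + n);
        _mm256_store_ps(C + n, _mm256_fmadd_ps(vb, vf, va)); /* vb*vf + va */
    }
}
```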
AVX
For AVX we get one more arithmetic instruction per iteration here (the assembly looks the same, just with 4 extra vmulps instructions per unrolled iteration), so we are probably no longer limited by the simultaneous load/store restriction(?), but I'm not exactly sure.
With perfect throughput, we should get (8+4+4+4+2)/4/8/4 = 0.17 cycles/iteration, but when measuring I get a throughput of 0.20 cycles/iteration. So there are 4 instruction slots of penalty per unrolled-loop iteration. Should we count the 4 extra vmulps instructions with a double penalty?
Is this the limiting factor here?
FMA
With FMA (compiled with -mfma) the extra multiplication is fused with the addition into a single instruction, so we should be back at 0.2 cycles/iteration, hence limited by the port-7 issue. But when measuring I get 0.21, so there is 1 extra instruction of penalty somewhere that I don't understand.
With the "port-7 hack" I think it should be possible to get back to a performance equal to the C(:) = A(:) + B(:) loop, i.e. a throughput of (8+4+4+3)/4/8/4 = 0.148 cycles/iteration. The assembly code looks the same, just with the vaddps replaced by a vfmadd213ps instruction.
However, when measuring I get a throughput of:
0.162 cycles/iteration
So I'm 2 instructions above my upper-bound estimate of 8+4+4+3 instructions per unrolled-loop iteration.
Again, when trying to increase the unroll factor, the performance drops.
Is it outrageous to expect close-to-0.125 cycles/iteration throughput for this loop?
If yes, what is the limiting factor(s)?
If you reached this far, thank you!
Summary
- What is the practical upper bound performance for a C(:) = A(:) + B(:) loop using AVX?
- What is the practical upper bound performance for a C(:) = A(:) + B(:)*f loop using AVX?
- What is the practical upper bound performance for a C(:) = A(:) + B(:)*f loop using FMA?