
So I decided to take a look at how to use SSE, AVX, etc. in C via Intel® intrinsics, not because of any actual interest in using it for something, but out of pure curiosity. While trying to check whether code using AVX is actually faster than non-AVX code, I was a bit surprised by the results. Here is my C code:

#include <stdio.h>
#include <stdlib.h>

#include <emmintrin.h>
#include <immintrin.h>


/*** Sum up two vectors using AVX ***/
#define __vec_sum_4d_d64(src_vec1, src_vec2, dst_vec) \
  _mm256_store_pd(dst_vec, _mm256_add_pd(_mm256_load_pd(src_vec1), _mm256_load_pd(src_vec2)));

/*** Sum up two vectors without AVX ***/
#define __vec_sum_4d(src_vec1, src_vec2, dst_vec) \
  dst_vec[0] = src_vec1[0] + src_vec2[0];\
  dst_vec[1] = src_vec1[1] + src_vec2[1];\
  dst_vec[2] = src_vec1[2] + src_vec2[2];\
  dst_vec[3] = src_vec1[3] + src_vec2[3];


int main (int argc, char *argv[]) {
  unsigned long i;

  /* bail out early instead of reading past the end of argv */
  if (argc < 9) {
    fprintf(stderr, "usage: %s a1 a2 a3 a4 b1 b2 b3 b4\n", argv[0]);
    return 1;
  }

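  /* inputs come from argv so the compiler cannot pre-compute the sums;
     the arrays are 32-byte aligned because _mm256_load_pd requires it */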
  double dvec1[4] __attribute__ ((aligned (32))) = {atof(argv[1]), atof(argv[2]), atof(argv[3]), atof(argv[4])};
  double dvec2[4] __attribute__ ((aligned (32))) = {atof(argv[5]), atof(argv[6]), atof(argv[7]), atof(argv[8])};

#if 1
  for (i = 0; i < 3000000000; i++) {
    __vec_sum_4d(dvec1, dvec2, dvec2);
  }
#endif
#if 0
  for (i = 0; i < 3000000000; i++) {
    __vec_sum_4d_d64(dvec1, dvec2, dvec2);
  }
#endif

  printf("%10.10lf %10.10lf %10.10lf %10.10lf\n", dvec2[0], dvec2[1], dvec2[2], dvec2[3]);
}

I switch between the two "modes" (AVX and non-AVX) by flipping #if 1 to #if 0 and vice versa. My expectation was that the loop using AVX would be at least somewhat faster than the other one, but it isn't. I compiled the code with gcc version 10.2.0 (GCC) and these flags: -O2 --std=gnu99 -lm -mavx2.
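
For reference, the two binaries from the timings below would be built roughly like this (the source file name avxtest.c is an assumption; both builds use the same flags, with the two #if blocks toggled in the source):

> gcc -O2 --std=gnu99 -lm -mavx2 -o noavx.x86_64 avxtest.c
> gcc -O2 --std=gnu99 -lm -mavx2 -o withavx.x86_64 avxtest.c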

> time ./noavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000

real    0m2.150s
user    0m2.147s
sys 0m0.000s

> time ./withavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000

real    0m2.168s
user    0m2.165s
sys 0m0.000s

As you can see, they run at practically the same speed. I also tried increasing the number of iterations by a factor of ten, but the timings simply scale up proportionally. Also note that the printed output values are the same for both executables, so I think it is safe to say that both perform the same calculations. Digging deeper, I took a look at the assembly and was even more confused. Here are the important parts of both (only the loop):

; With avx
1070:   c5 fd 58 c1             vaddpd %ymm1,%ymm0,%ymm0
1074:   48 83 e8 01             sub    $0x1,%rax
1078:   75 f6                   jne    1070

; Without avx
1080:   c5 fb 58 c4             vaddsd %xmm4,%xmm0,%xmm0
1084:   c5 f3 58 cd             vaddsd %xmm5,%xmm1,%xmm1
1088:   c5 eb 58 d7             vaddsd %xmm7,%xmm2,%xmm2
108c:   c5 e3 58 de             vaddsd %xmm6,%xmm3,%xmm3
1090:   48 83 e8 01             sub    $0x1,%rax
1094:   75 ea                   jne    1080

To my understanding, the second one should be way slower, since besides decrementing the counter and the conditional jump it contains four times as many instructions. Why is it not slower? Is the vaddsd instruction just four times faster than vaddpd?

If this is relevant: my system runs an AMD Ryzen 5 2600X Six-Core Processor, which supports AVX.

  • This seems like something that the compiler can pre-calculate at compile time. Some 90% of all benchmarking questions on SO are caused by wrong benchmarking methods. Consider making those two double arrays parameters of a function instead, then disassemble that function. – Lundin Mar 12 '21 at 15:10
  • In addition to the other comments, note that it may be limited by the memory access speed. – JDługosz Mar 12 '21 at 15:28
  • @Lundin I edited the code so that the data is loaded from argv - results remain the same. I also ran the program in GDB, it actually goes through the single instructions of the loop, so by my understanding nothing is pre-calculated at compile time. – GimbaAghDurba Mar 12 '21 at 15:32
  • @JDługosz It could very well be that my understanding is wrong, but looking at the assembly it seems like it doesn't load anything from memory inside the loop, so memory shouldn't be a bottleneck in this case, right? – GimbaAghDurba Mar 12 '21 at 15:33
  • `vaddpd` and `vaddsd` both have a throughput of 2 instructions/cycle, but a latency of 3 or 4 cycles. Your benchmark essentially just measures the latency, not the throughput. – chtz Mar 12 '21 at 16:13
  • @chtz So I looked up a reference document (https://www.amd.com/system/files/TechDocs/55723_3_01_0.zip) containing throughput and latency information for AMD 17h (which is Zen1/Zen2) processors, and the throughput/latency for `vaddpd` and `vaddsd` seems to be the same: Throughput=2, Latency=3. Seems like this can't really be the reason. – GimbaAghDurba Mar 12 '21 at 17:41
  • The ymm registers are wider than the xmm registers. But the second loop is doing more operations using more registers. I believe the answer to your question is pipelining. The CPU is likely able to use the calculation hardware to run two xmm operations in the same amount of time as one ymm op. – Zan Lynx Mar 12 '21 at 18:32
  • @GimbaAghDurba I posted a more detailed answer. I hope this makes it somewhat clearer. – chtz Mar 12 '21 at 18:35
  • @ZanLynx: Not exactly: both are bottlenecked by the same 3-cycle latency of FP add as the loop-carried dependency, not throughput limits. One `vaddpd ymm` is overall cheaper (2 uops) than 4x `vaddpd/sd xmm` (4 uops for the same back-end ports, and more front-end cost). It is basically true that each YMM operation costs the same throughput resources as two XMM operations (modulo some possible front-end differences), though, but neither front-end nor back-end throughput is the bottleneck. – Peter Cordes Mar 13 '21 at 04:37
  • @GimbaAghDurba: Note that `gcc -O3` enables auto-vectorization, and would hopefully make the same asm for both versions. Or maybe 2x `vaddpd xmm` depending on its tuning choices. – Peter Cordes Mar 13 '21 at 04:41

2 Answers


With AVX

; With avx
1070:   c5 fd 58 c1             vaddpd %ymm1,%ymm0,%ymm0
1074:   48 83 e8 01             sub    $0x1,%rax
1078:   75 f6                   jne    1070

This loop uses ymm0 as an accumulator. In other words, it is doing ymm0 += ymm1 (a vector operation, adding 4 double values at once). Therefore it has a loop-carried dependency on ymm0: every new addition has to wait for the previous addition to finish and uses its result to start the next addition. vaddpd has latency=3, throughput=1 for Zen+ (according to https://www.uops.info/table.html). The loop-carried dependency makes this loop bottleneck on the latency of vaddpd, so your loop can achieve at best 3 cycles/iteration. Only one vaddpd addition is in flight at a time, which under-utilizes the CPU's capability by a lot.

To make this faster, add more accumulators (sum into more vectors), as sketched below. In theory it can get 3 times faster due to pipelining (3 full ymm additions in flight), as long as it does not get limited by something else.
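
A minimal sketch of that idea (illustrative only, not the questioner's code; the function and variable names are made up):

#include <immintrin.h>

/* Three independent dependency chains hide the 3-cycle vaddpd latency.
   Assumes iters is a multiple of 3. */
__m256d sum_repeated(__m256d inc, unsigned long iters) {
  __m256d acc0 = _mm256_setzero_pd();
  __m256d acc1 = _mm256_setzero_pd();
  __m256d acc2 = _mm256_setzero_pd();
  for (unsigned long i = 0; i < iters; i += 3) {
    acc0 = _mm256_add_pd(acc0, inc);  /* these three additions do not depend  */
    acc1 = _mm256_add_pd(acc1, inc);  /* on each other, so all of them can be */
    acc2 = _mm256_add_pd(acc2, inc);  /* in flight at the same time           */
  }
  /* combine the partial sums at the end */
  return _mm256_add_pd(acc0, _mm256_add_pd(acc1, acc2));
}

Note that splitting the sum across accumulators reassociates the floating-point additions, so the result can differ in the last bits from the single-accumulator version.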

Without AVX

; Without avx
1080:   c5 fb 58 c4             vaddsd %xmm4,%xmm0,%xmm0
1084:   c5 f3 58 cd             vaddsd %xmm5,%xmm1,%xmm1
1088:   c5 eb 58 d7             vaddsd %xmm7,%xmm2,%xmm2
108c:   c5 e3 58 de             vaddsd %xmm6,%xmm3,%xmm3
1090:   48 83 e8 01             sub    $0x1,%rax
1094:   75 ea                   jne    1080

This loop accumulates results into 4 different accumulators. Basically it is doing:

xmm0 += xmm4
xmm1 += xmm5
xmm2 += xmm7
xmm3 += xmm6

All of these additions are independent of each other (and they are scalar additions, so each operates on only a single 64-bit floating-point value). vaddsd has latency=3 and throughput=0.5 cycles per instruction (i.e. 2 instructions per cycle), which means the CPU can start executing the first 2 additions in one cycle and the second pair on the next cycle. Based on throughput, 2 cycles/iteration would therefore be achievable for this loop. But latency, as you recall, is 3 cycles, so this loop is also bottlenecked on latency. Unroll once with 4 additional accumulators (alternatively, break the loop-carried dependency chain within the loop by adding xmm4-7 to each other before adding the result to the main accumulator) to get rid of that bottleneck; it may get ~50% faster. A sketch of the unrolled version follows.
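
As an illustration (a sketch under the stated assumptions, not the original code; the helper name is made up):

/* dst += n * src, element-wise, with 8 scalar accumulators (two chains of 4).
   Assumes n is even. Reassociating the FP additions can change rounding,
   so the result may differ in the last bits. */
void vec_sum_4d_unrolled(const double *src, double *dst, unsigned long n) {
  double a0 = dst[0], a1 = dst[1], a2 = dst[2], a3 = dst[3];
  double b0 = 0.0, b1 = 0.0, b2 = 0.0, b3 = 0.0;
  for (unsigned long i = 0; i < n; i += 2) {
    a0 += src[0]; a1 += src[1]; a2 += src[2]; a3 += src[3];  /* chain A */
    b0 += src[0]; b1 += src[1]; b2 += src[2]; b3 += src[3];  /* chain B */
  }
  dst[0] = a0 + b0; dst[1] = a1 + b1;
  dst[2] = a2 + b2; dst[3] = a3 + b3;
}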

Note that this ("without AVX") disassembly still uses VEX encoding, so technically it still requires an AVX-capable CPU.

On Benchmarking

Note that your disassembly does not contain any loads or stores, so it may or may not be representative of a performance comparison for adding two arrays of 4-double vectors.

stepan
  • On Zen1, 256-bit math instructions decode to 2 uops (i.e. they split YMM registers into two 128-bit halves). There could maybe be a front-end effect going on there; I'd have expected the YMM version to be at least as fast since it has the same latency but runs half as many uops. IDK, maybe scheduling keeps the halves tied together somewhat so it makes it possible to "lose cycles" on that one critical path? (If that's happening, probably unrolling with even more than 3 accumulators would be good to give scheduling some slack, like in [this Q&A](https://stackoverflow.com/q/45113527/224132).) – Peter Cordes Mar 13 '21 at 04:26
  • Oh, they *are* basically the same speed, so both are probably hitting the latency bottleneck reliably. Maybe just a CPU-frequency or other warm-up issue, or some kind of code alignment thing? Although the two tests aren't in the same process so it's not like one can warm-up the CPU for the other. But still, `time` isn't that precise. – Peter Cordes Mar 13 '21 at 04:31
  • @PeterCordes Yeah, the timing looks similar. What are you trying to figure out? "Maybe just a CPU-frequency or other warm-up issue, or some kind of code alignment thing" I am not following. – stepan Mar 13 '21 at 05:46
  • There is some minor time difference in the OP's measurements, which might just be noise or might be due to some minor effect. That's what I was guessing at. On 2nd thought code alignment is unlikely because the bottleneck is not front-end. – Peter Cordes Mar 13 '21 at 05:53
  • When I wrote my first comment, I had only read the very first part of the question (where it said scalar was faster) and glanced at the code, then scrolled down to see if there were already answers summarizing what was going on. There were, so I was looking at this answer for an explanation of why AVX was *slower* when I wrote my comment, not why they were the same speed. So that's why I was trying to think of complicated explanations in the first place :P – Peter Cordes Mar 13 '21 at 05:54

You are dealing with a latency issue. Depending on the CPU, you have to wait 3 or 4 cycles until you can use the result of a vaddpd or vaddsd instruction. But within one cycle, up to 2 vaddpd or vaddsd instructions can start executing (if the CPU does not have to wait for source registers).

Since in your loop

; Without avx
1080:   c5 fb 58 c4             vaddsd %xmm4,%xmm0,%xmm0
1084:   c5 f3 58 cd             vaddsd %xmm5,%xmm1,%xmm1
1088:   c5 eb 58 d7             vaddsd %xmm7,%xmm2,%xmm2
108c:   c5 e3 58 de             vaddsd %xmm6,%xmm3,%xmm3
1090:   48 83 e8 01             sub    $0x1,%rax
1094:   75 ea                   jne    1080

each vaddsd depends on the result of the previous iteration, so it has to wait 3 or 4 cycles before it can execute. But the execution of all four vaddsd, as well as the sub and jne, can happen during that time. Therefore, for this simple loop it makes no difference whether you execute one vaddpd or four vaddsd.

To fully exploit the throughput of the vaddpd unit, you would need to execute 6 or 8 additions that do not depend on each other's results (or have other instructions do some independent work), as in the sketch below.
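
A minimal sketch of that (illustrative only; the function name is made up), keeping 6 independent vaddpd chains in flight, which matches latency 3 times throughput 2/cycle:

#include <immintrin.h>

/* Assumes iters is a multiple of 6. */
__m256d add_saturated(__m256d v, unsigned long iters) {
  __m256d acc[6];
  for (int k = 0; k < 6; k++)
    acc[k] = _mm256_setzero_pd();
  for (unsigned long i = 0; i < iters; i += 6)
    for (int k = 0; k < 6; k++)    /* 6 additions with no mutual dependency */
      acc[k] = _mm256_add_pd(acc[k], v);
  for (int k = 1; k < 6; k++)      /* reduce the partial sums at the end */
    acc[0] = _mm256_add_pd(acc[0], acc[k]);
  return acc[0];
}

The compiler should fully unroll the short inner loops; the point is only that the 6 additions per outer iteration form separate dependency chains.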

chtz