So I decided to take a look at how to use SSE, AVX, and so on in C via Intel® Intrinsics. Not because I actually want to use them for something, but out of pure curiosity. When I tried to check whether code using AVX is actually faster than non-AVX code, I was a bit surprised by the results. Here is my C code:
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <immintrin.h>
/*** Sum up two vectors using AVX ***/
#define __vec_sum_4d_d64(src_vec1, src_vec2, dst_vec) \
    _mm256_store_pd(dst_vec, _mm256_add_pd(_mm256_load_pd(src_vec1), _mm256_load_pd(src_vec2)));
/*** Sum up two vectors without AVX ***/
#define __vec_sum_4d(src_vec1, src_vec2, dst_vec) \
    dst_vec[0] = src_vec1[0] + src_vec2[0];\
    dst_vec[1] = src_vec1[1] + src_vec2[1];\
    dst_vec[2] = src_vec1[2] + src_vec2[2];\
    dst_vec[3] = src_vec1[3] + src_vec2[3];
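int main (int argc, char *argv[]) {
    unsigned long i;
    /* Eight numeric arguments are required: four per vector. */
    if (argc < 9) {
        fprintf(stderr, "usage: %s a1 a2 a3 a4 b1 b2 b3 b4\n", argv[0]);
        return 1;
    }
    /* _mm256_load_pd and _mm256_store_pd require 32-byte aligned addresses. */
    double dvec1[4] __attribute__((aligned(32))) = {atof(argv[1]), atof(argv[2]), atof(argv[3]), atof(argv[4])};
    double dvec2[4] __attribute__((aligned(32))) = {atof(argv[5]), atof(argv[6]), atof(argv[7]), atof(argv[8])};
#if 1
    for (i = 0; i < 3000000000; i++) {
        __vec_sum_4d(dvec1, dvec2, dvec2);
    }
#endif
#if 0
    for (i = 0; i < 3000000000; i++) {
        __vec_sum_4d_d64(dvec1, dvec2, dvec2);
    }
#endif
    printf("%10.10lf %10.10lf %10.10lf %10.10lf\n", dvec2[0], dvec2[1], dvec2[2], dvec2[3]);
}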
I simply switch #if 1 to #if 0 and the other way around to switch between "modes" (AVX and non-AVX).
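Instead of editing the #if lines by hand, the variant could also be selected at compile time. A minimal sketch, assuming a hypothetical USE_AVX macro passed as -DUSE_AVX and the hypothetical function names vec_sum_4d and run_sum (none of these are part of the code above). It uses the unaligned load/store intrinsics so that it makes no assumption about how the arrays are aligned:

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_AVX
#include <immintrin.h>
/* AVX variant: a single 256-bit add covers all four doubles. */
void vec_sum_4d(const double *src1, const double *src2, double *dst) {
    _mm256_storeu_pd(dst, _mm256_add_pd(_mm256_loadu_pd(src1), _mm256_loadu_pd(src2)));
}
#else
/* Scalar variant: four separate double additions. */
void vec_sum_4d(const double *src1, const double *src2, double *dst) {
    dst[0] = src1[0] + src2[0];
    dst[1] = src1[1] + src2[1];
    dst[2] = src1[2] + src2[2];
    dst[3] = src1[3] + src2[3];
}
#endif

/* Hypothetical helper: repeats acc = inc + acc for the given number of iterations. */
void run_sum(const double inc[4], double acc[4], unsigned long iterations) {
    assert(inc != NULL && acc != NULL);
    for (unsigned long i = 0; i < iterations; i++) {
        vec_sum_4d(inc, acc, acc);
    }
}
```

Both executables could then be built from the same source file, once without and once with -DUSE_AVX, which avoids accidentally benchmarking the wrong branch.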
My expectation was that the loop using AVX would be at least somewhat faster than the other one, but it isn't. I compiled the code with gcc version 10.2.0 (GCC) and these flags: -O2 --std=gnu99 -lm -mavx2
> time ./noavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000
real 0m2.150s
user 0m2.147s
sys 0m0.000s
> time ./withavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000
real 0m2.168s
user 0m2.165s
sys 0m0.000s
As you can see, they run at practically the same speed. I also tried increasing the number of iterations by a factor of ten, but the run times simply scale up proportionally. Also note that the printed output values are the same for both executables, so I think it is safe to say that both perform the same calculations. Digging deeper, I took a look at the assembly and was even more confused. Here are the important parts of both (only the loop):
; With avx
1070: c5 fd 58 c1 vaddpd %ymm1,%ymm0,%ymm0
1074: 48 83 e8 01 sub $0x1,%rax
1078: 75 f6 jne 1070
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
In my understanding, the second one should be way slower, since besides decrementing the counter and the conditional jump there are four times as many instructions in it. Why is it not slower? Is the vaddsd instruction just four times faster than vaddpd?
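One thing that stands out in both loops is that every add writes the register it also reads, so each iteration has to wait for the result of the previous one, while the four vaddsd instructions use different registers and do not depend on each other. A minimal sketch for probing whether that dependency, rather than the instruction count, determines the run time. The function names chain_one and chain_four are hypothetical and not part of the code above:

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: every add needs the result of the previous add. */
double chain_one(double acc, double inc, unsigned long n) {
    for (unsigned long i = 0; i < n; i++) {
        acc += inc;
    }
    return acc;
}

/* Four independent chains: four times as many adds per iteration,
 * but no add depends on any of the other three accumulators. */
void chain_four(double acc[4], const double inc[4], unsigned long n) {
    assert(acc != NULL && inc != NULL);
    double a0 = acc[0], a1 = acc[1], a2 = acc[2], a3 = acc[3];
    for (unsigned long i = 0; i < n; i++) {
        a0 += inc[0];
        a1 += inc[1];
        a2 += inc[2];
        a3 += inc[3];
    }
    acc[0] = a0;
    acc[1] = a1;
    acc[2] = a2;
    acc[3] = a3;
}
```

Timing both functions with the same n would show whether four independent adds per iteration cost noticeably more than a single one. If they do not, the instruction count alone is probably not what limits these loops.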
If this is relevant, my system runs on an AMD Ryzen 5 2600X Six-Core Processor, which supports AVX.