4

I have a strange occurrence here that I can't really explain. I am writing some numerical code and benchmarking different implementations. I wanted to benchmark simple vector additions with SSE and AVX intrinsics as well as gcc's auto-vectorization. To test this, I used and modified the code below.

Code:

#include <iostream>
#include <immintrin.h>

#include "../../time/timer.hpp"


void ser(double* a, double* b, double* res, int size){
 for(int i(0); i < size; i++ )
 {
    res[i] = a[i] + b[i];
 }
}

void sse(double* a, double* b, double* res, int size){
 for (int i(0); i < (size & ~0x1); i += 2 )
 {
    const __m128d kA2   = _mm_load_pd( &a[i] );
    const __m128d kB2   = _mm_load_pd( &b[i] );
    const __m128d kRes = _mm_add_pd( kA2, kB2 );
    _mm_store_pd( &res[i], kRes );
 }
}

void avx(double* a, double* b, double* res, int size){
 for (int i(0); i < (size & ~0x3); i += 4 )
 {
    const __m256d kA4   = _mm256_load_pd( &a[i] );
    const __m256d kB4   = _mm256_load_pd( &b[i] );
    const __m256d kRes = _mm256_add_pd( kA4, kB4 );
    _mm256_store_pd( &res[i], kRes );
 }
}


#define N 1e7*64

int main(int argc, char const *argv[])
{
 double* a = (double*)_mm_malloc(N*sizeof(double), 64);
 double* b = (double*)_mm_malloc(N*sizeof(double), 64);
 double* res = (double*)_mm_malloc(N*sizeof(double), 64);

 Timer tm;

 tm.start();
 avx(a,b,res,N);
 tm.stop();
 std::cout<<"AVX\t"<<tm.elapsed()<<" ms\t"
          <<1e-6*N/tm.elapsed() <<" GFLOP/s"<<std::endl;

 tm.start();
 sse(a,b,res,N);
 tm.stop();
 std::cout<<"SSE\t"<<tm.elapsed()<<" ms\t"
          <<1e-6*N/tm.elapsed() <<" GFLOP/s"<<std::endl;

 tm.start();
 ser(a,b,res,N);
 tm.stop();
 std::cout<<"SER\t"<<tm.elapsed()<<" ms\t"
          <<1e-6*N/tm.elapsed() <<" GFLOP/s"<<std::endl;
 return 0;
}

For the timings and calculated GFLOP/s, I get:

./test3
AVX 1892 ms 0.338266 GFLOP/s
SSE 408  ms 1.56863 GFLOP/s
SER 396  ms 1.61616 GFLOP/s

which is clearly very slow compared to the peak performance of about 170 GFLOP/s of my i5-6600K.

Am I missing anything important here? I know that vector addition on a CPU is not the best idea, but these results are really bad. Thanks for any clue.

  • 8
    You made a very common (but not obvious) benchmarking mistake. You forgot to initialize the memory from `malloc()`. So the AVX test (which runs first) is actually page-faulting. – Mysticial Jan 25 '17 at 16:57
  • 1
    Once you fix that, you'll run into this issue: http://stackoverflow.com/questions/18159455/why-vectorizing-the-loop-does-not-have-performance-improvement – Mysticial Jan 25 '17 at 17:00
  • 1
@Mysticial Right! I now benchmark with warm memory and get about the same performance for every version! Thanks. – jane lorasz Jan 25 '17 at 18:03
  • When I benchmark I usually do a cold run without timing first. E.g. in your code do one iteration `avx(a,b,res,1)` first before timing. Actually, depending on what you're doing I think it can be useful to report the cold and hot timing. If a function was only going to be called once it could be misleading to report only the hot time. – Z boson Jan 27 '17 at 08:05
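
As Mysticial points out, the fix is simply to write to the buffers before the first timed run so that every page is committed. A minimal sketch against the question's code (same names as above; the untimed warm-up call follows Z boson's suggestion):

Code:

 // Touch/initialize every page before timing; otherwise the first
 // kernel to run (AVX here) pays for all the page faults.
 for (int i = 0; i < N; ++i) {
    a[i] = 1.0;
    b[i] = 2.0;
    res[i] = 0.0;
 }
 avx(a, b, res, N);   // optional untimed warm-up run

 tm.start();
 avx(a, b, res, N);   // now measures only the hot path
 tm.stop();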

2 Answers

0

Your application is likely memory-bound rather than CPU-bound. In other words, memory bandwidth is the bottleneck, so vectorization does not help much here.
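
To make this concrete: each `res[i] = a[i] + b[i]` performs one FLOP but moves 24 bytes (two 8-byte loads plus one 8-byte store). Assuming dual-channel DDR4-2133 on the i5-6600K, i.e. roughly 34 GB/s (my assumption, not stated in the thread), the ceiling is about 34/24 ≈ 1.4 GFLOP/s, in the same ballpark as the numbers measured above. A minimal, self-contained sketch to estimate the bandwidth ceiling on your own machine (names and sizes are mine, not from the question):

Code:

 #include <chrono>
 #include <cstdlib>
 #include <iostream>

 int main() {
    const std::size_t n = 1 << 26;   // 64 Mi doubles = 512 MiB per array
    double* a   = static_cast<double*>(std::malloc(n * sizeof(double)));
    double* b   = static_cast<double*>(std::malloc(n * sizeof(double)));
    double* res = static_cast<double*>(std::malloc(n * sizeof(double)));
    // Initialize so the timed loop does not measure page faults.
    for (std::size_t i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; res[i] = 0.0; }

    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) res[i] = a[i] + b[i];
    const auto t1 = std::chrono::steady_clock::now();

    const double s     = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = 3.0 * n * sizeof(double);   // 2 loads + 1 store per element
    std::cout << bytes / s / 1e9 << " GB/s, "
              << n / s / 1e9 << " GFLOP/s\n";        // 1 add per element
    std::free(a); std::free(b); std::free(res);
 }

If the printed GB/s sits near your DRAM limit, wider vectors cannot make this kernel faster; the cores are simply waiting on memory.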

Yichao Zhou
-2

It most likely has to do with branch prediction (read Why is it faster to process a sorted array than an unsorted array? for a more detailed explanation). On my i5-4200U I get this when processing 100000000 doubles in the order AVX, SSE, SER:

AVX 807 ms 0.123916 GFLOP/s
SSE 215 ms 0.465116 GFLOP/s
SER 287 ms 0.348432 GFLOP/s

But if I change the order to SER, AVX, SSE, I get this:

SER 753 ms 0.132802 GFLOP/s
AVX 225 ms 0.444444 GFLOP/s
SSE 196 ms 0.510204 GFLOP/s

Why it is so far from the peak performance of your CPU, I don't know.

JeppeSRC
  • 2
This has nothing to do with branch prediction. It's about page commit: the first test is slower because the memory hasn't been committed yet, so it page-faults. – Mysticial Jan 25 '17 at 17:03
  • Ah sorry my bad. Thanks. – JeppeSRC Jan 25 '17 at 17:07
Which begs the question of why one should put in the effort to explicitly use AVX when the compiler writers probably do it better than the average person. Sure, if the compiler cannot auto-vectorise then one may need to, but there is an assumption that writing the instructions oneself somehow gets better code than the compiler can generate, which is clearly not a certainty. – Holmz Jan 25 '17 at 19:47
@Holmz You are totally right. But it was more of a basic question about vectorizing sums and whether such loops can be vectorized profitably (and why not). – jane lorasz Jan 25 '17 at 20:56
Basically the original question was about benchmarking, and the compiler was generating faster code than you were. This is also common for the rest of us, unless we spend the time writing the SSE/AVX, which is like assembly. It is always faster to get something going in a higher-level language; only when we are benchmarking, going for the fastest possible code, or facing a case the compiler cannot handle do we have to do it ourselves. – Holmz Jan 26 '17 at 20:59
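
For completeness on Holmz's point: with optimization and target flags, gcc will usually vectorize the plain ser loop on its own (the exact report depends on your gcc version). A hypothetical invocation, with the source file name assumed:

 g++ -O3 -march=native -fopt-info-vec-optimized test3.cpp -o test3

The -fopt-info-vec-optimized flag makes gcc print a note for each loop it managed to vectorize, which is an easy way to confirm that the scalar version compiles down to SIMD code comparable to the hand-written intrinsics.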