
I am running a vectorization benchmark on macOS with the following i7 processor:

$ sysctl -n machdep.cpu.brand_string

Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

My MacBook Pro is from mid-2014.

I tried different compiler flags for vectorization: the three that interest me are SSE, AVX, and AVX2.

For my benchmark, I add the corresponding elements of two arrays and store each sum in a third array.

Note that I am working with the double type for these arrays.
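For reference, the scalar (non-vectorized) baseline that I measure speedup against is essentially the classical loop below (a sketch; the full NOVEC version is in my source file):

#ifdef NOVEC
void addition_tab(int size, double *a, double *b, double *c)
{
 int i;
 // Classical element-wise sum
 for (i=0; i<size; i++)
  c[i] = a[i] + b[i];
}
#endif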

Here are the functions used in my benchmark code:

1) First, with SSE vectorization:

#ifdef SSE
#include <x86intrin.h>
#define ALIGN 16
void addition_tab(int size, double *a, double *b, double *c)
{

 int i;
 // Main loop
 for (i=size-1; i>=0; i-=2)
 {
  // Intrinsic SSE syntax
  const __m128d x = _mm_load_pd(a); // Load two x elements
  const __m128d y = _mm_load_pd(b); // Load two y elements
  const __m128d sum = _mm_add_pd(x, y); // Compute two sum elements
  _mm_store_pd(c, sum); // Store two sum elements

  // Increment pointers by 2 since SSE vectorizes on 128 bits = 16 bytes = 2*sizeof(double)
  a += 2;
  b += 2;
  c += 2;
 }

}
#endif

2) Second, with AVX256 vectorization:

#ifdef AVX256
#include <immintrin.h>
#define ALIGN 32
void addition_tab(int size, double *a, double *b, double *c)
{

 int i;
 // Main loop
 for (i=size-1; i>=0; i-=4)
 {
  // Intrinsic AVX syntax
  const __m256d x = _mm256_load_pd(a); // Load four x elements
  const __m256d y = _mm256_load_pd(b); // Load four y elements
  const __m256d sum = _mm256_add_pd(x, y); // Compute four sum elements
  _mm256_store_pd(c, sum); // Store four sum elements

  // Increment pointers by 4 since AVX256 vectorizes on 256 bits = 32 bytes = 4*sizeof(double)
  a += 4;
  b += 4;
  c += 4;
 }

}
#endif
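In both cases the arrays are allocated with the corresponding ALIGN value so that the aligned loads and stores are legal. Roughly, the allocation looks like this (a sketch using posix_memalign; the exact call in my code may differ):

#include <stdlib.h>

// Allocate an array of doubles aligned to ALIGN bytes,
// so that _mm_load_pd / _mm256_load_pd can be used on every element.
double *alloc_aligned(int size)
{
 void *p = NULL;
 if (posix_memalign(&p, ALIGN, (size_t)size * sizeof(double)) != 0)
  return NULL;
 return (double *)p;
}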

For SSE vectorization, I expect a speedup of around 2, because each SSE operation processes 128 bits = 16 bytes = 2 * sizeof(double), and I align the data accordingly.

The results I get for SSE vectorization are shown in the following figure:

Results with SSE

So I think these results are valid, since the speedup is around a factor of 2.

Now, for AVX256, I get the following figure:

Results with AVX256

For AVX256 vectorization, I expect a speedup of around 4, because each AVX operation processes 256 bits = 32 bytes = 4 * sizeof(double), and I align the data accordingly.

But as you can see, I still get a factor of 2, not 4, for the speedup.

I don't understand why I get the same speedup with SSE and with AVX vectorization.

Does it come from the compilation flags, from my processor model, ...? I don't know.

Here are the compilation command lines I used for all the results above:

For SSE:

gcc-mp-4.9 -DSSE -O3 -msse main_benchmark.c -o vectorizedExe

For AVX256:

gcc-mp-4.9 -DAVX256 -O3 -Wa,-q -mavx main_benchmark.c -o vectorizedExe

Moreover, could I use AVX-512 vectorization with my processor model? (Once the issue in this question is solved.)

Thanks for your help

UPDATE 1

I tried the different options suggested by @Mischa but still can't get a factor-4 speedup with the AVX flags and options. You can take a look at my C source at http://example.com/test_vectorization/main_benchmark.c.txt (with a .txt extension for direct viewing in the browser); the benchmarking shell script is at http://example.com/test_vectorization/run_benchmark .

As @Mischa suggested, I tried to apply the following command line for compilation:

$GCC -O3 -Wa,-q -mavx -fprefetch-loop-arrays main_benchmark.c -o vectorizedExe

but the generated code contains no AVX instructions.

If you could take a look at these files, that would be great. Thanks.

  • http://stackoverflow.com/questions/42964820/why-is-this-simd-multiplication-not-faster-than-non-simd-multiplication/42972674#42972674 – Z boson Apr 06 '17 at 09:08
  • What is your speedup relative to? If you use `foo(int size, double *a, double *b, double *c) { for(int i=0; i` … – Z boson Apr 06 '17 at 09:39
  • No, I use the `-O0` flag and `#ifdef NOVEC void addition_tab(int size, double *a, double *b, double *c) { int i; // Classical sum for (i=0; i` … –  Apr 06 '17 at 10:32
  • You should enable optimization, passing at least `-O1` to `gcc` (and preferably `-O2` or `-O3` with `-march=native`). Benchmarking an unoptimized binary is meaningless. – Basile Starynkevitch Oct 30 '17 at 16:02

2 Answers


You are hitting the wall for cache-to-RAM transfer. Your Core i7 has a 64-byte cache line. For SSE2, a 16-byte store requires a 64-byte load, update, and queue back to RAM. 16-byte loads in ascending order benefit from automatic prefetch prediction, so you get some benefit on the load side. Add an `_mm_prefetch` of the destination memory, say, 256 bytes ahead of the next store. The same applies to AVX2 32-byte stores.
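In code, that hint is a single intrinsic call placed a fixed distance ahead of the store in the loop; a minimal sketch (256 bytes is only a starting guess, see the follow-up below):

 _mm_prefetch((const char *)c + 256, _MM_HINT_T0); // request the destination line ahead of the upcoming _mm_store_pd / _mm256_store_pd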

Mischa
  • @Mischa thanks, could you please give me a little code snippet for using `_mm_prefetch` correctly? I didn't find much documentation about it. –  Apr 06 '17 at 05:39

NP. There are options:

(1) x86-specific code:

#include <emmintrin.h>
...
for (int i=size; ...) {
    _mm_prefetch(256+(char*)c, _MM_HINT_T0);
    ...
    _mm256_store_pd(c, sum);

(2) gcc-specific code:

for (int i=size; ...) {
    __builtin_prefetch(c+32);
    ...

(3) gcc -fprefetch-loop-arrays: the compiler knows best.

(3) is the best if your version of gcc supports it. (2) is next-best, if you compile and run on the same hardware. (1) is portable to other compilers; a fuller sketch of it in context follows below.

"256", unfortunately, is a guestimate, and hardware-dependent. 128 is a minimum, 512 a maximum, depending on your CPU:RAM speed. If you switch to _mm512*(), double those numbers.

If you are working across a range of processors, may I suggest compiling in a way that covers all cases, then testing cpuid(ax=0) >= 7, then cpuid(ax=7,cx=0):bx & 0x04000020 at runtime (0x20, bit 5, for AVX2; 0x04000000, bit 26, for AVX-512 prefetch).
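A minimal runtime check along those lines, sketched with GCC's <cpuid.h> (bit positions per the leaf-7, subleaf-0 EBX layout: AVX2 = bit 5, AVX-512F = bit 16, AVX-512PF = bit 26):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
 unsigned int eax, ebx, ecx, edx;

 // Leaf 7 is only valid if the maximum basic leaf is at least 7
 if (__get_cpuid_max(0, NULL) < 7)
 {
  puts("CPUID leaf 7 not available");
  return 0;
 }

 __cpuid_count(7, 0, eax, ebx, ecx, edx);
 printf("AVX2     : %s\n", (ebx & (1u << 5))  ? "yes" : "no");
 printf("AVX-512F : %s\n", (ebx & (1u << 16)) ? "yes" : "no");
 printf("AVX-512PF: %s\n", (ebx & (1u << 26)) ? "yes" : "no");
 return 0;
}

On the i7-4960HQ in the question, this should report AVX2 = yes and the AVX-512 bits = no, since Haswell has no AVX-512.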

BTW, if you are using gcc and specifying -mavx or -msse2, the compiler defines the builtin macros __AVX__ or __SSE2__ for you; no need for -DAVX256. To support archaic 32-bit processors, -m32 unfortunately disables __SSE2__ and hence effectively disables #include <emmintrin.h> :-P
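For example, the question's #ifdef dispatch could key off those compiler-defined macros instead of -DSSE / -DAVX256 (a sketch):

#if defined(__AVX__)
#include <immintrin.h>
#define ALIGN 32
#elif defined(__SSE2__)
#include <emmintrin.h>
#define ALIGN 16
#endif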

HTH

Mischa
  • @Mischa I tried your different options but still can't get a factor-4 speedup with the AVX flags and options. You can take a look at my C source at http://beulu.com/test_vectorization/main_benchmark.c.txt (with a .txt extension for direct viewing in the browser); the benchmarking shell script is at http://beulu.com/test_vectorization/run_benchmark . If you could take a look at these files, that would be great. Thanks. PS: I also give these files in UPDATE 1 in my first post. –  Apr 07 '17 at 10:00
  • It will take a few days to nail down an iCore7. Meanwhile I took your code and was surprised that SSE2 double-pair ops were only ~12% faster than simple doubles. If AVX double-quad ops are close to SSE2, then manual prefetch is no help; internal prefetch prediction is already optimal. Hmmm... – Mischa Apr 09 '17 at 05:13
  • @Mischa would you have another option to get a 4x speedup between the sequential version (I mean no vectorization) and the AVX version with double? Or is it perhaps the fault of my small benchmark code that I can't reach this factor? Any help is welcome, regards. –  Apr 19 '17 at 12:16
  • Sorry, I've had no time for follow-up. Specific hardware affects this a lot :-( On another tack, if you control the hardware, CUDA beats everything else. – Mischa Apr 19 '17 at 15:02
  • Haswell already has very good HW prefetch that includes prefetching into the next page (new with IvyBridge), and speculative TLB loading. This is the ideal case for HW prefetch, and SW prefetch is unlikely to do anything. If I was going to try anything, I'd try a bit of loop unrolling so more work could fit in the same amount of uops in the ROB, letting out-of-order execution better hide latency hiccups. – Peter Cordes Oct 31 '17 at 06:52
  • Triad (`c[i] = a[i]+b[i]`) is mostly a memory benchmark, and with `double` even scalar can come close to saturating memory bandwidth. (Although gcc should auto-vectorize it with `-O3` even without `-ffast-math`, so maybe that was happening with the "scalar" version, @youpilat13?) – Peter Cordes Oct 31 '17 at 06:54