
I test the following simple function

void mul(double *a, double *b) {
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

with very large arrays so that it is memory-bandwidth bound. The test code I use is below. When I compile with -O2 it takes 1.7 seconds; when I compile with -O2 -mavx it takes only 1.0 seconds. The non-VEX-encoded scalar operations are 70% slower! Why is this?

Here is a vimdiff of the assembly for `-O2` and `-O2 -mavx`:

https://godbolt.org/g/w4p60f

System: i7-6700HQ @ 2.60 GHz (Skylake), 32 GB RAM, Ubuntu 16.10, GCC 6.3

Test code

//gcc -O2 -fopenmp test.c
//or
//gcc -O2 -mavx -fopenmp test.c
#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <omp.h>

#define N 1000000
#define R 1000

void mul(double *a, double *b) {
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

int main() {
  double *a = (double*)_mm_malloc(sizeof *a * N, 32);
  double *b = (double*)_mm_malloc(sizeof *b * N, 32);

  //b must be initialized to get the correct bandwidth!!!
  memset(a, 1, sizeof *a * N);
  memset(b, 1, sizeof *b * N);

  double dtime;
  //GiB transferred: 3 streams per element (load a, load b, store a), N elements, R repeats
  const double mem = 3.0*sizeof(double)*N*R/1024/1024/1024; //3.0 forces floating-point division
  const double maxbw = 34.1;
  dtime = -omp_get_wtime();
  for(int i=0; i<R; i++) mul(a,b);
  dtime += omp_get_wtime();
  printf("time %.2f s, %.1f GB/s, efficency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);

  _mm_free(a), _mm_free(b);
}
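
For reference, maxbw = 34.1 presumably corresponds to the peak bandwidth of dual-channel DDR4-2133 (2 channels x 2133 MT/s x 8 bytes = 34.1 GB/s), and mem works out to 3 x 8 bytes x N x R / 2^30, approximately 22.4 GiB per run.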
Z boson
  • FWIW I get around 0.8s for both on a lowly 2.6 GHz mobile Haswell CPU, compiling with clang. – Paul R Apr 06 '17 at 13:51
  • @PaulR, thanks for checking. I can test it later on my Haswell system. I am getting strange results on my Skylake system that I don't get on Haswell so I would not be surprised. – Z boson Apr 06 '17 at 14:00
  • @PaulR, I just figured it out! `__asm__ __volatile__ ( "vzeroupper" : : : );` right after the calls to `omp_get_wtime()` fixes it. – Z boson Apr 06 '17 at 14:01
  • I must be seeing this issue http://stackoverflow.com/a/42328369/2542702. I thought I did not have this problem with Ubuntu 16.10 but apparently I do. – Z boson Apr 06 '17 at 14:02
  • Aha - I wasn't using OMP for timing. My times are actually now more like 0.75 s (I was including the memory alloc/init/free before). – Paul R Apr 06 '17 at 14:02
  • I have to go. I will look into this more tomorrow. – Z boson Apr 06 '17 at 14:03
  • Apparently not only glibc 2.23 sometimes returns a dirty upper state, but also the OpenMP lib. I wouldn't be surprised if there exist other libraries which return without a proper `vzeroupper` as well. Just insert a `vzeroupper` after each library call if you want to be 100% sure to avoid this problem on Skylake with non-VEX encoded SSE code. – wim Apr 06 '17 at 14:40
  • With the Intel compiler you can try `#ifndef __AVX__ ` (newline) `if (_may_i_use_cpu_feature (_FEATURE_AVX)) __asm__ __volatile__ ( "vzeroupper" : : : );` (newline) `#endif` (untested). Maybe it works?? This only inserts a `vzeroupper` if the source was compiled without AVX, but only executes a `vzeroupper` if the system supports it. – wim Apr 06 '17 at 15:39
  • A dirty upper state will certainly explain this on Skylake and not Haswell. On Haswell, you pay the large penalty once to enter the split state - then you can run at full speed. On Skylake you pay little transition penalty, but you're stuck with false dependencies for the entire benchmark. – Mysticial Apr 06 '17 at 15:41
  • @wim, my guess is that `omp_get_wtime()` calls `gettimeofday` or some other glibc function. I think the problem is that the first time it is called it uses a CPU dispatcher and this leaves it dirty. I only need to use `vzeroupper` after the first call to `omp_get_wtime()` to fix the problem. Somebody else found the problem in `_dl_runtime_resolve_avx()`. That looks like some kind of dispatcher to me. I can step through with gdb (if I can figure out how to use it) to find out. – Z boson Apr 07 '17 at 07:00
  • @wim `omp_get_wtime` calls `clock_gettime`. And `clock_gettime` calls `_dl_runtime_resolve_avx`. My guess is this is where the problem is. – Z boson Apr 07 '17 at 07:18
  • Great! Good to know that a `vzeroupper` is only needed after the first library call. – wim Apr 07 '17 at 07:23
  • Clang does not have the problem with `clock_gettime`. Clang even gets a higher bandwidth for the scalar operations. – Z boson Apr 07 '17 at 07:29
  • I forgot Clang vectorizes at O2. – Z boson Apr 07 '17 at 07:36
  • Okay, Clang has the same problem as GCC when I use `-O2 -fno-vectorize` – Z boson Apr 07 '17 at 07:38
  • @wim, strangely with Clang I have to use `__asm__ __volatile__ ( "vzeroupper" : : : );` after every call to `clock_gettime`, but not with GCC. With GCC I only have to do it once. – Z boson Apr 07 '17 at 07:40
  • @wim I answered my own question with my solution for now. – Z boson Apr 07 '17 at 12:00

1 Answer


The problem is related to a dirty upper half of an AVX register after calling omp_get_wtime(). This is a problem particularly for Skylake processors.

The first time I read about this problem was here. Since then other people have observed this problem: here and here.
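
To see what a dirty upper state means in isolation, here is a minimal sketch (a hypothetical helper, not part of my test code; it assumes an AVX-capable CPU and uses inline asm so the compiler cannot insert its usual automatic vzeroupper at function exit):

//Hypothetical: execute 256-bit VEX instructions and return without
//vzeroupper, leaving the upper halves of ymm0/ymm1 dirty. Which xmm
//registers a compiled SSE loop actually uses is compiler-dependent.
void dirty_uppers(void) {
  __asm__ __volatile__ ( "vaddps %%ymm0, %%ymm0, %%ymm0\n\t"
                         "vaddps %%ymm1, %%ymm1, %%ymm1" : : : "xmm0", "xmm1" );
}

Calling this right before the timed loop of a plain -O2 (non-VEX) build should deliberately reproduce the slow scalar timing on Skylake, since every subsequent non-VEX SSE instruction then carries a false dependency on the full 256-bit registers; a vzeroupper afterwards should restore the fast timing.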

Using gdb I found that omp_get_wtime() calls clock_gettime. So I rewrote my code to call clock_gettime() directly, and I see the same problem.

void fix_avx() { __asm__ __volatile__ ( "vzeroupper" : : : ); } //clear the dirty upper halves
void fix_sse() { } //no-op for CPUs without AVX
void (*fix)(); //set once at startup by dispatch()

double get_wtime() {
  struct timespec time;
  clock_gettime(CLOCK_MONOTONIC, &time);
  #ifndef __AVX__ //only needed when this file is compiled without AVX
  fix();
  #endif
  return time.tv_sec + 1E-9*time.tv_nsec;
}

void dispatch() {
  fix = fix_sse;
  #if defined(__INTEL_COMPILER)
  if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx;
  #else
  #if defined(__GNUC__) && !defined(__clang__)
  __builtin_cpu_init(); //GCC needs this before __builtin_cpu_supports; Clang does not provide it
  #endif
  if(__builtin_cpu_supports("avx")) fix = fix_avx;
  #endif
}

Stepping through the code with gdb I see that the first time clock_gettime is called it calls _dl_runtime_resolve_avx(). I believe the problem is in this function, based on this comment. This function appears to be called only the first time clock_gettime is called.
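
Since the resolver runs only on the first call, this suggests the workaround tested by the commented-out lines in the full listing below: make one throwaway timer call before the timed region. A minimal sketch (it assumes the dispatch()/get_wtime() definitions above; with GCC this one call is enough, with Clang under lazy binding it is not, see below):

//Warm up the timer once before any timed region: the first call goes
//through _dl_runtime_resolve_avx(), and get_wtime() then runs fix()
//(vzeroupper) so later calls start with a clean upper state.
static void warmup_timer(void) {
  dispatch();        //choose fix_avx or fix_sse for this CPU
  (void)get_wtime(); //throwaway call: resolve the symbol, then clean up
}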

With GCC the problem goes away using __asm__ __volatile__ ( "vzeroupper" : : : ); after the first call to clock_gettime. However, with Clang (using clang -O2 -fno-vectorize, since Clang vectorizes even at -O2) it only goes away using it after every call to clock_gettime.

Here is the code I used to test this (with GCC 6.3 and Clang 3.8):

#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <time.h>

void fix_avx() { __asm__ __volatile__ ( "vzeroupper" : : : ); }
void fix_sse() { }
void (*fix)();

double get_wtime() {
  struct timespec time;
  clock_gettime(CLOCK_MONOTONIC, &time);
  #ifndef __AVX__
  fix();
  #endif
  return time.tv_sec + 1E-9*time.tv_nsec;
}

void dispatch() {
  fix = fix_sse;
  #if defined(__INTEL_COMPILER)
  if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx;
  #else
  #if defined(__GNUC__) && !defined(__clang__)
  __builtin_cpu_init();
  #endif
  if(__builtin_cpu_supports("avx")) fix = fix_avx;
  #endif
}

#define N 1000000
#define R 1000

void mul(double *a, double *b) {
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

int main() {
  dispatch();
  const double mem = 3.0*sizeof(double)*N*R/1024/1024/1024; //3.0 forces floating-point division
  const double maxbw = 34.1;

  double *a = (double*)_mm_malloc(sizeof *a * N, 32);
  double *b = (double*)_mm_malloc(sizeof *b * N, 32);

  //b must be initialized to get the correct bandwidth!!!
  memset(a, 1, sizeof *a * N);
  memset(b, 1, sizeof *b * N);

  double dtime;
  //dtime = get_wtime(); // call once to fix GCC
  //printf("%f\n", dtime);
  //fix = fix_sse;

  dtime = -get_wtime();
  for(int i=0; i<R; i++) mul(a,b);
  dtime += get_wtime();
  printf("time %.2f s, %.1f GB/s, efficency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);

  _mm_free(a), _mm_free(b);
}

If I disable lazy function call resolution with -z now (e.g. clang -O2 -fno-vectorize -z now foo.c) then Clang only needs __asm__ __volatile__ ( "vzeroupper" : : : ); after the first call to clock_gettime just like GCC.

I expected that with -z now I would only need __asm__ __volatile__ ( "vzeroupper" : : : ); once, right after entering main(), but I still need it after the first call to clock_gettime.
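
To summarize the binding variants (same invocations as above; the LD_BIND_NOW environment variable does the same thing at run time, as osgx notes in the comments):

//clang -O2 -fno-vectorize foo.c          lazy binding (default): vzeroupper needed after every clock_gettime call
//clang -O2 -fno-vectorize -z now foo.c   eager binding: vzeroupper needed only after the first call, like GCC
//LD_BIND_NOW=1 ./a.out                   eager binding at run time, same effect as -z now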

Z boson
  • Nice code! From [this gcc webpage](https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html) it was not clear to me whether you have to call `__builtin_cpu_init (void)` before calling `__builtin_cpu_supports("avx")` or not. Did you test your code on an old non-AVX cpu? – wim Apr 07 '17 at 13:19
  • @wim, `dispatch` should not have been commented. That was because I was testing GCC only needing to call `vzeroupper` once instead of after every call. I did not know about `__builtin_cpu_init`. It worked without it (though I don't have a system without AVX to test on). I added it to my answer just to be safe. – Z boson Apr 07 '17 at 13:24
  • `_dl_runtime_resolve_avx` is called only **at first call** to some function from different shared library file. Try disabling lazy binding (http://man7.org/linux/man-pages/man1/ld.1.html - "lazy .. tell the dynamic linker to defer function call resolution to the point when the function is called (lazy binding), rather than at load time. Lazy binding is the default.") with `export LD_BIND_NOW=1` (http://man7.org/linux/man-pages/man8/ld.so.8.html - "resolve all symbols at program startup instead of deferring") to disable calling of `_dl_runtime_resolve_avx` at runtime. – osgx Apr 23 '17 at 20:13
  • @osgx `-z now` or `export LD_BIND_NOW=1` only make a difference with Clang. For some reason with Clang I need `__asm__ __volatile__ ( "vzeroupper" : : : );` after every call to `clock_gettime(CLOCK_MONOTONIC, &time);` unless I use `-z now` or `export LD_BIND_NOW=1`. With GCC I only need it after the first call even without `-z now` or `export LD_BIND_NOW=1`. – Z boson Apr 24 '17 at 07:25
  • @wim Clang 3.9 does not support `__builtin_cpu_init`. but it does support the other builtins in my code. – Z boson Apr 24 '17 at 07:26