5

I have been trying to get started with the AVX2 instructions, without a lot of luck (this list of functions has been helpful). In the end, I got my first program compiling and doing what I wanted. The program takes two u_char values and builds a double out of them. Essentially, I use this to decode data stored in an array of u_char coming from a camera, but I do not think that is relevant to this question.

The process of obtaining the double from the two u_char values is:

double result = sqrt(double((msb<<8) + lsb)/64);

where msb and lsb are the two u_char variables holding the most significant bits (msb) and the least significant bits (lsb) of the double to compute. The data is stored in an array representing a row-major matrix, where the msb and lsb of the value encoded in column i are in the second and third rows, respectively. I have coded this with and without AVX2:

void getData(u_char* data, size_t cols, std::vector<double>& info)
{
  info.resize(cols);
  for (size_t i = 0; i < cols; i++)
  {
    info[i] = sqrt(double((data[cols + i] << 8) + data[2 * cols + i]) / 64.0);
  }
}

void getDataAVX2(u_char* data, size_t cols, std::vector<double>& info)
{
  __m256d dividend = _mm256_set_pd(1 / 64.0, 1 / 64.0, 1 / 64.0, 1 / 64.0);
  info.resize(cols);
  __m256d result;
  for (size_t i = 0; i < cols / 4; i++)
  {
    __m256d divisor = _mm256_set_pd(double((data[4 * i + 3 + cols] << 8) + data[4 * i + 2 * cols + 3]),
                                    double((data[4 * i + 2 + cols] << 8) + data[4 * i + 2 * cols + 2]),
                                    double((data[4 * i + 1 + cols] << 8) + data[4 * i + 2 * cols + 1]),
                                    double((data[4 * i + cols] << 8) + data[4 * i + 2 * cols]));
    _mm256_storeu_pd(&info[0] + 4 * i, _mm256_sqrt_pd(_mm256_mul_pd(divisor, dividend)));
  }
}

However, to my surprise, this code is slower than the normal one. Any ideas on how to speed it up?

I am compiling with c++ (7.3.0) with the following options: -std=c++17 -Wall -Wextra -O3 -fno-tree-vectorize -mavx2. I have checked as explained here, and my CPU (Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz) supports AVX2.
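
For reference, this is roughly how such a runtime check can be done with GCC's __builtin_cpu_supports builtin (just a sketch, not part of my actual program):

#include <cstdio>

int main()
{
  // Returns non-zero if the CPU running the binary supports AVX2 (GCC/clang builtin).
  if (__builtin_cpu_supports("avx2"))
    std::printf("AVX2 supported\n");
  else
    std::printf("AVX2 not supported\n");
}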

To check which one is faster, I am measuring the time. The following function gives me a timestamp:

#include <sys/time.h>

inline double timestamp()
{
  struct timeval tp;
  gettimeofday(&tp, nullptr);
  return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}

I get a timestamp before and after each function (getData and getDataAVX2) and subtract them to get the elapsed time of each. The overall main is the following:

int main(int argc, char** argv)
{


  u_char data[] = {
0xf,  0xf,  0xf,  0xf,  0xf,  0xf,  0xf,  0xf,  0xf,  0xf,  0x11, 0xf,  0xf,  0xf,  0xf,  0xf,  0x10, 0xf,  0xf,
0xf,  0xf,  0xe,  0x10, 0x10, 0xf,  0x10, 0xf,  0xf,  0x10, 0xf,  0xf,  0xf,  0xf,  0xf,  0xf,  0x10, 0x10, 0xf,
0x10, 0xf,  0xe,  0xf,  0xf,  0x10, 0xf,  0xf,  0x10, 0xf,  0xf,  0xf,  0xf,  0x10, 0xf,  0xf,  0xf,  0xf,  0xf,
0xf,  0xf,  0xf,  0x10, 0xf,  0xf,  0xf,  0x10, 0xf,  0xf,  0xf,  0xf,  0xe,  0xf,  0xf,  0xf,  0xf,  0xf,  0x10,
0x10, 0xf,  0xf,  0xf,  0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2,
0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2,
0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2,
0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2,
0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xd3, 0xd1, 0xca, 0xc6, 0xd2, 0xd2, 0xcc, 0xc8, 0xc2, 0xd0, 0xd0,
0xca, 0xc9, 0xcb, 0xc7, 0xc3, 0xc7, 0xca, 0xce, 0xca, 0xc9, 0xc2, 0xc8, 0xc2, 0xbe, 0xc2, 0xc0, 0xb8, 0xc4, 0xbd,
0xc5, 0xc9, 0xbc, 0xbf, 0xbc, 0xb5, 0xb6, 0xc1, 0xbe, 0xb7, 0xb9, 0xc8, 0xb9, 0xb2, 0xb2, 0xba, 0xb4, 0xb4, 0xb7,
0xad, 0xb2, 0xb6, 0xab, 0xb7, 0xaf, 0xa7, 0xa8, 0xa5, 0xaa, 0xb0, 0xa3, 0xae, 0xa9, 0xa0, 0xa6, 0xa5, 0xa8, 0x9f,
0xa0, 0x9e, 0x94, 0x9f, 0xa3, 0x9d, 0x9f, 0x9c, 0x9e, 0x99, 0x9a, 0x97, 0x4,  0x5,  0x4,  0x5,  0x4,  0x4,  0x5,
0x5,  0x5,  0x4,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x4,  0x4,  0x4,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,
0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,
0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x5,  0x4,  0x4,  0x4,  0x5,  0x5,  0x5,  0x4,  0x4,
0x5,  0x5,  0x5,  0x5,  0x4,  0x5,  0x5,  0x4,  0x4,  0x6,  0x4,  0x4,  0x6,  0x5,  0x4,  0x5,  0xf0, 0xf0, 0xf0,
0xf0, 0xf0, 0xf0, 0xe0, 0xf0, 0xe0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0,
0xf0, 0xf0, 0xe0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0,
0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0,
0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0,
0xf0
  };
  size_t cols = 80;

  // Normal
  std::cout << "Computing with normal way" << std::endl;
  std::vector<double> info;
  double tstart_normal = timestamp();
  getData(data, cols, info);
  double time_normal = timestamp() - tstart_normal;

  // AVX2
  std::cout << "Computing with avx" << std::endl;
  std::vector<double> info_avx2;
  double tstart_avx2 = timestamp();
  getDataAVX2(data, cols, info_avx2);
  double time_avx2 = timestamp() - tstart_avx2;

  // Display difference
  std::cout << "Time normal: " << time_normal << " s" << std::endl;
  std::cout << "Time AVX2:   " << time_avx2 << " s" << std::endl;
  std::cout << "Time improvement AVX2: " << time_normal / time_avx2 << std::endl;

  // Write to file
  std::ofstream file;
  file.open("out.csv");
  for (size_t i = 0; i < cols; i++)
  {
    file << info[size_t(i)] << "," << info_avx2[size_t(i)];
    file << std::endl;
  }
  file.close();

  // Exit
  return 0;
}

The full example can be found here.

apalomer
  • 1,895
  • 14
  • 36
  • 1
    Fix the title and body of your question to remove the stuff about illegal instruction, so you aren't wasting people's time reading carefully until they get to the very end and find out you already solved that part. – Peter Cordes Aug 14 '18 at 05:55
  • 1
    Why are you converting the two integer halves to `double` separately before adding, instead of doing an integer byte-swap? **What compiler, what options, and what CPU are you testing on? All of that matters**. See https://stackoverflow.com/tags/x86/info for x86 performance links. – Peter Cordes Aug 14 '18 at 05:58
  • `info.resize(cols);` will fill new elements with zeros, right before you're about to store to them anyway. Most C++ compilers fail to optimize this away, unfortunately. You're not even doing `_mm256_storeu_pd` directly into the `std::vector`, though. If you're lucky, a compiler might optimize away `result` and store directly into the vector, but then there's no clear way to use `.clear()` / `.reserve()` / `emplace_back`. – Peter Cordes Aug 14 '18 at 06:04
  • Sorry about the Ubuntu 18.04-only information; I saved it halfway through editing. – apalomer Aug 14 '18 at 06:08
  • Ok, so you have a Haswell CPU and you did enable optimization, but you left out the gcc *version*. `c++` on Ubuntu 18.04 might still be gcc6.x. ([When will GCC be updated in 16.04 and 18.04?](https://askubuntu.com/q/1008744)), but IDK. You should probably enable `-O3 -march=native` to tune for your specific CPU (and enable FMA and other instruction-sets it supports besides just AVX2), but probably the biggest problem is terrible code-gen for `set_pd`; load a 128-bit vector and shuffle it yourself to byte-swap, and `vpmovzxwd` up to set up for packed int->double. – Peter Cordes Aug 14 '18 at 06:11
  • Actually, the code-gen is a lot less bad than I expected with g++7.3 or clang6.0 (https://godbolt.org/g/pXP9jR). But all the loading and int->double is done with scalar instructions, then shuffle to create a vector for `vmulpd` / `vsqrtpd`. I would have thought that'd be faster than pure scalar all the way, but maybe not. You probably still bottleneck on `vsqrtpd ymm` throughput of one per 16 to 28 cycles. The scalar or xmm version has twice the throughput, for 1/4 or half as much work per instruction, but maybe OoO scheduling works better. – Peter Cordes Aug 14 '18 at 06:28
  • I included a version in the Godbolt link in my last comment that does the integer combining *before* conversion to `double`. gcc still does everything scalar, but clang6.0 optimizes it with `vpmovzxbd` (`_mm_cvtepu8_epi32`) loads -> vector left-shift -> vector OR -> packed int->double conversion (`_mm256_cvtepi32_pd`). You could do the same thing with intrinsics to get gcc to emit similar asm, and see if that helps. Or just install clang. – Peter Cordes Aug 14 '18 at 06:32
  • I have tried adding the `-march=native` (the `-O3` was already there) but it did not improve things very much. I also incorporated some of the changes you mentioned (`_mm256_storeu_pd(&info[0] + 4 * i, _mm256_sqrt_pd(_mm256_mul_pd(divisor, dividend)));` and `double((data[4 * i + 3 + cols] << 8) + data[4 * i + 2 * cols + 3])`) with no luck. The "normal" code is still faster than the AVX2 one... – apalomer Aug 14 '18 at 06:38
  • How are you benchmarking this, exactly? Are you testing the scalar one 2nd, giving the CPU time to ramp up to max Turbo frequency? Your SIMD version should be running almost exactly twice as fast as your scalar version on Haswell, both totally bottlenecked on SQRT throughput. Unless you use a higher-throughput approximation for sqrt (like [`float` rcpss + a Newton iteration](https://stackoverflow.com/questions/31555260/) then convert to `double`), you can't speed this up any more than 2 `double`s per 8 to ~14 clock cycles on Haswell, according to https://agner.org/optimize/. – Peter Cordes Aug 14 '18 at 06:41
  • I have edited the question to incorporate how I test the time spent in each function. – apalomer Aug 14 '18 at 06:55
  • Do you really need this to be `double`? `float` would be about 4x faster. Twice as many elements per vector, and ~twice the per-instruction throughput for `vsqrtps ymm` vs. `vsqrtpd ymm`. Or if you want a faster approximate sqrt built with `_mm256_rsqrt_ps` + Newton-Raphson, then you'd only have about `float` precision anyway. Most high-performance computing uses `float` instead of `double` whenever possible. – Peter Cordes Aug 14 '18 at 07:21
  • I don't have OpenCV installed. Can you make a portable test-case I can try on my Skylake? Your results don't make any sense, unless memory allocation is making a huge difference. How short is your timing interval? Is it only microseconds or something? Like so short that you might be measuring the 256-bit AVX warm-up effect? (Agner Fog describes it for Skylake, but wasn't able to measure it himself on Haswell: https://www.agner.org/optimize/blog/read.php?i=415). You'd avoid that and go about as fast for sqrt in the fully-warmed-up case with 128-bit vectors. – Peter Cordes Aug 14 '18 at 07:24
  • I've added a chunk of matrix hardcoded. – apalomer Aug 14 '18 at 07:27
  • 2
    Were you always testing with inputs that tiny? For only one run, no wonder you didn't get sensible results, especially with the vector resize inside the timed portion. Did you try doing them in the other order to see if the 2nd one is always faster? Or wrap a repeat loop around them? Your SIMD version doesn't do extra work to handle `cols` not being a multiple of 4, and the asm doesn't look like it should have any extra startup overhead vs. scalar, so my only guess is too little work to time, or AVX 256-bit warm-up effects. – Peter Cordes Aug 14 '18 at 07:47
  • I was testing with a 2048-wide image. I have done the test with thousands of images, and then the AVX2 version does outperform the non-AVX implementation. – apalomer Aug 14 '18 at 07:54
  • In AVX mode, the corresponding core [***lowers***](https://en.wikichip.org/wiki/intel/frequency_behavior) its clock. This means that if you don't have enough data to feed it, the vector parallelism at the lower clock cannot compensate for the frequency you lost. – sandthorn Aug 14 '18 at 10:40

1 Answer

6

Such a tiny amount of work in the timed interval is hard to measure accurately. cols = 80 is only 20 __m256d vectors.

Your test program on my Skylake system bounces around between 9.53674e-07 s, 1.19209e-06 s and 0 s for the times, with the AVX2 version usually faster. (I had a _mm_pause() busy-loop running on another core to peg all the cores at max speed. It's a desktop i7-6700k so all cores share the same core clock frequency.)

gettimeofday is apparently nowhere near precise enough to measure anything that short. struct timeval uses seconds and micro-seconds, not nanoseconds. But I did fairly consistently see the AVX2 version being faster on Skylake, compiled with g++ -O3 -march=native. I don't have a Haswell to test on. My Skylake is using hardware P-state power management, so even if I didn't peg the CPU frequency ahead of time, it would ramp up to max very quickly. Haswell doesn't have that feature, so that's another reason things can be weird in yours.

If you want to measure wall-clock time (instead of core clock cycles), use std::chrono like a normal person. See Correct way of portably timing code using C++11.
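
For example, a minimal steady_clock-based replacement for your timestamp() helper could look like this sketch:

#include <chrono>

// Monotonic wall-clock timestamp in seconds since the first call,
// with (typically) nanosecond resolution.
inline double timestamp_chrono()
{
  using clock = std::chrono::steady_clock;
  static const clock::time_point t0 = clock::now();
  return std::chrono::duration<double>(clock::now() - t0).count();
}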


Warm-up effects are going to dominate, and you're including the std::vector::resize() inside the timed interval. The two different std::vector<double> objects have to allocate memory separately, so maybe the 2nd one needs to get a new page from the OS and takes way longer. Maybe the first one was able to grab memory from the free-list, if something before main (or something in cout <<) did some temporary allocation and then shrunk or freed it.

There are many possibilities here: first, some people have reported seeing 256-bit vector instructions run slower for the first few microseconds on Haswell, like Agner Fog measured on Skylake.

Possibly the CPU decided to ramp up to max turbo during the 2nd timed interval (the AVX2 one). That takes maybe 20k clock cycles on an i7-4700MQ (2.4GHz Haswell). (Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC).

Maybe after a write system call (from cout <<) the TLB misses or branch misses hurt more for the 2nd function? (With Spectre + Meltdown mitigation enabled in your kernel, you should expect code to run slow right after returning from a system call.)

Since you didn't use -ffast-math, GCC won't have turned your scalar sqrt into a rsqrtss approximation, especially because it's double not float. Otherwise that could explain it.


Look at how the time scales with the problem size to make sure your microbenchmark is sane, and unless you're trying to measure transient / warm-up effects, repeat the work many times. If it doesn't optimize away, just slap a repeat loop around the function call inside the timed interval (instead of trying to add up times from multiple intervals). Check the compiler-generated asm, or at least check that the time scales linearly with the repeat count. You might make the function __attribute__((noinline,noclone)) to keep the optimizer from optimizing across repeat-loop iterations.
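
A minimal sketch of such a repeat loop (run_once and bench_getData are made-up wrapper names around your getData and timestamp):

#include <sys/types.h>   // u_char
#include <vector>

void getData(u_char* data, size_t cols, std::vector<double>& info);  // from the question
double timestamp();                                                  // from the question

// noinline + noclone stops gcc from optimizing across repeat-loop iterations.
__attribute__((noinline, noclone))
static void run_once(u_char* data, size_t cols, std::vector<double>& info)
{
  getData(data, cols, info);
}

static double bench_getData(u_char* data, size_t cols, int repeats)
{
  std::vector<double> info;
  double t0 = timestamp();
  for (int r = 0; r < repeats; r++)
    run_once(data, cols, info);
  return (timestamp() - t0) / repeats;   // average seconds per call
}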


Outside of warm-up effects, your SIMD version should be about 2x as fast as scalar on your Haswell.

Both scalar and SIMD versions bottleneck on the divide unit, even with inefficient scalar calculation of inputs before merging into a __m256d. Haswell's FP divide/sqrt hardware is only 128 bits wide (so vsqrtpd ymm is split into two 128-bit halves). But scalar is only taking advantage of half the possible throughput.

float would give you a 4x throughput boost: twice as many elements per SIMD vector, and vsqrtps (packed-single) has twice the throughput of vsqrtpd (packed-double) on Haswell. (https://agner.org/optimize/). It would also make it easier to use x * approx_rsqrt(x) as a fast approximation for sqrt(x), probably with a Newton-Raphson iteration to get up from ~12 bit precision to ~24 (almost as accurate as _mm256_sqrt_ps). See Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision. (If you had enough work to do in the same loop that you didn't bottleneck on divider throughput, the actual sqrt instruction can be good.)
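
For example, an approximate float sqrt with one Newton-Raphson step could look like this untested sketch (note that rsqrt(0) is +inf, so an input of exactly 0 produces NaN and would need masking if your data can contain zeros):

#include <immintrin.h>

// sqrt(x) ~= x * rsqrt(x), refining the ~12-bit rsqrt estimate with one
// Newton-Raphson iteration to get close to full float precision.
static inline __m256 sqrt_approx_ps(__m256 x)
{
  __m256 y  = _mm256_rsqrt_ps(x);                          // ~12-bit estimate of 1/sqrt(x)
  __m256 yy = _mm256_mul_ps(y, y);
  // Newton-Raphson step: y = y * 0.5 * (3 - x*y*y)
  __m256 t  = _mm256_sub_ps(_mm256_set1_ps(3.0f), _mm256_mul_ps(x, yy));
  y = _mm256_mul_ps(_mm256_mul_ps(_mm256_set1_ps(0.5f), y), t);
  return _mm256_mul_ps(x, y);                              // x * (1/sqrt(x)) = sqrt(x)
}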

You could SIMD sqrt with float and then convert to double if you really need your output format to be double for compat with the rest of your code.
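
Widening 8 packed float results into two __m256d vectors is cheap, e.g. something like this sketch:

#include <immintrin.h>

// Convert 8 floats to two vectors of 4 doubles (low half, then high half).
static inline void cvt_ps_to_2pd(__m256 v, __m256d* lo, __m256d* hi)
{
  *lo = _mm256_cvtps_pd(_mm256_castps256_ps128(v));     // elements 0..3
  *hi = _mm256_cvtps_pd(_mm256_extractf128_ps(v, 1));   // elements 4..7
}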


Optimizing the stuff other than the sqrt:

This probably won't be any faster on Haswell, but it is probably more Hyperthreading-friendly if the other threads aren't using SQRT / DIV.

It uses SIMD to load and unpack the data: (a<<8) + b is best done by interleaving bytes from b and a to make 16-bit integers, with _mm_unpacklo/hi_epi8. Then zero-extend to 32-bit integers so we can use SIMD int->double conversion.

This results in 4 vectors of double for each pair of __m128i of data. Using 256-bit vectors here would just introduce lane-crossing problems and require extracting back down to 128-bit because of how _mm256_cvtepi32_pd(__m128i) works.

I changed to using _mm256_storeu_pd into the output directly, instead of hoping that gcc would optimize away the one-element-at-a-time assignment.

I also noticed that the compiler was reloading &info[0] after every store, because its alias-analysis couldn't prove that _mm256_storeu_pd was only modifying the vector data, not the control block. So I assigned the base address to a double* local variable that the compiler is sure isn't pointing to itself.

#include <immintrin.h>
#include <vector>

inline
__m256d cvt_scale_sqrt(__m128i vi){
    __m256d vd = _mm256_cvtepi32_pd(vi);
    vd = _mm256_mul_pd(vd, _mm256_set1_pd(1./64.));
    return _mm256_sqrt_pd(vd);
}

// assumes cols is a multiple of 16
// SIMD for everything before the multiple/sqrt as well
// but probably no speedup because this and others just bottleneck on that.
void getDataAVX2_vector_unpack(const u_char*__restrict data, size_t cols, std::vector<double>& info_vec)
{
  info_vec.resize(cols);    // TODO: hoist this out of the timed region

  double *info = &info_vec[0];  // our stores don't alias the vector control-block
                                // but gcc doesn't figure that out, so read the pointer into a local

  for (size_t i = 0; i < cols / 4; i+=4)
  {
      // 128-bit vectors because packed int->double expands to 256-bit
      __m128i a = _mm_loadu_si128((const __m128i*)&data[4 * i + cols]);   // 16 elements
      __m128i b = _mm_loadu_si128((const __m128i*)&data[4 * i + 2*cols]);
      __m128i lo16 = _mm_unpacklo_epi8(b,a);                // a<<8 | b  packed 16-bit integers
      __m128i hi16 = _mm_unpackhi_epi8(b,a);

      __m128i lo_lo = _mm_unpacklo_epi16(lo16, _mm_setzero_si128());
      __m128i lo_hi = _mm_unpackhi_epi16(lo16, _mm_setzero_si128());

      __m128i hi_lo = _mm_unpacklo_epi16(hi16, _mm_setzero_si128());
      __m128i hi_hi = _mm_unpackhi_epi16(hi16, _mm_setzero_si128());

      _mm256_storeu_pd(&info[4*(i + 0)], cvt_scale_sqrt(lo_lo));
      _mm256_storeu_pd(&info[4*(i + 1)], cvt_scale_sqrt(lo_hi));
      _mm256_storeu_pd(&info[4*(i + 2)], cvt_scale_sqrt(hi_lo));
      _mm256_storeu_pd(&info[4*(i + 3)], cvt_scale_sqrt(hi_hi));
  }
}

This compiles to a pretty nice loop on the Godbolt compiler explorer, with g++ -O3 -march=haswell.

To handle cols not being a multiple of 16, you'll need another version of the loop, or padding or something.
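
One option is to stop the SIMD loop before it would overrun and finish the last cols % 16 columns with a scalar cleanup, e.g. this sketch (done = cols rounded down to a multiple of 16, i.e. the columns the 16-at-a-time loop already handled):

#include <cmath>
#include <sys/types.h>   // u_char

// Scalar tail: same math as the question's getData(), starting at column `done`.
static void getData_tail(const u_char* data, size_t cols, size_t done, double* info)
{
  for (size_t i = done; i < cols; i++)
    info[i] = std::sqrt(double((data[cols + i] << 8) + data[2 * cols + i]) / 64.0);
}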

But having fewer instructions other than vsqrtpd doesn't help at all with that bottleneck.

According to IACA, all the SIMD loops on Haswell bottleneck on the divider unit, 28 cycles per vsqrtpd ymm, even your original which does a large amount of scalar work. 28 cycles is a long time.

For large inputs, Skylake should be a bit more than twice as fast because of its improved divider throughput. But float would still be a ~4x speedup, or more with vrsqrtps.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Thanks!! I changed to `std::chrono::high_resolution_clock` for the benchmarking. I managed to get it to work with `float`, obtaining the same numerical results (meaning that the computations are all right). As you mentioned, the code is faster than in `double`. But to my surprise, it is 6-7x faster than the normal code. When you said that changing from `double` to `float` would get me 4x speed, did you mean from the `double` AVX code or from the original non-AVX function? You can see the implementation for `float` [here](https://github.com/apalomer/so_example_avx/blob/master/src/test_avx2_sqrt.cpp) – apalomer Aug 15 '18 at 00:14
  • 1
    @apalomer: I meant over SIMD `double` using 128 or 256-bit vectors. We'd expect about 8x speedup for scalar `double` -> SIMD `float`, from the divider throughput bottleneck for large problem sizes. If your inputs weren't all worst-case throughput for scalar double sqrt, that would explain the speedup factor being a bit less than 8x. – Peter Cordes Aug 15 '18 at 00:19