
I have a strange issue with some SSE2 and AVX code I have been working on. I am building my application using GCC with runtime CPU feature detection. The object files are built with separate flags for each CPU feature, for example:

g++ -c -o ConvertSamples_SSE.o ConvertSamples_SSE.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse
g++ -c -o ConvertSamples_SSE2.o ConvertSamples_SSE2.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse2
g++ -c -o ConvertSamples_AVX.o ConvertSamples_AVX.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -mavx

When I first launch the program, the SSE2 routines perform as normal, with a nice speed boost over the non-SSE routines (around 100% faster). After I run any AVX routine, the exact same SSE2 routine runs much slower.

Could someone please explain what the cause of this may be?

Before the AVX routine runs, all the tests are around 80-130% faster than FPU math; after the AVX routine runs, the SSE routines are much slower.

If I skip the AVX test routines I never see this performance loss.

Here is my SSE2 routine

void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float  ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m128 mul   = _mm_set_ps1(ratio);

  unsigned int i;
  for (i = 0; i + 4 <= samples; i += 4, in += 4, out += 4) // i + 4 <= samples avoids unsigned underflow when samples < 4
  {
    __m128i con = _mm_cvtps_epi32(_mm_mul_ps(_mm_load_ps(in), mul));
    // take the low 16 bits of each 32-bit lane (even int16_t indices)
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
  }

  for (; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}
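
(As an aside, not related to the slowdown: the four scalar lane extractions can be replaced by a single saturating pack and one 64-bit store. A sketch, not the code above, and note that `_mm_packs_epi32` saturates out-of-range values where the cast above truncates:)

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Sketch: narrow four int32 lanes to int16 with one saturating pack
// and a single 64-bit store, instead of four scalar extractions.
static inline void store4_s16(__m128i con, int16_t *out)
{
    __m128i packed = _mm_packs_epi32(con, con);                 // 8 x int16; low 4 hold the result
    _mm_storel_epi64(reinterpret_cast<__m128i*>(out), packed);  // store the low 64 bits
}
```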

And the AVX version of the same.

void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m256 mul  = _mm256_set1_ps(ratio);

  unsigned int i;
  for (i = 0; i + 8 <= samples; i += 8, in += 8, out += 8) // i + 8 <= samples avoids unsigned underflow when samples < 8
  {
    __m256i con = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_load_ps(in), mul));
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
    out[4] = ((int16_t*)&con)[8];
    out[5] = ((int16_t*)&con)[10];
    out[6] = ((int16_t*)&con)[12];
    out[7] = ((int16_t*)&con)[14];
  }

  for(; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}

I have also run this through Valgrind, which detects no errors.

  • How is the time measured? – Gilles Oct 15 '15 at 13:26
  • @Gilles using `clock_gettime(CLOCK_MONOTONIC, &start);` before and after, then calculating the difference. – Geoffrey Oct 15 '15 at 13:27
  • I've run into curious problems with mixed SSEx and AVX code, mostly because of link-time code generation problems. Look at (and maybe post) your assembly files. – Christopher Oct 15 '15 at 13:31
  • OK, it seems this is not actually faster; by dumping out the metrics I can see that everything is slower. Not sure why, though it does seem to be linker-related. – Geoffrey Oct 15 '15 at 13:50
  • Do you think it's related to this? [Intel: Avoiding AVX-SSE Transition Penalties](https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties); [Intel® AVX State Transitions: Migrating SSE Code to AVX](https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx) – Nayuki Oct 15 '15 at 13:51
  • @NayukiMinase You are dead on the money! Adding a `_mm256_zeroall` call to the end of each of my AVX methods has resolved the issue; please submit your answer and I will accept it. – Geoffrey Oct 15 '15 at 13:59
  • Okay... but I think your problem description is a bit strange. You're getting a performance boost in SSE by executing AVX code, instead of a performance penalty? – Nayuki Oct 15 '15 at 14:06
  • @NayukiMinase. No I am not, I thought I was but it turned out that the baseline code had slowed right down due to gcc performing SSE optimizations. I have fixed the question. – Geoffrey Oct 15 '15 at 14:10
  • Besides the mixing AVX/SSE issue, don't benchmark with `-O0`, it's silly. Use at least `-Og`, preferably `-O3`. – Peter Cordes Oct 15 '15 at 23:44
  • @PeterCordes, the benchmark is a relative indicator, more of a sanity check. -O0 prevents strangeness when stepping code in the debugger. – Geoffrey Oct 16 '15 at 03:47
  • @Geoffrey: `-Og` makes debuggable code that can be single-stepped line-by-line. It's apparently the suggested option for edit/compile/debug cycles. Benchmarking with `-O0` is sometimes useful, sometimes not. It's possible for source A to be faster than source B with `-O3`, but slower with `-O0`. I can't think of any examples with a huge difference, but if you're ever going to look at timing numbers, use `-Og`. – Peter Cordes Oct 17 '15 at 19:18
  • @PeterCordes: Thanks, I was unaware of this flag, I will use it in the future, but for the purposes of this example it did not matter. – Geoffrey Oct 21 '15 at 06:52
  • Stack Overflow's canonical Q&A for SSE/AVX transition penalties is [Why is this SSE code 6 times slower without VZEROUPPER on Skylake?](//stackoverflow.com/q/41303780), where the answer explains both SKL and HSW/ICL. Pointing future readers there makes the most sense. It's 100% normal for old questions to be closed as duplicates when there's a newer more-canonical Q&A. (Also, compilers these days don't require manual `vzeroupper`; that's done by default in AVX code that calls unknown functions, unless you use ` -mno-vzeroupper`. Also, VZEROUPPER is faster than VZEROALL.) – Peter Cordes Jan 28 '22 at 03:22
  • You've been on SO for 10 years, and have almost 10k rep; surprised this is the first time you've seen newer canonical duplicates get used. e.g. on meta: [Should we really mark new questions as duplicates of old crappy ones?](https://meta.stackoverflow.com/q/258697) the consensus answer is no. (This question isn't "crappy", but it's about the same problem and the newer Q&A explains it for all Intel CPUs, covering both styles of penalty.) Basically this whole question is outdated because compilers use `vzeroupper` for you, so it seemed like *something* should change. – Peter Cordes Jan 28 '22 at 03:28
  • Fair enough, feel free to close it then.... just didn't make much sense to me. – Geoffrey Jan 28 '22 at 03:34

1 Answer


Mixing 256-bit AVX code and legacy (non-VEX) SSE code incurs a state-transition penalty. The standard fix is to execute the VZEROALL (or the cheaper VZEROUPPER) instruction at the end of each AVX section, before any legacy SSE code runs.

As per Intel's transition diagram, the penalty when transitioning into or out of state C (legacy SSE with the upper halves of the AVX registers saved) is on the order of 100 clock cycles; the other transitions cost only one cycle.
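
Concretely, the fix looks like the sketch below. The routine and its name are illustrative, not the asker's code; the `target` attribute (GCC/clang) lets the snippet compile even when the translation unit is not built with `-mavx`, so guard the call with a runtime AVX check:

```cpp
#include <immintrin.h>

// Illustrative AVX routine that ends with VZEROUPPER, so legacy SSE
// code executed afterwards avoids the state-transition penalty.
// Only call this after confirming AVX support at runtime.
__attribute__((target("avx")))
void scale8(const float *in, float *out, float ratio)
{
    __m256 v = _mm256_mul_ps(_mm256_loadu_ps(in), _mm256_set1_ps(ratio));
    _mm256_storeu_ps(out, v);
    _mm256_zeroupper();  // clear the upper ymm halves (cheaper than _mm256_zeroall)
}
```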

References:

  • [Intel: Avoiding AVX-SSE Transition Penalties](https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties)
  • [Intel® AVX State Transitions: Migrating SSE Code to AVX](https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx)

  • This issue can cause drastic effects which seem totally unrelated to this penalty. See [this question](http://stackoverflow.com/q/21960229/2542702) where the OP saw a speed-up with over 500 threads on a system that only had eight hyper-threads. – Z boson Oct 16 '15 at 08:37