SSE and AVX intrinsics mixture

Question

In addition to SSE-copy, AVX-copy and std::copy performance. Suppose that we need to vectorize some loop in following manner: 1) vectorize first loop-batch (which is multiple by 8) via AVX. 2) split loop's remainder into two batches. Vectorize the batch that is a multiple of 4 via SSE. 3) Process residual batch of entire loop via serial routine. Let's consider example of copying arrays:

#include <immintrin.h>

template<int length,
         int unroll_bound_avx = length & (~7),
         int unroll_tail_avx  = length - unroll_bound_avx,
         int unroll_bound_sse = unroll_tail_avx & (~3),
         int unroll_tail_last = unroll_tail_avx - unroll_bound_sse>
void simd_copy(float *src, float *dest)
{
    auto src_  = src;
    auto dest_ = dest;

    //Vectorize first part of loop via AVX
    for(; src_!=src+unroll_bound_avx; src_+=8, dest_+=8)
    {
         __m256 buffer = _mm256_load_ps(src_);
         _mm256_store_ps(dest_, buffer);
    }

    //Vectorize remainder part of loop via SSE
    for(; src_!=src+unroll_bound_sse+unroll_bound_avx; src_+=4, dest_+=4)
    {
        __m128 buffer = _mm_load_ps(src_);
        _mm_store_ps(dest_, buffer);
    }

    //Process residual elements
    for(; src_!=src+length; ++src_, ++dest_)
        *dest_ = *src_;
}

int main()
{  
    const int sz = 15;
    float *src = (float *)_mm_malloc(sz*sizeof(float), 16);
    float *dest = (float *)_mm_malloc(sz*sizeof(float), 16);
    float a=0;
    std::generate(src, src+sz, [&](){return ++a;});

    simd_copy<sz>(src, dest);

    _mm_free(src);
    _mm_free(dest);
}

Is it correct to use both SSE and AVX? Do I need to avoid AVX-SSE transitions?

You can mix all you want. Just make sure you have the right compiler flag enabled to force all SIMD instructions to VEX encoding. — Mysticial, Aug 19 '13 at 17:29
@Mystical, compiler - gcc 4.7., flags -O2 -msse -msse2 -msse4.2 -mavx -mfpmath=sse. Is this correct? — gorill, Aug 19 '13 at 17:31
Yes, that's fine. Although `-mavx` is all you need. Specifying any SIMD option automatically enables all the ones below it. — Mysticial, Aug 19 '13 at 17:33
@Mystical, do I understand that -msse -msse2 -msse4.2 flags is not need? — gorill, Aug 19 '13 at 17:36
Correct. Any processor with AVX is guaranteed to have to SSE, SSE2, SSE4.2. — Mysticial, Aug 19 '13 at 17:38

score 10 · Accepted Answer · edited May 23 '17 at 12:01

10

You can mix SSE and AVX intrinsics all you want.

The only thing you want to make sure is to specify the correct compiler flag to enable AVX.

GCC: -mavx
Visual Studio: /arch:AVX

Failing to do so will either result in the code not compiling (GCC), or in the case of Visual Studio,
this kind of crap:

Using AVX CPU instructions: Poor performance without "/arch:AVX"

What the flag does is that it forces all SIMD instructions to use VEX encoding to avoid the state-switching penalties described in the question above.

edited May 23 '17 at 12:01

Community

1
1

answered Aug 19 '13 at 18:29

Mysticial

464,885
45
335
332

What about alignment? AVX 256 requires the data to be aligned on 32 bytes boundary, while SSE needs 16 bytes boundary. If you mix them, you need to align your data to 32 bytes, or align to 16 bytes and use unaligned AVX load/stores which is worse than the latter case I guess. – plasmacel Nov 16 '16 at 20:41
@plasmacel Alignment is a completely different topic that's irrelevant to mixing of SSE and AVX instructions. The mixing here is only about the instructions themselves and not the operands that they may take. – Mysticial Nov 16 '16 at 22:16

Leeor · Answer 2 · 2013-08-19T22:44:12.000

1

I humbly beg to differ - I would advise to try not to mix SSE and AVX, please read in the link Mystical wrote, it warns against such a mixture (although not stressing it hard enough). The question there is about different code paths for different machines according to AVX support, so there's no mixture - in your case the mix is very fine grained and would be destructive (incure internal delays due to the micro-architectural implementation).

To clarify - Mystical is right about the vex prefix in compilation, without it you'd be in a pretty bad shape as you incure SSE2AVX assists everytime since the upper parts of your YMM registers can't be ignored (unless explicitly using vzeroupper). However, there are more subtle effects even when using 128b AVX mixed with 256b AVX.

I also don't see the benefit of using SSE here, in you have a long loop (say N>100) you could get the benefit from AVX for the most part of it, and do the remainder in scalar code up to 7 iterations (you code may still have to do 3 of them). The performance loss is nothing compared to mixing AVX/SSE

Some more info on mixture - http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

edited Aug 19 '13 at 22:44

answered Aug 19 '13 at 19:39

Leeor

19,260
5
56
87

2

You should clarify. Do not mix *legacy-encoded* SSE and VEX-encoded AVX. If you use SSE *intrinsics* with the AVX compiler flags, then the SSE intrinsics will compile to VEX-encoded SSE. It is perfectly fine to mix VEX-encoded SSE with VEX-encoded AVX. – Mysticial Aug 19 '13 at 19:43
@Mystical: Quoting Intel's optimization guide - "With the exception of MMX instructions, almost all legacy 128-bit SSE instructions have AVX equivalents that support three operand syntax". Emphasis on almost. You're right in that it shouldn't cost anything to mix AVX256 with AVX128, since it's zeroing the upper part, but i'd still be extremely careful and check that all my legacy SSE code is indeed converted properly, and be wary of claims like "You can mix SSE and AVX intrinsics all you want". Having That said, I also don't see any reason to mix 128b code in the above case – Leeor Aug 19 '13 at 20:00
1

Can you give me example of a 128-bit SSE instruction that doesn't have VEX-encoded 128-bit AVX equivalent? I'd be surprised if any of them were actually performance critical instructions or were affected by the state-changes. – Mysticial Aug 19 '13 at 20:03
I think I remember something, i'll have to look for it. Keep in mind though that state changes are not only data-dependance blocks, they'd causes assists which block everything for a while. see here - http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf , btw - this link also states that part of the compiler solution you gave relies on adding vzeroupper at the beginning of the function, which means the mix can not be intra-function – Leeor Aug 19 '13 at 20:23
2

And I'm saying that if you specify AVX to the compiler, it will do VEX-encoding so you don't even need `vzeroupper`. It's when you're calling into other modules/compilation units that aren't AVX-aware do you need to call `vzeroupper`. Likewise, when AVX-unaware code calls into a module that uses AVX. (And if you don't, you pay the penalty once for the cross-module call. That's how it was intentionally designed.) But within the same module you do not need to emit `vzeroupper`. The compiler already does everything VEX-encoded. – Mysticial Aug 19 '13 at 20:52
2

As for the "missing" instructions that you refer to which don't have VEX-encoding - I bet that they probably don't matter. Stuff like prefetch or extraction to general purpose registers don't need a VEX-encoding since they're 2-operand and only read the bottom 128-bits of the register. (So there's no need to unshelve the upper 128-bits from whatever external storage is used.) – Mysticial Aug 19 '13 at 20:55

SSE and AVX intrinsics mixture

2 Answers2