Are older SIMD-versions available when using newer ones?

Question

If I can use SSE3 or AVX, are older SSE versions such as SSE2 or MMX then also available,
or do I still need to check for them separately?

As a general rule, you should probably check for a capability before you use it. However, the CPUID instructions that determine whether you have SSE3 or AVX will also determine whether you have SSE2 or MMX. If you just save the outputs of those CPUID instructions to appropriate variables, you can perform a single bit test whenever you want to use a specific instruction. — Ken P, May 20 '15 at 16:50
This has come up before on SO, but I can't seem to find the duplicate at the moment... — Paul R, May 20 '15 at 16:59
Intel CPUs are always backward compatible. Therefore if it supports an instruction set then it'll support all older versions — phuclv, May 20 '15 at 17:12

Chuck Walbourn · Accepted Answer · 2020-07-14T03:15:42.817

In general, these have been additive but keep in mind that there are differences between Intel and AMD support for these over the years.

If you have AVX, then you can assume SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2 as well. Remember that to use AVX you also need to verify that the OSXSAVE CPUID bit is set, to ensure that the OS you are using actually supports saving the AVX registers.

You should still explicitly check for every CPUID feature your code uses, for robustness (for example, checking AVX, OSXSAVE, SSE4.2, SSE4.1, SSSE3, and SSE3 together to guard your AVX code paths).

#include <intrin.h>

inline bool IsAVXSupported()
{
#if defined(_M_IX86 ) || defined(_M_X64)
    int CPUInfo[4] = {-1};
    __cpuid( CPUInfo, 0 );

    if ( CPUInfo[0] < 1 )
        return false;

    __cpuid(CPUInfo, 1 );

    int ecx = 0x10000000 // AVX
              | 0x8000000 // OSXSAVE
              | 0x100000 // SSE 4.2
              | 0x80000 // SSE 4.1
              | 0x200 // SSSE3
              | 0x1; // SSE3

    if ( ( CPUInfo[2] & ecx ) != ecx )
        return false;

    return true;
#else
    return false;
#endif
}
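
One detail the check above leaves out: the OSXSAVE bit only reports that the OS has enabled XGETBV. Whether the OS actually saves the YMM registers is reported in XCR0, where bit 1 covers the XMM state and bit 2 the upper YMM state. Here is a minimal sketch of that extra step. The names `OsSavesYmmState` and `IsAVXUsable` are made up for illustration, and the non-MSVC branch (which uses `<cpuid.h>` and inline assembly) is my assumption about how you would port it, not part of the code above.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#include <immintrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// XCR0 bit 1 = SSE (XMM) state, bit 2 = AVX (upper YMM) state.
// Both must be set before AVX instructions are safe to use.
constexpr bool OsSavesYmmState(std::uint64_t xcr0)
{
    return (xcr0 & 0x6) == 0x6;
}

// Hypothetical helper: CPUID says AVX + OSXSAVE, and XCR0 confirms OS support.
inline bool IsAVXUsable()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int info[4] = {0};
    __cpuid(info, 0);
    if (info[0] < 1)
        return false;

    __cpuid(info, 1);
    const int need = 0x10000000 | 0x8000000; // AVX | OSXSAVE
    if ((info[2] & need) != need)
        return false;

    // Only legal to execute once OSXSAVE is known to be set.
    return OsSavesYmmState(_xgetbv(0));
#elif defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;

    const unsigned int need = 0x10000000u | 0x8000000u; // AVX | OSXSAVE
    if ((ecx & need) != need)
        return false;

    // XGETBV with ECX = 0 reads XCR0 into EDX:EAX.
    unsigned int lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return OsSavesYmmState((static_cast<std::uint64_t>(hi) << 32) | lo);
#else
    return false; // not an x86 target
#endif
}
```

Note that XGETBV faults if OSXSAVE is clear, which is why the CPUID test has to come first.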

SSE and SSE2 are required for all processors capable of x64 native, so they are good baseline assumptions for all code. Windows 8.0, Windows 8.1, and Windows 10 explicitly require SSE and SSE2 support even for x86 architectures so those instruction sets are pretty ubiquitous. In other words, if you fail a check for SSE or SSE2, just exit the app with a fatal error.

#include <windows.h>

inline bool IsSSESupported()
{
#if defined(_M_IX86 ) || defined(_M_X64)
    return ( IsProcessorFeaturePresent( PF_XMMI_INSTRUCTIONS_AVAILABLE ) != 0
             && IsProcessorFeaturePresent( PF_XMMI64_INSTRUCTIONS_AVAILABLE ) != 0 );
#else
    return false;
#endif
}

-or-

#include <intrin.h>

inline bool IsSSESupported()
{
#if defined(_M_IX86 ) || defined(_M_X64)
    int CPUInfo[4] = {-1};
    __cpuid( CPUInfo, 0 );

    if ( CPUInfo[0] < 1 )
        return false;

    __cpuid(CPUInfo, 1 );

    int edx = 0x4000000 // SSE2
              | 0x2000000; // SSE

    if ( ( CPUInfo[3] & edx ) != edx )
        return false;

    return true;
#else
    return false;
#endif
}

Also, keep in mind that MMX, x87 FPU, and AMD 3DNow!* are all deprecated instruction sets for x64 native, so you shouldn't be using them actively anymore in newer code. A good rule of thumb is to avoid using any intrinsic that returns a __m64 or takes a __m64 data type.
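
To make the `__m64` rule of thumb concrete, here is a small sketch of doing 64 bits of work in the low half of an XMM register with SSE2 instead of MMX. The function name `Add4xInt16` and the `SKETCH_HAVE_SSE2` macro are hypothetical; the scalar branch is only there so the sketch still builds on targets without SSE2.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Adds four 16-bit lanes. The MMX form would be _mm_add_pi16 on __m64 values
// (plus an _mm_empty() afterwards); this version never touches __m64.
inline void Add4xInt16(const std::int16_t a[4], const std::int16_t b[4], std::int16_t out[4])
{
#if defined(SKETCH_HAVE_SSE2)
    // Load 64 bits into the low half of an XMM register; the upper half is zeroed.
    const __m128i va = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b));
    const __m128i sum = _mm_add_epi16(va, vb);
    // Store only the low 64 bits back out.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(out), sum);
#else
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<std::int16_t>(a[i] + b[i]);
#endif
}
```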

You may want to check out this DirectXMath blog series with notes on many of these instruction sets and the relevant processor support requirements.

Note (*) - All the AMD 3DNow! instructions are deprecated except for PREFETCH and PREFETCHW which were carried forward. First generation Intel64 processors lacked support for these instructions, but they were later added as they are considered part of the core X64 instruction set. Windows 8.1 and Windows 10 x64 require PREFETCHW in particular, although the test is a little odd. Most Intel CPUs prior to Broadwell do not in fact report support for PREFETCHW through CPUID, but they treat the opcode as a no-op rather than throw an 'illegal instruction' exception. As such, the test here is (a) is it supported by CPUID, and (b) if not, does PREFETCHW at least not throw an exception.

Here's some test code for Visual Studio that demonstrates the PREFETCHW test as well as many other CPUID bits for the x86 and x64 platforms.

#include <intrin.h>
#include <stdio.h>
#include <windows.h>
#include <excpt.h>

int main()
{
   unsigned int x = _mm_getcsr();
   printf("%08X\n", x );

   bool prefetchw = false;

   // See http://msdn.microsoft.com/en-us/library/hskdteyh.aspx
   int CPUInfo[4] = {-1};
   __cpuid( CPUInfo, 0 );

   if ( CPUInfo[0] > 0 )
   {
       __cpuid(CPUInfo, 1 );

       // EAX
       {
           int stepping = (CPUInfo[0] & 0xf);
           int basemodel = (CPUInfo[0] >> 4) & 0xf;
           int basefamily = (CPUInfo[0] >> 8) & 0xf;
           int xmodel = (CPUInfo[0] >> 16) & 0xf;
           int xfamily = (CPUInfo[0] >> 20) & 0xff;

           int family = basefamily + xfamily;
           int model = (xmodel << 4) | basemodel;

           printf("Family %02X, Model %02X, Stepping %u\n", family, model, stepping );
       }

       // ECX
       if ( CPUInfo[2] & 0x20000000 ) // bit 29
          printf("F16C\n");

       if ( CPUInfo[2] & 0x10000000 ) // bit 28
          printf("AVX\n");

       if ( CPUInfo[2] & 0x8000000 ) // bit 27
          printf("OSXSAVE\n");

       if ( CPUInfo[2] & 0x400000 ) // bit 22
          printf("MOVBE\n");

       if ( CPUInfo[2] & 0x100000 ) // bit 20
          printf("SSE4.2\n");

       if ( CPUInfo[2] & 0x80000 ) // bit 19
          printf("SSE4.1\n");

       if ( CPUInfo[2] & 0x2000 ) // bit 13
          printf("CMPXCHG16B\n");

       if ( CPUInfo[2] & 0x1000 ) // bit 12
          printf("FMA3\n");

       if ( CPUInfo[2] & 0x200 ) // bit 9
          printf("SSSE3\n");

       if ( CPUInfo[2] & 0x1 ) // bit 0
          printf("SSE3\n");

       // EDX
       if ( CPUInfo[3] & 0x4000000 ) // bit 26
           printf("SSE2\n");

       if ( CPUInfo[3] & 0x2000000 ) // bit 25
           printf("SSE\n");

       if ( CPUInfo[3] & 0x800000 ) // bit 23
           printf("MMX\n");
   }
   else
       printf("CPU doesn't support Feature Identifiers\n");

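   // CPUInfo[0] now holds leaf 1's EAX, so re-query leaf 0 for the highest standard leaf
   __cpuid( CPUInfo, 0 );

   if ( CPUInfo[0] >= 7 )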
   {
       __cpuidex(CPUInfo, 7, 0);

       // EBX
       if ( CPUInfo[1] & 0x100 ) // bit 8
         printf("BMI2\n");

       if ( CPUInfo[1] & 0x20 ) // bit 5
         printf("AVX2\n");

       if ( CPUInfo[1] & 0x8 ) // bit 3
         printf("BMI\n");
   }
   else
       printf("CPU doesn't support Structured Extended Feature Flags\n");

   // Extended features
   __cpuid( CPUInfo, 0x80000000 );

   if ( CPUInfo[0] > 0x80000000 )
   {
       __cpuid(CPUInfo, 0x80000001 );

       // ECX
       if ( CPUInfo[2] & 0x10000 ) // bit 16
           printf("FMA4\n");

       if ( CPUInfo[2] & 0x800 ) // bit 11
           printf("XOP\n");

       if ( CPUInfo[2] & 0x100 ) // bit 8
       {
           printf("PREFETCHW\n");
           prefetchw = true;
       }

       if ( CPUInfo[2] & 0x80 ) // bit 7
           printf("Misalign SSE\n");

       if ( CPUInfo[2] & 0x40 ) // bit 6
           printf("SSE4A\n");

       if ( CPUInfo[2] & 0x1 ) // bit 0
           printf("LAHF/SAHF\n");

       // EDX
       if ( CPUInfo[3] & 0x80000000 ) // bit 31
           printf("3DNow!\n");

       if ( CPUInfo[3] & 0x40000000 ) // bit 30
           printf("3DNowExt!\n");

       if ( CPUInfo[3] & 0x20000000 ) // bit 29
           printf("x64\n");

       if ( CPUInfo[3] & 0x100000 ) // bit 20
           printf("NX\n");
   }
   else
       printf("CPU doesn't support Extended Feature Identifiers\n");

   if ( !prefetchw )
   {
       bool illegal = false;

       __try
       {
           static const unsigned int s_data = 0xabcd0123;

           _m_prefetchw(&s_data);
       }
       __except (EXCEPTION_EXECUTE_HANDLER)
       {
           illegal = true;
       }

       if (illegal)
       {
           printf("PREFETCHW is an invalid instruction on this processor\n");
       }
   }
}

UPDATE: The fundamental challenge, of course, is how do you handle systems that lack support for AVX? While the instruction set is useful, the biggest benefit of having an AVX-capable processor is the ability to use the /arch:AVX build switch, which enables the global use of the VEX prefix for better SSE/SSE2 code-gen. The only problem is that the resulting DLL/EXE is not compatible with systems that lack AVX support.

As such, for Windows, ideally you should build one EXE for non-AVX systems (assuming SSE/SSE2 only so use /arch:SSE2 instead for x86 code; this setting is implicit for x64 code), a different EXE that is optimized for AVX (using /arch:AVX), and then use CPU detection to determine which EXE to use for a given system.

Luckily with Xbox One, we can just always build with /arch:AVX since it's a fixed platform...

UPDATE 2: For clang/LLVM (and GCC), you should use the slightly different CPUID intrinsics from <cpuid.h>:

#if defined(__clang__) || defined(__GNUC__)
    __cpuid(1, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
    __cpuid(CPUInfo, 1);
#endif

#if defined(__clang__) || defined(__GNUC__)
    __cpuid_count(7, 0, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
    __cpuidex(CPUInfo, 7, 0);
#endif
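
If you need both compiler families, one way to hide the difference is a small wrapper. This is only a sketch: `QueryCpuid`, `TestBit`, and the `SKETCH_X86` macro are hypothetical names, and I'm assuming `__get_cpuid_max` and `__cpuid_count` from `<cpuid.h>` on the clang/GCC side.

```cpp
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#define SKETCH_X86 1
#if defined(__clang__) || defined(__GNUC__)
#include <cpuid.h>
#else
#include <intrin.h>
#endif
#endif

// Hypothetical wrapper: fills regs with EAX, EBX, ECX, EDX for the given leaf/subleaf.
// Returns false (and leaves regs zeroed) if the leaf is not available.
inline bool QueryCpuid(unsigned int leaf, unsigned int subleaf, unsigned int regs[4])
{
    regs[0] = regs[1] = regs[2] = regs[3] = 0;
#if defined(SKETCH_X86) && (defined(__clang__) || defined(__GNUC__))
    // Highest supported leaf in the standard (0) or extended (0x80000000) range.
    if (__get_cpuid_max(leaf & 0x80000000u, nullptr) < leaf)
        return false;
    __cpuid_count(leaf, subleaf, regs[0], regs[1], regs[2], regs[3]);
    return true;
#elif defined(SKETCH_X86)
    int info[4] = {0};
    __cpuid(info, static_cast<int>(leaf & 0x80000000u));
    if (static_cast<unsigned int>(info[0]) < leaf)
        return false;
    __cpuidex(info, static_cast<int>(leaf), static_cast<int>(subleaf));
    for (int i = 0; i < 4; ++i)
        regs[i] = static_cast<unsigned int>(info[i]);
    return true;
#else
    (void)leaf;
    (void)subleaf;
    return false; // not an x86 target
#endif
}

// Tests a single feature bit in a register value returned by QueryCpuid.
inline bool TestBit(unsigned int reg, unsigned int bit)
{
    return ((reg >> bit) & 1u) != 0;
}
```

With that in place, the AVX2 test from the program above becomes `QueryCpuid(7, 0, r) && TestBit(r[1], 5)`.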

I don't think it's a good idea to make separate executable for different instruction sets. One executable using a [CPU dispatcher](https://stackoverflow.com/questions/23676426/disable-avx2-functions-on-non-haswell-processors/23677889#23677889) is more ideal in my opinion. — Z boson, May 21 '15 at 07:37
The ``/arch:AVX`` switch applies to an entire module, not just a function, but yes in theory you could create distinct cpp files for each function and compile with different build settings. — Chuck Walbourn, May 21 '15 at 15:33
The main issue is that using virtual functions (or function pointers) adds overhead, so it really depends on how much work you are doing in the 'dispatched' functions. This design was used by the original D3DXMath library. It was easy to optimize for specific CPUs and detect them at runtime, but the result loses a lot of performance for small operations. That is why DirectXMath for Windows uses only SSE and SSE2, so it can be aggressively inlined and has no 'guarded paths' or 'virtual functions' to use. — Chuck Walbourn, May 21 '15 at 15:34
I agree you might not want function pointers for functions that you would normally use static inline with. I'm not sure how to use a dispatcher in that case. In my link above someone mentioned using self modifying code with `memcpy`. That could solve the function pointer problem and still dispatch. I have not done that yet. — Z boson, May 22 '15 at 06:57
Any solution involving self-modifying code is problematic since that's basically indistinguishable from malware. You'd have to create a non-NX memory area and jump to it. — Chuck Walbourn, May 22 '15 at 14:06
Good point, I have not created self modifying code since 68k assembly days. In my case I use the dispatcher for functions where static inline is not useful. E.g. I put several static inline functions in one module and have only one function in the module which is external which I call with the dispatcher. — Z boson, May 22 '15 at 15:08
I'll wait for clang's ifunc support as in recent gcc: [Function Multi Versioning](https://gcc.gnu.org/wiki/FunctionMultiVersioning) — nonsensation, Jun 15 '15 at 22:30
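
For readers following the dispatcher discussion in these comments, here is a rough sketch of the function-pointer approach. Everything in it is hypothetical: `SumWide` is a plain C++ stand-in for a variant that would really live in its own .cpp built with /arch:AVX, and `avxUsable` is assumed to come from a CPUID/XGETBV check done once at startup.

```cpp
#include <cstddef>

using SumFn = int (*)(const int*, std::size_t);

// Baseline path, safe on every CPU the program supports.
int SumBaseline(const int* p, std::size_t n)
{
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += p[i];
    return total;
}

// Stand-in for the AVX-compiled variant. Written in portable C++ here so the
// sketch builds anywhere; it only mimics a wider loop with four accumulators.
int SumWide(const int* p, std::size_t n)
{
    int acc[4] = {0, 0, 0, 0};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t k = 0; k < 4; ++k)
            acc[k] += p[i + k];

    int total = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; ++i)
        total += p[i];
    return total;
}

// Choose the implementation once; callers keep the pointer and reuse it.
SumFn SelectSum(bool avxUsable)
{
    return avxUsable ? &SumWide : &SumBaseline;
}
```

The overhead concern raised above still applies: the indirect call only pays off when each dispatched call does a meaningful amount of work, such as a whole array rather than a single vector operation.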

score 4 · Answer 2 · answered May 20 '15 at 17:40

As a general rule - don't mix different generations of SSE / AVX unless you have to. If you do, make sure you use vzeroupper or similar state clearing instructions, otherwise you may drag partial values and unknowingly create false dependencies, since most of the registers are shared between the modes. Even when clearing, switching between modes may cause penalties, depending on the exact micro architecture.
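
As an illustration of the state-clearing advice, here is a sketch of an AVX routine that issues `_mm256_zeroupper()` before control returns to code that may use non-VEX SSE. The names `CanRunAvx`, `ScaleAvx`, `ScaleScalar`, `Scale`, and the `SKETCH_X86` macro are hypothetical, and the use of `__builtin_cpu_supports` and the `target("avx")` attribute on GCC/clang is my assumption about how you would build this without a global AVX switch.

```cpp
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif
#define SKETCH_X86 1
#endif

// Runtime guard so the AVX path is never executed on a CPU/OS without AVX.
inline bool CanRunAvx()
{
#if defined(SKETCH_X86) && defined(_MSC_VER)
    int r[4] = {0};
    __cpuid(r, 1);
    const int need = 0x10000000 | 0x8000000; // AVX | OSXSAVE
    return (r[2] & need) == need && (_xgetbv(0) & 0x6) == 0x6;
#elif defined(SKETCH_X86)
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;
#endif
}

#if defined(SKETCH_X86)
#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("avx")))
#endif
void ScaleAvx(float* data, std::size_t n, float k)
{
    const __m256 vk = _mm256_set1_ps(k);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(data + i, _mm256_mul_ps(_mm256_loadu_ps(data + i), vk));

    // Clear the upper YMM halves before any non-VEX SSE code can run
    // (the caller, or the scalar tail below when built without /arch:AVX).
    _mm256_zeroupper();

    for (; i < n; ++i)
        data[i] *= k;
}
#endif

void ScaleScalar(float* data, std::size_t n, float k)
{
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= k;
}

// Picks the AVX path only when the runtime check passes.
void Scale(float* data, std::size_t n, float k)
{
#if defined(SKETCH_X86)
    if (CanRunAvx())
    {
        ScaleAvx(data, n, k);
        return;
    }
#endif
    ScaleScalar(data, n, k);
}
```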

Further reading - https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

Leeor


To be fair, this mixing issue only occurs when you mix non-VEX code with VEX using code (i.e., mix AVX or AVX2 with earlier 2 operand SSE instructions). Other than that it's fine and necessary to mix between generations - necessary because each extension isn't a complete useful ISA by itself but builds on the last one. – BeeOnRope Nov 03 '16 at 13:03
@BeeOnRope, It's ok to mix, but you need to protect yourself from the issue I was talking about. See - http://stackoverflow.com/questions/7839925/using-avx-cpu-instructions-poor-performance-without-archavx – Leeor Nov 06 '16 at 17:50
Yeah exactly, but even that question is a bit unclear since it doesn't point out that almost all interesting SSE instructions have a VEX-encoded version - including 128-bit variants. Many people still loosely call those "SSE" instructions. For example, you'll hear that `pshufb` is an `SSSE3` instruction. Indeed an invocation like `pshufb xmm0, xmm1` creates a non-VEX encoding that would cause the performance problem you mention. Simply changing that to the identically behaved `vpshufb xmm0, xmm0, xmm1`, however, changes it to VEX-encoding and avoids the issue. – BeeOnRope Nov 06 '16 at 18:47

score 3 · Answer 3 · answered Nov 03 '16 at 04:28

See Chuck's answer for good advice on what you should do. See this answer for a literal answer to the question asked, in case you're curious.

AVX support absolutely guarantees support for all Intel SSE* instruction sets, since it includes VEX-encoded versions of all of them. As Chuck points out, you can check for previous ones at the same time with a bitmask, without bloating your code, but don't sweat it.

Note that POPCNT, TZCNT, and stuff like that are not part of SSE-anything. POPCNT has its own feature bit. LZCNT has its own feature bit, too, since AMD introduced it separately from BMI1. TZCNT is just part of BMI1, though. Since some BMI1 instructions use VEX encodings, even latest-generation Pentium/Celeron CPUs (like Skylake Pentium) don't have BMI1. :( I think Intel just wanted to omit AVX/AVX2, probably so they could sell CPUs with faulty upper-lanes of execution units as Pentiums, and they do this by disabling VEX support in the decoders.
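
To make the "own feature bit" point concrete, here is a small sketch that decodes the three relevant bits from register values you have already read with CPUID. The struct and function names are hypothetical; the bit positions are POPCNT in leaf 1 ECX bit 23, LZCNT in leaf 0x80000001 ECX bit 5, and BMI1 in leaf 7 (subleaf 0) EBX bit 3.

```cpp
#include <cstdint>

struct BitManipFeatures
{
    bool popcnt; // CPUID leaf 1, ECX bit 23
    bool lzcnt;  // CPUID leaf 0x80000001, ECX bit 5
    bool bmi1;   // CPUID leaf 7 subleaf 0, EBX bit 3 (TZCNT, ANDN, ...)
};

// Pure decoder: takes raw register values, performs no CPUID queries itself.
inline BitManipFeatures DecodeBitManip(std::uint32_t leaf1Ecx,
                                       std::uint32_t leaf7Ebx,
                                       std::uint32_t ext1Ecx)
{
    BitManipFeatures f;
    f.popcnt = (leaf1Ecx & (1u << 23)) != 0;
    f.lzcnt = (ext1Ecx & (1u << 5)) != 0;
    f.bmi1 = (leaf7Ebx & (1u << 3)) != 0;
    return f;
}
```

None of these can be inferred from an SSE or AVX bit, so each one needs its own test before you use the corresponding instruction.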

Intel SSE support has been incremental in all CPUs released so far. SSE4.1 implies SSSE3, SSE3, SSE2, and SSE. And SSE4.2 implies all of the preceding. I'm not sure if any official x86 documentation precludes the possibility of a CPU with SSE4.1 support but not SSSE3. (i.e. leave out PSHUFB, which is possibly expensive-to-implement.) It's extremely unlikely in practice, though, since this would violate many people's assumptions. As I said, it might even be officially forbidden, but I didn't check carefully.

AVX does not include AMD SSE4a or AMD XOP. AMD extensions have to be checked-for specially. Also note that the newest AMD CPUs are dropping XOP support. (Intel never adopted it, so most people don't write code to take advantage of it, so for AMD those transistors are mostly wasted. It does have some nice stuff, like a 2-source byte permute, allowing a byte LUT twice as wide as PSHUFB, without the in-lane limitation of AVX2's VPSHUFB ymm).

SSE2 is baseline for the x86-64 architecture. You do not have to check for SSE or SSE2 support in 64-bit builds. I forget if MMX is baseline, too. Almost certainly.

The SSE instruction set includes some instructions that operate on MMX registers. (e.g. PMAXSW mm1, mm2/m64 was new with SSE. The XMM version is part of SSE2.) Even a 32-bit CPU supporting SSE needs to have MMX registers. It would be madness to have MMX registers but only support the SSE instructions that use them, not the original MMX instructions (e.g. movq mm0, [mem]). However, I haven't found anything definitive that rules out the possibility of an x86-based Deathstation 9000 with SSE but not MMX CPUID feature bits, but I didn't wade into Intel's official x86 manuals. (See the x86 tag wiki for links).

Don't use MMX anyway, it's generally slower even if you only have 64 bits at a time to work on, in the low half of an XMM register. The latest CPUs (like Intel Skylake) have lower throughput for the MMX versions of some instructions than for the XMM version. In some cases, even worse latency. For example, according to Agner Fog's testing, PACKSSWB mm0, mm1 is 3 uops, with 2c latency, on Skylake. The 128b and 256b XMM / YMM versions are 1 uop, with 1c latency.
