
This is what I know about SIMD. Single-instruction-multiple-data is a way of processing data that performs the same instruction over vectors of multiple values. SIMD is implemented at different levels depending on the processor of the machine (SSE, SSE2, NEON...), and every level provides a different instruction set.

We can use these instruction sets by including immintrin.h. What I haven't really understood is: when actually developing something with SIMD, should we care about checking which instruction sets are supported? What are the best practices when developing such programs? What should we do if, for example, an instruction set is not supported: should we provide a non-SIMD alternative, or does the compiler unvectorise the whole thing for us?

phuclv
Giuppox
  • `immintrin.h` is specifically Intel, not ARM NEON. (And not AMD-specific intrinsics like SSE4a or XOP for Bulldozer-family). – Peter Cordes Dec 19 '21 at 11:44
  • @PeterCordes ow, I see, mine's Intel, that's why I never noticed. Is there a way to include AMD ones? – Giuppox Dec 19 '21 at 12:46
  • Yes, x86 compilers come with header files that include AMD stuff, but they're basically obsolete since AMD has dropped support for XOP in Zen. Modern AMD CPUs support AVX2+FMA3 and BMI / BMI2, I just meant that `immintrin.h` only defines intrinsics for extensions that Intel *introduced*, not that they couldn't run on AMD CPUs. – Peter Cordes Dec 19 '21 at 13:27

2 Answers

3

Of course we need to care about which ISA extensions are supported: if the program executes an instruction the CPU doesn't implement, it will be killed with an illegal-instruction fault (SIGILL on Unix-like systems). Checking also lets us optimize for each architecture. For example, on CPUs with AVX-512 we can use AVX-512 for better performance, while on an older CPU we can fall back to the version appropriate for that architecture.

What are the best practices when developing such programs?

There are no general best practices. It depends on the situation, because each compiler offers different tools for this:

  • If your compiler doesn't support dynamic dispatching, then you need to write separate code for each ISA and call the version that matches the current platform.
  • Some compilers automatically dispatch to the version optimized for the running platform. For example, ICC can compile a hot loop to separate SSE/AVX/AVX-512 versions and jump to the correct one for maximum performance.
  • Some other compilers can compile a single function to several versions and dispatch automatically, but you need to specify which function you want to optimize. For example, in GCC, Clang and ICC you can use the attributes target and target_clones. See Building backward compatible binaries with newer CPU instructions support
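
To illustrate the first bullet (separate code per ISA, chosen at run time), here is a minimal sketch for GCC and Clang on x86. The names `add_u32`, `add_u32_avx2`, `add_u32_scalar` and `pick_add_u32` are made up for this example. `__attribute__((target("avx2")))`, `__builtin_cpu_init` and `__builtin_cpu_supports` are the GCC/Clang facilities; with any other compiler or architecture the sketch simply uses the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable scalar version: always safe to call. */
static void add_u32_scalar(uint32_t *dst, const uint32_t *a,
                           const uint32_t *b, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i];
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* target("avx2") allows AVX2 code in this one function even when the
   rest of the file is compiled for the baseline ISA. */
__attribute__((target("avx2")))
static void add_u32_avx2(uint32_t *dst, const uint32_t *a,
                         const uint32_t *b, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {            /* 8 x 32-bit lanes per step */
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
    }
    for (; i < n; ++i) dst[i] = a[i] + b[i]; /* scalar tail */
}
#endif

typedef void (*add_u32_fn)(uint32_t *, const uint32_t *,
                           const uint32_t *, size_t);

/* Pick the best implementation the running CPU supports. */
static add_u32_fn pick_add_u32(void) {
#ifdef HAVE_AVX2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return add_u32_avx2;
#endif
    return add_u32_scalar;
}

/* Public entry point: resolves the implementation on first use. */
void add_u32(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n) {
    static add_u32_fn impl = NULL;
    assert(n == 0 || (dst && a && b));
    if (!impl) impl = pick_add_u32();
    impl(dst, a, b, n);
}
```

Callers only ever see `add_u32`. The lazy initialisation here is not synchronised, so a multithreaded program would probably want to resolve the pointer once at startup instead.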
phuclv
  • Hi, thanks, is there a way for checking which ISA is supported? – Giuppox Dec 19 '21 at 11:23
  • @Giuppox again it depends on the compiler and platform and also depending on compile-time or runtime check: [Determine processor support for SSE2?](https://stackoverflow.com/q/2403660/995714), [How to check if a CPU supports the SSE3 instruction set?](https://stackoverflow.com/q/6121792/995714), [Detect the availability of SSE/SSE2 instruction set in Visual Studio](https://stackoverflow.com/q/18563978/995714)... – phuclv Dec 19 '21 at 12:37
  • @Giuppox CPUs have an instruction to query supported instruction set extensions, `cpuid`. Unfortunately, compilers expose it slightly differently: the instruction is available as `__cpuid`, but Microsoft disagrees with gcc/clang about the prototype, so if you want to support multiple compilers you're going to need `#ifdef`. Or find a library, pretty sure there’re plenty. – Soonts Dec 19 '21 at 15:53
  • @Soonts hi, how do I access these values, are they macros? I couldn't find anything online. – Giuppox Dec 19 '21 at 16:15
  • @Giuppox VC++ https://learn.microsoft.com/en-us/cpp/intrinsics/cpuid-cpuidex?view=msvc-170 GCC and clang https://wiki.osdev.org/CPUID#Using_CPUID_from_GCC The meaning of these bits is on Wikipedia https://en.wikipedia.org/wiki/CPUID most of them are under “Processor Info and Feature Bits” and some newer ones are in “Extended Features” – Soonts Dec 19 '21 at 16:22
  • @Soonts I've found [this](https://sites.uclouvain.be/SystInfo/usr/include/cpuid.h.html) library that seems pretty good. However that inline assembly doesn't seem very portable. Is it safe to use? – Giuppox Dec 20 '21 at 09:52
  • @Giuppox Depends on your target platforms. If you only develop for Linux or OSX and building with gcc or clang, inline assembly is fine. If however you’re on windows and building with vc++, integrating assembly is PITA, in which case you better use the `__cpuid()` intrinsic. Or find another CPU feature detection library which uses intrinsics under the hood. – Soonts Dec 20 '21 at 09:56
  • @Soonts my library is supposed to work at least both on unix and windows, and should compile on every major compiler. Is there a way to achieve this safely? – Giuppox Dec 20 '21 at 10:04
  • @Giuppox I would use the `__cpuid` intrinsic. You only need `#ifdef _MSC_VER` to emit the cpuid instruction (and to include a different header defining that intrinsic). The code to check these bits in the output registers is compiler-agnostic. – Soonts Dec 20 '21 at 10:14
  • @Soonts doesn't that library handle it? It seems to be defining `__cpuid` if not present using inline assembly, otherwise it uses the intrinsic builtin one. – Giuppox Dec 20 '21 at 10:23
  • @Giuppox The source you have linked is part of the gcc and clang compilers, it even comes with a GPL license. It's not going to build on Windows with the vc++ compiler. On Linux, you should conditionally include that header for the implementation of `__cpuid`. On Windows, include `` instead, and use the different `__cpuid` intrinsic. Then check these bits. You can copy-paste pieces from there https://github.com/steinwurf/cpuid/tree/master/src/cpuid/detail however the complete library is not good IMO, way too complicated. – Soonts Dec 20 '21 at 11:43
  • @Soonts this is my implementation if you're interested https://github.com/Giuppox/block/blob/main/block/core/cpu.h – Giuppox Dec 21 '21 at 09:48
  • @Giuppox I think your VC++ version is not going to work. These 4 numbers are output parameters. You need to copy the result to the 4 input pointers, after the `__cpuid` intrinsic. You’re doing the opposite: copying the current values of the 4 numbers, and discarding the outputs of the `cpuid` instruction. – Soonts Dec 21 '21 at 11:21
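
The `#ifdef _MSC_VER` approach discussed in these comments could look roughly like the sketch below. The wrapper `cpuid_query` and the helpers `leaf1_bit`, `cpu_has_sse2` and `cpu_has_sse41` are hypothetical names for this example. `__cpuidex` is the MSVC intrinsic and `__get_cpuid_count` is the helper from the gcc/clang `cpuid.h`. On non-x86 targets the sketch reports no features. Note that on MSVC the results are copied out of the array after the intrinsic runs, which is the ordering issue raised in the last comment.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
/* MSVC: the intrinsic writes EAX, EBX, ECX, EDX into an int[4]. */
static void cpuid_query(unsigned leaf, unsigned subleaf, unsigned regs[4]) {
    int out[4];
    __cpuidex(out, (int)leaf, (int)subleaf);
    for (int i = 0; i < 4; ++i) regs[i] = (unsigned)out[i];
}
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
/* gcc/clang: the helper returns 0 when the leaf is not supported. */
static void cpuid_query(unsigned leaf, unsigned subleaf, unsigned regs[4]) {
    if (!__get_cpuid_count(leaf, subleaf,
                           &regs[0], &regs[1], &regs[2], &regs[3]))
        regs[0] = regs[1] = regs[2] = regs[3] = 0;
}
#else
/* Not x86 (or unknown compiler): report nothing. */
static void cpuid_query(unsigned leaf, unsigned subleaf, unsigned regs[4]) {
    (void)leaf; (void)subleaf;
    regs[0] = regs[1] = regs[2] = regs[3] = 0;
}
#endif

/* Compiler-agnostic part: test one bit of one output register of leaf 1.
   reg: 0 = EAX, 1 = EBX, 2 = ECX, 3 = EDX. */
static int leaf1_bit(unsigned reg, unsigned bit) {
    unsigned regs[4];
    assert(reg < 4 && bit < 32);
    cpuid_query(1, 0, regs);
    return (int)((regs[reg] >> bit) & 1u);
}

int cpu_has_sse2(void)  { return leaf1_bit(3, 26); } /* EDX bit 26 */
int cpu_has_sse41(void) { return leaf1_bit(2, 19); } /* ECX bit 19 */

/* AVX and later also need an OS-support check (OSXSAVE + XGETBV),
   which this sketch deliberately leaves out. */
```

Only `cpuid_query` differs per compiler; the bit tests are shared.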
1

should we care about checking which instruction sets are supported?

Usually yes, but not always. If you compile 64-bit code for PCs, you’re guaranteed to have SSE1 and SSE2: both are part of the baseline AMD64 instruction set, so every 64-bit x86 CPU supports them.
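
As a small sketch of relying on that baseline, the function below uses SSE2 intrinsics with only a compile-time check and no runtime detection. `sum_f64` and `BASELINE_SSE2` are names invented for this example. `__SSE2__` is the gcc/clang predefined macro, and `_M_X64` / `_M_IX86_FP` are the MSVC ones. On targets without SSE2 the same function falls back to a plain loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define BASELINE_SSE2 1
#endif

/* Sums n doubles, two lanes at a time when SSE2 is part of the
   compile-time baseline. */
double sum_f64(const double *x, size_t n) {
    size_t i = 0;
    double total = 0.0;
    assert(n == 0 || x != NULL);
#ifdef BASELINE_SSE2
    __m128d acc = _mm_setzero_pd();
    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(x + i));
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    total = lanes[0] + lanes[1];
#endif
    for (; i < n; ++i) total += x[i];   /* tail, or everything without SSE2 */
    return total;
}
```

Because the vector path adds in a different order than a scalar loop, results can differ in the last bits for general floating-point input.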

What are the best practices when developing such programs?

Negotiate with people about minimum hardware requirements for the software you’re working on. If you don’t have a boss, a client, or users, find some stats and try to make educated guesses. Steam publishes nice stats for PC gamers who have their software installed; expand “other settings” and you’ll see the percentage of users with each instruction set.

Personally, I think now in 2021 it’s generally OK to require SSE up to and including SSE 4.1, and fail at startup if not supported, assuming you do that gracefully: state it in the hardware requirements, and at runtime show a comprehensible error message to end users about the unsupported CPU.

should we provide a non-SIMD alternative

99% of new computers sold in the last decade have at least 4GB RAM and a 64-bit OS. I think for most projects it’s OK to only ship 64-bit binaries. This gives you SSE 1 and 2, so there is no need for scalar alternatives.

Sometimes, when I need to support SSE-only CPUs but AVX brings too much of a performance gain to ignore, I do implement a couple of alternatives with runtime dispatch.

Soonts