Cross-platform SIMD calls possible with only one executable?

Question

I recently became interested in SIMD optimization after returning to C++ following a long break. Please be descriptive, as I am still a beginner with SIMD instructions.

My question is: is it possible to compile one cross-platform executable in C++ that supports a variety of SIMD instruction sets and picks the best one to use at run time? By best I mean fastest; the most recent instruction sets usually perform better.

Example: I compile a game on Windows 10 with an i7-7700K and put it on Steam. Different users will very likely have different CPUs that support different SIMD instruction sets. When the game launches, the best available SIMD instruction set is detected and used.

Naturally, I would have to adapt my code and support a few hand-selected SIMD instruction sets.

If it is possible, do you know a good and efficient library that deals with wrapping and using transparently different SIMD instruction sets? — Benoît Dubreuil, Jun 27 '18 at 14:49
Possible duplicates: [Dispatching SIMD instructions + SIMDPP + qmake](https://stackoverflow.com/q/39484012/253056); [Use different class upon available CPU support](https://stackoverflow.com/q/27237650/253056); [x86 CPU Dispatching for SSE/AVX in C++](https://stackoverflow.com/q/4788592/253056) — Paul R, Jun 27 '18 at 14:55
It's not only possible, Intel and AMD both sell math libraries that do this (and the exact conditions under which each implementation is selected have been reverse-engineered by [people interested in questions like "Is the Intel math library slow on AMD processors because the AMD chips don't run the code as fast, or because Intel's code selection logic is broken?"](http://www.agner.org/optimize/blog/read.php?i=121)) — Ben Voigt, Jun 27 '18 at 15:27
I believe Agner Fog's VCL is supposed to do what you ask, but with C. I don't know about C++. — Simon Goater, Nov 17 '22 at 13:07
VCL will only do it at compile time. OpenCL can do it at runtime but the client needs to install OpenCL. To do it yourself, you probably just have to use CPUID and code for all architectures. — Simon Goater, Nov 17 '22 at 13:16

score 2 · Accepted Answer · answered Jun 28 '18 at 06:35

Generally the issue is at what level of granularity you want to use SIMD... Older math libraries like D3DXMath use an indirect jump (i.e. virtual methods) to select at run time a version of each function that is optimized for the detected instruction set. While this works in theory, the function has to do enough work to cover the overhead of the indirect call.

For example: if you call D3DXVec3Dot and it selects a different version for SSE/SSE2, SSE3, or SSE4.1, the cost of calling the function in the first place is most likely greater than the performance savings. To really benefit from this kind of optimization, you need larger-scale routines that do thousands of computations at once rather than micro-functions.
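To illustrate the coarse-grained dispatch described above, here is a minimal sketch, not a definitive implementation. It assumes an x86 target built with MSVC, GCC, or Clang and falls back to scalar code elsewhere. The names `sum_floats`, `sum_scalar`, `sum_avx`, `cpu_has_avx`, `select_sum_impl`, and the `DEMO_*` macros are hypothetical and exist only for this example. The CPU is queried once, and the indirect call is paid once per array rather than once per element.

```cpp
#include <cassert>
#include <cstddef>

// Work out at compile time whether x86 intrinsics are available at all.
#if defined(_MSC_VER) && !defined(__clang__) && (defined(_M_X64) || defined(_M_IX86))
  #include <intrin.h>     // __cpuid
  #include <immintrin.h>  // AVX intrinsics, _xgetbv
  #define DEMO_X86 1
  #define DEMO_TARGET_AVX  // MSVC accepts AVX intrinsics without /arch:AVX
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  #include <immintrin.h>
  #define DEMO_X86 1
  // Lets this one function use AVX even if the rest of the file is built for SSE2.
  #define DEMO_TARGET_AVX __attribute__((target("avx")))
#else
  #define DEMO_X86 0
#endif

// Baseline version: runs on every CPU.
float sum_scalar(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

#if DEMO_X86
// AVX version: adds eight floats per iteration, then handles the tail.
DEMO_TARGET_AVX float sum_avx(const float* data, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (float lane : lanes) total += lane;
    for (; i < n; ++i) total += data[i];
    return total;
}
#endif

// Run-time check: does this CPU (and the OS) allow AVX code to run?
bool cpu_has_avx() {
#if !DEMO_X86
    return false;
#elif defined(_MSC_VER) && !defined(__clang__)
    int regs[4];
    __cpuid(regs, 1);  // regs[2] is ECX for CPUID leaf 1
    const bool osxsave = (regs[2] & (1 << 27)) != 0;
    const bool avx     = (regs[2] & (1 << 28)) != 0;
    // XCR0 bits 1 and 2: the OS saves and restores XMM and YMM state.
    return osxsave && avx && ((_xgetbv(0) & 0x6) == 0x6);
#else
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#endif
}

using SumFn = float (*)(const float*, std::size_t);

// Pick the best implementation this CPU can run.
SumFn select_sum_impl() {
#if DEMO_X86
    if (cpu_has_avx()) return sum_avx;
#endif
    return sum_scalar;
}

// Public entry point: detection happens once, on the first call.
float sum_floats(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    static const SumFn impl = select_sum_impl();
    return impl(data, n);
}
```

Callers only ever use `sum_floats(ptr, count)`. Because each call processes a whole array, the indirect call is amortized over many elements, which is the opposite of dispatching inside a three-component dot product.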

Note that this overhead is why DirectXMath is an all-inline library that doesn't use indirect jumps or dispatch at all. You can count on SSE/SSE2 always being supported for x64, and it's basically always supported for x86. If you happen to be building an EXE/DLL for a platform that always has AVX (such as Xbox One), then use /arch:AVX and the DirectXMath library will use AVX, SSE4.1, SSE3, and SSE2/SSE where it makes sense. See this blog post series.
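As a rough illustration of the compile-time approach described above, the sketch below selects the code path with preprocessor checks, so the function stays inline and nothing is decided at run time. This is not DirectXMath's actual code. The names `dot4`, `simd_level`, and the `DEMO_*` macros are hypothetical. It relies on `__AVX__` being defined when the build enables AVX (for example with /arch:AVX), and on `_M_X64` or `_M_IX86_FP >= 2` implying SSE2 under MSVC.

```cpp
// Choose the instruction set from the compiler's build flags.
#if defined(__AVX__)
  #include <immintrin.h>
  #define DEMO_USE_SSE 1
  #define DEMO_SIMD_NAME "AVX"
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #include <emmintrin.h>
  #define DEMO_USE_SSE 1
  #define DEMO_SIMD_NAME "SSE2"
#else
  #define DEMO_USE_SSE 0
  #define DEMO_SIMD_NAME "scalar"
#endif

// Reports which path this binary was compiled with.
inline const char* simd_level() { return DEMO_SIMD_NAME; }

// Four-component dot product; small enough that it must be inlined to pay off.
inline float dot4(const float a[4], const float b[4]) {
#if DEMO_USE_SSE
    const __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    // Swap neighbouring lanes and add: lane 0 = p0+p1, lane 2 = p2+p3.
    __m128 shuf = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(p, shuf);
    // Bring the upper pair sum down to lane 0 and add it in.
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
#else
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

The trade-off is that a binary built this way requires the chosen instruction set on every machine it runs on, so the flags have to match the minimum CPU you intend to support.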

Ohhh ok so it's feasible dynamically. Thank you for the links too! Your post gave me some inspirations to other problems. — Benoît Dubreuil, Jun 28 '18 at 15:58
