SSE set of instructions for crossplatform

Question

I want to write math functions with SSE instructions in VS2017. I could try something like this:

#include <xmmintrin.h> // SSE: __m128, _mm_add_ps

__m128 addWithIntrinsics(__m128 a, __m128 b)
{
    __m128 r = _mm_add_ps(a, b);
    return r;
}

__m128 addWithAssembly(__m128 a, __m128 b)
{
    __m128 r;
    __asm
    {
        movaps xmm0, xmmword ptr[a]
        movaps xmm1, xmmword ptr[b]
        addps xmm0, xmm1
        movaps xmmword ptr[r], xmm0
    }
    return r;
}

But I’m not sure... If I write mathematical operations like this, will the code be portable (I only target Windows, but on different processors, including ones that do not support SSE)? Or will I need to determine at compile time whether the processor supports these instructions and, if not, fall back to ordinary code? What is the best way to do this, and which of my two variants is preferable?

MSVC inline asm is total garbage for this. You don't want your data to be stored/reloaded between every operation! See [What is the difference between 'asm', '\_\_asm' and '\_\_asm\_\_'?](//stackoverflow.com/q/3323445) Intrinsics are by *far* the best choice for x86 SIMD. — Peter Cordes, Feb 26 '19 at 08:21
Also re inline asm in MSVC: it's not supported on x64, so you would be limiting yourself to 32 bit code. — Paul R, Feb 26 '19 at 09:12
@Paul R, are intrinsics supported on x64? I also need the application to work on x64. How should I deal with this? – QuickDzen, Feb 26 '19 at 09:21
Of course intrinsics are supported on x86-64. That's by far the best way to use SSE/AVX from C, because compilers are fairly good at optimizing them. — Peter Cordes, Feb 26 '19 at 18:34
As I understand it, SSE is used for the float data type? Or can it also be used successfully for other data types? – QuickDzen, Mar 06 '19 at 06:08

user1118321 · Accepted Answer · 2019-02-26T06:30:50.763

2

If you want to be able to run on processors without SSE, you'll need to write both versions – with and without. You'll need to check at runtime whether the current machine you're running on supports SSE and use the appropriate function based on the result.
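A minimal sketch of that runtime check, assuming the CPUID leaf 1 feature bit for SSE (EDX bit 25) and a function-pointer dispatch. The names `cpuHasSse`, `add4Scalar`, `add4Sse`, and `selectAdd4` are hypothetical and not from the question. With GCC or Clang on 32-bit x86 the SSE function would also need `-msse` (or a per-function target attribute), and the scalar path only stays SSE-free if the compiler itself is not told to target SSE.

```cpp
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, ...
#if defined(_MSC_VER)
#include <intrin.h>      // __cpuid (MSVC)
#else
#include <cpuid.h>       // __get_cpuid (GCC/Clang)
#endif

// Returns true if CPUID leaf 1 reports SSE support (EDX bit 25).
bool cpuHasSse()
{
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[3] & (1 << 25)) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                    // leaf 1 not available
    return (edx & (1u << 25)) != 0;
#endif
}

// Plain C++ fallback: adds four floats element by element.
void add4Scalar(const float* a, const float* b, float* out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

// SSE version: one ADDPS on four packed floats (unaligned loads/stores).
void add4Sse(const float* a, const float* b, float* out)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

using Add4Fn = void (*)(const float*, const float*, float*);

// Pick the implementation once; callers keep the returned pointer.
Add4Fn selectAdd4()
{
    return cpuHasSse() ? add4Sse : add4Scalar;
}
```

Calling `selectAdd4()` once at startup and storing the result keeps the check out of hot loops, so the cost on any machine is one indirect call per operation rather than a CPUID query each time.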

As for which is better – that's a matter of taste. I prefer to program in C++ so I'd prefer the intrinsics version. But if you work with a bunch of assembly programmers, they'd probably prefer the assembly version.

edited Feb 26 '19 at 06:30

answered Feb 26 '19 at 06:28

user1118321


2

Relevant questions: [How to check if a CPU supports the SSE3 instruction set?](https://stackoverflow.com/questions/6121792/how-to-check-if-a-cpu-supports-the-sse3-instruction-set) and [Detect the availability of SSE/SSE2 instruction set in Visual Studio](https://stackoverflow.com/questions/18563978/detect-the-availability-of-sse-sse2-instruction-set-in-visual-studio). – Daniel Langr Feb 26 '19 at 06:29
1

Thanks for the info! I added a link to the first one inline in my answer. – user1118321 Feb 26 '19 at 06:31
So if the processor does not support SSE, performance on those processors will only drop because of the extra runtime checks? And in general, what percentage of all processors do not support SSE? – QuickDzen Feb 26 '19 at 06:40
1

@QuickDzen: If the CPU doesn't support SSE3, then maybe it's ARMv8 and has its own SIMD extensions. The only way to make it portable is to avoid "vendor specific non-standards" (e.g. Intel's intrinsics) and write pure C++ code (and hope the compiler's optimizer can auto-vectorize, and then be disappointed when you find out it can't). The only other alternative is to embrace non-portable code (e.g. write multiple versions of non-portable code guarded by conditional selection). – Brendan Feb 26 '19 at 07:58
2

@QuickDzen: Worth mentioning that SSE2 is baseline for x86-64, so simply making 64-bit binaries avoids the need to check if SSE2 is a useful-enough baseline. – Peter Cordes Feb 26 '19 at 08:23
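The compile-time side of this can be sketched with predefined compiler macros: `_M_X64` and `_M_IX86_FP` on MSVC, `__x86_64__` and `__SSE2__` on GCC/Clang. The macro `HAVE_SSE2_BASELINE` and the function `add2Doubles` are hypothetical names used only for illustration. When the target guarantees SSE2, as any x86-64 build does, no runtime check is needed and the scalar branch is simply not compiled.

```cpp
// Decide at compile time whether SSE2 is guaranteed by the build target.
#if defined(_M_X64) || defined(__x86_64__) || defined(__SSE2__) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define HAVE_SSE2_BASELINE 1
#include <emmintrin.h>   // SSE2 intrinsics: __m128d, _mm_add_pd, ...
#else
#define HAVE_SSE2_BASELINE 0
#endif

// Adds two pairs of doubles; uses one ADDPD when SSE2 is the baseline,
// otherwise falls back to plain C++.
void add2Doubles(const double* a, const double* b, double* out)
{
#if HAVE_SSE2_BASELINE
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
#else
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
#endif
}
```

This only covers what the build target guarantees. Extensions above the baseline (SSE3, AVX, and so on) would still need a runtime check before use.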
2

@user1118321: **No, it's not a matter of taste. The MSVC inline asm version is total garbage**, and will force the compiler to store/reload your data between every `addps`, defeats many other optimizations, and forces register allocation to use hard-coded XMM0 and XMM1, making it impossible for it to have independent ADDPS operations happening without a lot of extra MOVAPS instructions. Or since nothing is being kept in registers anyway because of memory input/output, that last point is actually not an issue... – Peter Cordes Feb 26 '19 at 08:27
1

But yeah I'd expect probably one or 2 extra store/reloads of the data even if this was just used to loop over an array doing `a[i] += b[i]`, without doing anything where you *want* data to stay in registers for a chain of operations. Also, MSVC inline asm only works in 32-bit mode. It's so bad they dropped it for 64-bit. (Apparently the internal implementation in MSVC was nasty and broke easily.) – Peter Cordes Feb 26 '19 at 08:27
@PeterCordes Good point! I didn't mean to imply that specific code in the question was good (or bad), just that in general, whether you use assembly or intrinsics depends on a number of factors. It probably doesn't make sense to have just an add function by itself in any case. I would personally never try to write assembly, but others may be better at it than me and my compiler for their specific use case. – user1118321 Feb 26 '19 at 16:42
MSVC's simplistic inline asm design makes it impossible for inline asm to be good for short blocks, because of bouncing inputs through memory. (GNU C inline asm doesn't have this problem: [What is the difference between 'asm', '\_\_asm' and '\_\_asm\_\_'?](//stackoverflow.com/q/3323445), but I still definitely recommend and use intrinsics even though I *do* know how to write asm that's at least as optimal as the compiler's, and sometimes have to tweak the C source to get the compiler to make better asm. Future CPUs will be different, and asm defeats constant-propagation and so on.) – Peter Cordes Feb 26 '19 at 18:39
MSVC inline asm overhead is not a big deal if you write a whole loop in inline asm, though. Compilers are generally good at x86 SIMD intrinsics (unlike ARM compilers currently), so there's no reason to do it unless you're writing a whole loop to work around a compiler missed-optimization. But at that point you might just write a whole function in asm. – Peter Cordes Feb 26 '19 at 18:40
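To illustrate the point about keeping data in registers across a chain of operations, here is a hedged sketch of a loop written with intrinsics. The function `addThenScale` is a hypothetical example, not code from the question. The intermediate sum is an `__m128` value that the compiler is free to keep in an XMM register between the add and the multiply, which is exactly what an MSVC `__asm` block per operation would prevent.

```cpp
#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

// Computes a[i] = (a[i] + b[i]) * scale for n floats.
void addThenScale(float* a, const float* b, float scale, std::size_t n)
{
    const __m128 vscale = _mm_set1_ps(scale);   // broadcast scale to 4 lanes
    std::size_t i = 0;

    // Main loop: four floats per iteration, unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        __m128 sum = _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        _mm_storeu_ps(a + i, _mm_mul_ps(sum, vscale));  // sum is never stored
    }

    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i)
        a[i] = (a[i] + b[i]) * scale;
}
```

The same source compiles for both 32-bit and 64-bit targets, since it does not depend on inline asm.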
