You are really asking two different questions:
(1) How does the compiler decide where to put my SIMD variables? In memory or in a register?
(2) How specific is the 'contract' for an intrinsic? Does it always emit a specific instruction?
The answer to the first question is really no different for SIMD than for any other kind of variable. In C/C++, you usually use automatic variables because those are the most likely to end up in a register. The compiler is free to schedule the actual instructions and register usage based on the context, and will often spill data from registers to 'stack memory' and reload it depending on how much 'register pressure' there is in the code.
This flexibility is a "good thing" compared to writing it in assembly, where you, the programmer, decide exactly which registers are used when and exactly what order the instructions are executed in. Often the compiler can mix in other nearby code or do other optimizations that are difficult to keep straight by hand, and it can take advantage of architecture differences. For example, in DirectXMath I have written the same intrinsics code for both x86 (32-bit) and x64 (64-bit), and the compiler can make use of the 8 extra registers available in x64. If I were using inline assembly, I'd have to write it two different ways, and probably more than that given some additional differences I'll come to shortly.
When writing SIMD code, you really want to maximize the work done on data already in a register, because the load/store overhead to memory often costs as much performance as you gain from doing a few SIMD instructions vs. scalar. As such, you will usually write SIMD intrinsics to do an explicit load into a bunch of 'automatic variables', but keep in mind that likely only 8 or so of them are really going to be in a register at a time. You do want to do enough work that the compiler can fill in the gaps. You then store the result to memory. In particular, you really don't do stuff like `auto a = new __m128d;`. There's also the additional complexity of the implied alignment (`__m128d` must be 16-byte aligned, and while x64 `new` does that, x86 `new` does not).
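A minimal sketch of that load/compute/store pattern (assuming SSE-capable hardware, 16-byte aligned arrays, and a length that is a multiple of 4; the function and parameter names are just for illustration):

```cpp
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Adds two float arrays four elements at a time.
// Assumes a, b, and out are 16-byte aligned and count is a multiple of 4.
void AddArrays(const float* a, const float* b, float* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; i += 4)
    {
        // Explicit loads into automatic variables; the compiler decides
        // which of these actually live in XMM registers vs. stack memory.
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);

        // Do the work while the data is (hopefully) in registers.
        __m128 vsum = _mm_add_ps(va, vb);

        // Explicit store of the result back to memory.
        _mm_store_ps(&out[i], vsum);
    }
}
```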
The second answer is a bit more complicated. A given intrinsic is usually defined as a given instruction, and some intrinsics are really combos of instructions, but the compiler may choose to use some knowledge of the target platform when picking the exact instruction. Here are a few examples:
`__m128 _mm_add_ps (__m128 a, __m128 b)` is defined as the SSE instruction `addps` and is often emitted as such. But if you are building with `/arch:AVX` or `/arch:AVX2`, the compiler will use the VEX prefix and emit the instruction `vaddps` instead.
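For example (the exact register allocation depends on the calling convention and surrounding code, so the commented assembly is only indicative):

```cpp
#include <xmmintrin.h>

__m128 Add(__m128 a, __m128 b)
{
    // Typical SSE target:           addps  xmm0, xmm1
    // With /arch:AVX or /arch:AVX2: vaddps xmm0, xmm0, xmm1
    return _mm_add_ps(a, b);
}
```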
`__m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)` is defined as an FMA3 instruction, but the compiler can actually emit `vfmadd132pd`, `vfmadd213pd`, or `vfmadd231pd` depending on the exact register use. In fact, the compiler can even decide it's faster to use a `vmulpd` followed by a `vaddpd`, which computes the same thing, depending on the instruction timings in the hardware cost functions it is using.
Note that it is certainly possible for the compiler implementer to decide, say, that they could optimize `__m128 _mm_shuffle_ps (__m128 a, __m128 b, unsigned int imm8)` in the case where a and b are the same register and emit a `vpermilps` instead of a `shufps` if you are building with `/arch:AVX`. That would still be 'in contract' with the intrinsic. In practice, however, intrinsics tend to be treated a bit specially and strongly prefer the instruction they are defined as, because you often use them in particular contexts based on hardware feature detection. So you can normally count on a particular intrinsic ending up as the instruction you expect or a very close variant of it.
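Such a 'same register' case looks like this self-shuffle (the splat helper here is just an illustration, not part of any library):

```cpp
#include <xmmintrin.h>

// Both sources are the same value, so on an AVX target a compiler could
// legally emit vpermilps here instead of shufps/vshufps and still honor
// the intrinsic's contract.
__m128 SplatX(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}
```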
So in short, all of C/C++ is a 'hint' to the compiler in the sense that the source code describes the exact computation you want, but the compiler is free to emit code that achieves the same result in a different order or with different instructions than the ones you might assume.
The Intel Intrinsics Guide is a good resource for exploring intrinsics.
You might also find some of my blog posts related to intrinsics useful.
The DirectXMath Programmer's Guide also has some useful tricks & tips for intrinsics usage sprinkled throughout so it's worth a read and it's only 6 pages so it won't take that long. See Microsoft Docs