Yes, that's perfectly normal; intrinsic functions work like pure functions that take numeric args by value, just as they're defined and documented.
If there were any asm quirks to worry about, the compiler would take care of them; that's the benefit of using intrinsics in C instead of hand-written asm.
In this case there aren't any problems: with `vfmadd231ps ymm0, ymm1, ymm2`, the compiler can make the addend the destination of the FMA, as opposed to `vfmadd132ps` or `vfmadd213ps`, where one of the multiplicands is the input/output register.
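For instance, in a hypothetical reduction loop like this (made-up names, compiled with FMA enabled, e.g. `-mfma`), the accumulator update maps onto one `vfmadd231ps` per iteration with no extra moves:

```c
#include <immintrin.h>
#include <stddef.h>

// Hypothetical sketch: the update res = a*b + res lets the compiler pick
// vfmadd231ps, which overwrites the addend in place, so the loop body
// needs no extra register copies.
__m256 dot8(const float *x, const float *y, size_t n) {
    __m256 res = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(x + i);
        __m256 b = _mm256_loadu_ps(y + i);
        res = _mm256_fmadd_ps(a, b, res);  // one vfmadd231ps per iteration
    }
    return res;
}
```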
I would recommend `__m256 _res = _mm256_setzero_ps();` instead of an empty brace initializer. Or of course `__m256 res = _mm256_mul_ps(A1, B1)` so the compiler doesn't have to materialize a `0.0f` vector and FMA into it.
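A minimal sketch of that (keeping the `A1`/`B1` names from above; `A2`, `B2`, `A3`, `B3` are hypothetical further inputs):

```c
#include <immintrin.h>

// Hypothetical sketch: seeding the accumulator with the first product
// means the compiler never has to create a zeroed vector just to FMA into it.
__m256 sum_of_products(__m256 A1, __m256 B1, __m256 A2, __m256 B2,
                       __m256 A3, __m256 B3) {
    __m256 res = _mm256_mul_ps(A1, B1);   // vmulps
    res = _mm256_fmadd_ps(A2, B2, res);   // vfmadd231ps
    res = _mm256_fmadd_ps(A3, B3, res);   // vfmadd231ps
    return res;
}
```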
Also, don't use leading underscores in your own local var names. It's not good style, and it's very close to names that are reserved for the implementation: `_res` is reserved at global scope, and `_Res` would be reserved everywhere and might even be a macro.
Where the compiler does need to do extra work is if you weren't updating one of the inputs to become the result, and you used the original value of all 3 inputs later. Then the compiler would need an extra register-copy instruction (`vmovaps`) to keep the original value around after `vfmadd` replaces one of its input registers with the result.
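For example, something like this hypothetical function (names made up) forces that extra copy, because all three originals are still live after the FMA:

```c
#include <immintrin.h>

// Hypothetical sketch: a, b, and c are all read again after the FMA, so
// the compiler has to emit a vmovaps to preserve whichever input register
// the destructive vfmadd instruction overwrites with the result.
__m256 use_all_inputs_later(__m256 a, __m256 b, __m256 c) {
    __m256 r = _mm256_fmadd_ps(a, b, c);               // vmovaps + vfmadd...ps
    __m256 s = _mm256_add_ps(_mm256_add_ps(a, b), c);  // originals still needed
    return _mm256_sub_ps(r, s);
}
```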
(The 3 forms give the compiler a choice of whether to replace the addend or a multiplicand, and if a multiplicand, whether the last operand, which can be memory or register, is the addend or another multiplicand. `vfmadd213ps op1, op2, op3` does `fma(op2, op1, op3)`, hence the numbering scheme.)
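As a quick reference, a sketch of the three operand orders (`dst` is also the first source operand):

```asm
vfmadd132ps dst, s2, s3    ; dst = dst*s3 + s2   (replaces a multiplicand)
vfmadd213ps dst, s2, s3    ; dst = s2*dst + s3   (replaces a multiplicand)
vfmadd231ps dst, s2, s3    ; dst = s2*s3 + dst   (replaces the addend)
```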
Another case where the compiler would need an extra register-copy would be compiling `v = _mm_andnot_ps(mask, v)` without AVX, so it can only use 2-operand instructions like `andnps xmm0, xmm1`, which replace xmm0 with the result. It's not commutative, so if we want to avoid destroying `mask`, we need to copy it first. And/or if we need `v` in the same register as before (e.g. because we're in a tight loop that we're not unrolling), we'd also need a `movaps` for that.
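A hypothetical loop where this matters (made-up names, compiled without AVX):

```c
#include <xmmintrin.h>
#include <stddef.h>

// Hypothetical sketch: andnps destroys its first operand, so keeping
// `mask` alive across iterations costs a register copy every time,
// something like: movaps xmm2, xmm_mask / andnps xmm2, xmm_v
void apply_mask(float *p, size_t n, __m128 mask) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(p + i);
        v = _mm_andnot_ps(mask, v);   // (~mask) & v, not commutative
        _mm_storeu_ps(p + i, v);
    }
}
```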
The only reason to think about asm details like this is to minimize register-copy instructions, e.g. for cases like this, or in your choice of shuffles, as in my answer *Fastest way to do horizontal SSE vector sum (or other reduction)*.
For example, SSE2 `pshufd` is a copy-and-shuffle, but `punpckhqdq` is a 2-input shuffle so it replaces its first operand. If you want to broadcast the top qword of an integer vector and add, `hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,2,3,2))` can avoid a `movdqa` vs. `hi = _mm_unpackhi_epi64(v,v)` if you're compiling without AVX.
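A sketch of that broadcast-and-add (hypothetical function name):

```c
#include <emmintrin.h>

// Hypothetical sketch: broadcast the top qword and add. pshufd is a
// copy-and-shuffle, so `v` survives without a movdqa even when compiling
// without AVX; the punpckhqdq alternative would need that extra copy.
__m128i add_high_to_low(__m128i v) {
    __m128i hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 3, 2));  // pshufd
    // __m128i hi = _mm_unpackhi_epi64(v, v);  // punpckhqdq: movdqa first
    return _mm_add_epi32(v, hi);
}
```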
Note that what really matters is the pattern of data dependencies, not whether you invent a new temporary and then assign it; that makes no difference to an optimizing compiler. `res = fma(a,b,res)` is handled internally by the compiler like `res2 = fma(a,b,res1)` when it converts the program logic into SSA form, which all the major optimizing ahead-of-time compilers use internally, e.g. LLVM-IR (clang) or GIMPLE (GCC).
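For example, these two hypothetical functions compile identically:

```c
#include <immintrin.h>

// Hypothetical sketch: both functions become the same SSA form internally,
// so an optimizing compiler emits identical asm for them.
__m256 reuse_name(__m256 a, __m256 b, __m256 res) {
    res = _mm256_fmadd_ps(a, b, res);           // reassign the same variable
    return res;
}

__m256 new_temporary(__m256 a, __m256 b, __m256 res1) {
    __m256 res2 = _mm256_fmadd_ps(a, b, res1);  // invent a fresh temporary
    return res2;
}
```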