Yes, that's perfectly normal; intrinsic functions work like pure functions that take numeric args by value, just as they're defined and documented.
If there were any asm quirks to worry about, the compiler would take care of them; that's the benefit of using intrinsics in C instead of hand-written asm.
In this case there aren't any problems: with `vfmadd231ps ymm0, ymm1, ymm2`, the compiler can make the addend the destination of the FMA, as opposed to `vfmadd132ps` or `vfmadd213ps`, where one of the multiplicands is the input/output register.
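For instance, in a hypothetical reduction loop like this (made-up names, compiled with FMA enabled, e.g. `-mfma`), the accumulator update maps onto one `vfmadd231ps` per iteration with no extra moves:

```c
#include <immintrin.h>
#include <stddef.h>

// Hypothetical sketch: the update res = a*b + res lets the compiler pick
// vfmadd231ps, which overwrites the addend in place, so the loop body
// needs no extra register copies.
__m256 dot8(const float *x, const float *y, size_t n) {
    __m256 res = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(x + i);
        __m256 b = _mm256_loadu_ps(y + i);
        res = _mm256_fmadd_ps(a, b, res);  // one vfmadd231ps per iteration
    }
    return res;
}
```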
I would recommend `__m256 _res = _mm256_setzero_ps();` instead of an empty brace initializer. Or of course `__m256 res = _mm256_mul_ps(A1, B1)` so the compiler doesn't have to materialize a `0.0f` vector and FMA into it.
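A minimal sketch of that (keeping the `A1`/`B1` names from above; `A2`, `B2`, `A3`, `B3` are hypothetical further inputs):

```c
#include <immintrin.h>

// Hypothetical sketch: seeding the accumulator with the first product
// means the compiler never has to create a zeroed vector just to FMA into it.
__m256 sum_of_products(__m256 A1, __m256 B1, __m256 A2, __m256 B2,
                       __m256 A3, __m256 B3) {
    __m256 res = _mm256_mul_ps(A1, B1);   // vmulps
    res = _mm256_fmadd_ps(A2, B2, res);   // vfmadd231ps
    res = _mm256_fmadd_ps(A3, B3, res);   // vfmadd231ps
    return res;
}
```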
Also, don't use leading underscores in your own local var names. It's not good style, and it's very close to names that are reserved for the implementation: `_res` is reserved at global scope, and `_Res` would be reserved everywhere and might even be a macro.
Where the compiler does need to do extra work is if you weren't updating one of the inputs to become the result, and you used the original value of all 3 inputs later. Then the compiler would need an extra register-copy instruction (`vmovaps`) to keep the original value around after `vfmadd` replaces one of its input registers with the result.
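For example, something like this hypothetical function (names made up) forces that extra copy, because all three originals are still live after the FMA:

```c
#include <immintrin.h>

// Hypothetical sketch: a, b, and c are all read again after the FMA, so
// the compiler has to emit a vmovaps to preserve whichever input register
// the destructive vfmadd instruction overwrites with the result.
__m256 use_all_inputs_later(__m256 a, __m256 b, __m256 c) {
    __m256 r = _mm256_fmadd_ps(a, b, c);               // vmovaps + vfmadd...ps
    __m256 s = _mm256_add_ps(_mm256_add_ps(a, b), c);  // originals still needed
    return _mm256_sub_ps(r, s);
}
```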
(The 3 forms give the compiler a choice of whether to replace the addend or a multiplicand, and if a multiplicand, whether the last operand, which can be memory or register, is the addend or another multiplicand. `vfmadd213ps op1, op2, op3` does `fma(op2, op1, op3)`, hence the numbering scheme.)
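As a quick reference, a sketch of the three operand orders (`dst` is also the first source operand):

```asm
vfmadd132ps dst, s2, s3    ; dst = dst*s3 + s2   (replaces a multiplicand)
vfmadd213ps dst, s2, s3    ; dst = s2*dst + s3   (replaces a multiplicand)
vfmadd231ps dst, s2, s3    ; dst = s2*s3 + dst   (replaces the addend)
```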
Another case where the compiler would need an extra register-copy would be compiling `v = _mm_andnot_ps(mask, v)` without AVX, so it can only use 2-operand instructions like `andnps xmm0, xmm1`, which replace xmm0 with the result. It's not commutative, so if we want to avoid destroying `mask`, we need to copy it first. And/or if we need `v` in the same register as before (e.g. because we're in a tight loop that we're not unrolling), we'd also need a `movaps` for that.
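A hypothetical loop where this matters (made-up names, compiled without AVX):

```c
#include <xmmintrin.h>
#include <stddef.h>

// Hypothetical sketch: andnps destroys its first operand, so keeping
// `mask` alive across iterations costs a register copy every time,
// something like: movaps xmm2, xmm_mask / andnps xmm2, xmm_v
void apply_mask(float *p, size_t n, __m128 mask) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(p + i);
        v = _mm_andnot_ps(mask, v);   // (~mask) & v, not commutative
        _mm_storeu_ps(p + i, v);
    }
}
```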
The only reason to think about asm details like this is to minimize register-copy instructions, e.g. for cases like this, or in your choice of shuffles, as in my answer *Fastest way to do horizontal SSE vector sum (or other reduction)*.
For example, SSE2 `pshufd` is a copy-and-shuffle, but `punpckhqdq` is a 2-input shuffle so it replaces its first operand. If you want to broadcast the top qword of an integer vector and add, `hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,2,3,2))` can avoid a `movdqa` vs. `hi = _mm_unpackhi_epi64(v,v)` if you're compiling without AVX.
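A sketch of that broadcast-and-add (hypothetical function name):

```c
#include <emmintrin.h>

// Hypothetical sketch: broadcast the top qword and add. pshufd is a
// copy-and-shuffle, so `v` survives without a movdqa even when compiling
// without AVX; the punpckhqdq alternative would need that extra copy.
__m128i add_high_to_low(__m128i v) {
    __m128i hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 3, 2));  // pshufd
    // __m128i hi = _mm_unpackhi_epi64(v, v);  // punpckhqdq: movdqa first
    return _mm_add_epi32(v, hi);
}
```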
Note that what really matters is the pattern of data dependencies, not whether you invent a new temporary and then assign it; that makes no difference to an optimizing compiler. `res = fma(a,b,res)` is handled internally by the compiler like `res2 = fma(a,b,res1)` when it converts the program logic into SSA form, which all the major optimizing ahead-of-time compilers use internally, e.g. LLVM-IR (clang) or GIMPLE (GCC).
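For example, these two hypothetical functions compile identically:

```c
#include <immintrin.h>

// Hypothetical sketch: both functions become the same SSA form internally,
// so an optimizing compiler emits identical asm for them.
__m256 reuse_name(__m256 a, __m256 b, __m256 res) {
    res = _mm256_fmadd_ps(a, b, res);           // reassign the same variable
    return res;
}

__m256 new_temporary(__m256 a, __m256 b, __m256 res1) {
    __m256 res2 = _mm256_fmadd_ps(a, b, res1);  // invent a fresh temporary
    return res2;
}
```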