I'm using Visual Studio 2015, building x64 code, and working with pixels stored as vectors of four floats in ABGR order, i.e. with Alpha (opacity) in the most significant element and Blue, Green, and Red in the lower three elements.
I'm trying to work up a PreMultiplyAlpha routine that will inline/__vectorcall to do an efficient job of premultiplying the alpha into Blue, Green, and Red and leave the Alpha value set to 1.0f when done.
The actual multiplication is no problem. This propagates the Alpha across all four elements then multiplies them all.
__m128 Alpha = _mm_shuffle_ps(Pixel, Pixel, _MM_SHUFFLE(3, 3, 3, 3));
__m128 ReturnPixel = _mm_mul_ps(Pixel, Alpha);
With the above the alpha is multiplied into all the colors with a minimum of instructions:
shufps xmm1, xmm0, 255 ; 000000ffH
mulps xmm1, xmm0
It's a great start, right?
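For anyone who wants to reproduce the multiply step on its own, here is a minimal self-contained sketch. The helper name MultiplyByAlpha is hypothetical, and I've left off __vectorcall so it also builds with compilers other than VC++.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_shuffle_ps, _mm_mul_ps

// Hypothetical helper name. Lane order follows the post:
// lane 0 = Blue, lane 1 = Green, lane 2 = Red, lane 3 = Alpha.
static inline __m128 MultiplyByAlpha(__m128 Pixel)
{
    // Broadcast lane 3 (Alpha) into all four lanes.
    __m128 Alpha = _mm_shuffle_ps(Pixel, Pixel, _MM_SHUFFLE(3, 3, 3, 3));

    // Result is (B*A, G*A, R*A, A*A); the top lane still holds A*A,
    // which is why a separate step is needed to force it to 1.0f.
    return _mm_mul_ps(Pixel, Alpha);
}
```

Note that the top lane comes back as Alpha squared, not Alpha, so it has to be overwritten rather than just left alone.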
Then I hit a brick wall... I've not discovered a direct way, or even a tricky way, to do what seems like it should be a reasonably simple operation: efficiently setting the most significant element (Alpha) to 1.0f. Maybe I just have a blind spot.
The most obvious method causes VC++ 2015 to generate machine code that does two 128-bit memory accesses:
ReturnPixel.m128_f32[ALPHA] = 1.0f;
The above generates code like this, which saves the whole pixel on the stack, overwrites the Alpha, then loads it back from the stack:
movaps XMMWORD PTR ReturnPixel$1[rsp], xmm1
mov DWORD PTR ReturnPixel$1[rsp+12], 1065353216 ; 3f800000H
movaps xmm1, XMMWORD PTR ReturnPixel$1[rsp]
I'm a big fan of keeping the code as straightforward as possible for human maintainers to understand, but this particular routine is used a lot and needs to be made optimally fast.
Other things I've tried seem to lead the compiler to generate more instructions (and especially memory accesses) than should be necessary...
This attempt swaps the Alpha element into the least significant element, replaces it with 1.0f, then swaps it back. It's pretty good, but it does fetch a single 32-bit 1.0f from a memory location.
ReturnPixel = _mm_shuffle_ps(ReturnPixel, ReturnPixel, _MM_SHUFFLE(0, 2, 1, 3));
ReturnPixel = _mm_move_ss(ReturnPixel, _mm_set_ss(1.0f));
ReturnPixel = _mm_shuffle_ps(ReturnPixel, ReturnPixel, _MM_SHUFFLE(0, 2, 1, 3));
With that I got these instructions:
movss xmm0, DWORD PTR __real@3f800000
movaps xmm1, xmm0
shufps xmm2, xmm2, 39 ; 00000027H
movss xmm2, xmm1
shufps xmm2, xmm2, 39
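Put together as one routine, the swap/replace/swap attempt looks roughly like the sketch below. The function name PreMultiplyAlphaSwap is hypothetical, and __vectorcall is omitted so the sketch is not tied to VC++.

```cpp
#include <xmmintrin.h>  // SSE: shuffles, _mm_move_ss, _mm_set_ss

// Hypothetical name. Lane order: 0 = Blue, 1 = Green, 2 = Red, 3 = Alpha.
static inline __m128 PreMultiplyAlphaSwap(__m128 Pixel)
{
    // Broadcast Alpha and multiply: (B*A, G*A, R*A, A*A).
    __m128 Alpha = _mm_shuffle_ps(Pixel, Pixel, _MM_SHUFFLE(3, 3, 3, 3));
    __m128 ReturnPixel = _mm_mul_ps(Pixel, Alpha);

    // _MM_SHUFFLE(0, 2, 1, 3) exchanges lanes 0 and 3 and leaves 1 and 2 alone.
    ReturnPixel = _mm_shuffle_ps(ReturnPixel, ReturnPixel, _MM_SHUFFLE(0, 2, 1, 3));

    // Overwrite lane 0 (currently A*A) with 1.0f; this is where the
    // 32-bit constant load comes from.
    ReturnPixel = _mm_move_ss(ReturnPixel, _mm_set_ss(1.0f));

    // Swap back so 1.0f ends up in the Alpha lane.
    return _mm_shuffle_ps(ReturnPixel, ReturnPixel, _MM_SHUFFLE(0, 2, 1, 3));
}
```

The same shuffle constant works in both directions because exchanging lanes 0 and 3 is its own inverse.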
Any ideas how to leave 1.0f in the A field (most significant element) with a minimum of instructions and, ideally, no additional memory accesses beyond what's fetched from the instruction stream? I even thought about dividing the vector by itself to get 1.0f in all positions, but I'm allergic to divides as they're inefficient to say the least, and 0/0 would give NaN for any zero element...
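One direction that may be worth measuring, sketched here with SSE2 only and not verified against VC++ 2015 output: build 1.0f from register operations (all-ones bits, shifted right by 25 and then left by 23, gives 0x3F800000), then merge it into the top lane with an unpack and a move. The names OnesFromRegisters and PreMultiplyAlphaNoLoad are hypothetical, and it is an assumption that the compiler keeps the bit trick as instructions rather than folding it back into a constant load, so the listing would need checking.

```cpp
#include <emmintrin.h>  // SSE2: integer compares/shifts, casts; pulls in SSE too

// Hypothetical helper: 1.0f in every lane without a memory constant.
static inline __m128 OnesFromRegisters()
{
    __m128i Zero = _mm_setzero_si128();
    __m128i AllBits = _mm_cmpeq_epi32(Zero, Zero);   // 0xFFFFFFFF per lane
    // 0xFFFFFFFF >> 25 = 0x7F, and 0x7F << 23 = 0x3F800000, the bits of 1.0f.
    __m128i OneBits = _mm_slli_epi32(_mm_srli_epi32(AllBits, 25), 23);
    return _mm_castsi128_ps(OneBits);
}

// Hypothetical name. Lane order: 0 = Blue, 1 = Green, 2 = Red, 3 = Alpha.
static inline __m128 PreMultiplyAlphaNoLoad(__m128 Pixel)
{
    __m128 Alpha = _mm_shuffle_ps(Pixel, Pixel, _MM_SHUFFLE(3, 3, 3, 3));
    __m128 Product = _mm_mul_ps(Pixel, Alpha);       // (B*A, G*A, R*A, A*A)

    // unpackhi interleaves the high halves: (R*A, 1.0f, A*A, 1.0f).
    __m128 High = _mm_unpackhi_ps(Product, OnesFromRegisters());

    // movelh keeps the low half of Product and appends the low half of High:
    // (B*A, G*A, R*A, 1.0f).
    return _mm_movelh_ps(Product, High);
}
```

Whether this beats the single 32-bit load in practice is an open question; the constant-building instructions do not depend on the pixel, so they could in principle be hoisted out of a loop once the routine is inlined.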
Thanks in advance for your ideas. :-)
-Noel