
I'm using Visual Studio 2015, building x64 code, and working with floating point vectors of four ABGR pixel values, i.e. with the Alpha (opacity) in the most significant position and Blue, Green, and Red numbers in the lower three positions.

I'm trying to work up a PreMultiplyAlpha routine that will inline/__vectorcall to do an efficient job of premultiplying the alpha into Blue, Green, and Red and leave the Alpha value set to 1.0f when done.

The actual multiplication is no problem. This broadcasts the Alpha across all four elements, then multiplies them all:

__m128 Alpha = _mm_shuffle_ps(Pixel, Pixel, _MM_SHUFFLE(3, 3, 3, 3));
__m128 ReturnPixel = _mm_mul_ps(Pixel, Alpha);

With the above the alpha is multiplied into all the colors with a minimum of instructions:

shufps  xmm1, xmm0, 255             ; 000000ffH
mulps   xmm1, xmm0

It's a great start, right?

Then I hit a brick wall... I've not discovered a direct way - or even a tricky way - to do what seems like it should be a reasonably simple act: efficiently setting the most significant element (Alpha) to 1.0f. Maybe I just have a blind spot.

The most obvious method causes VC++ 2015 to create machine code that does two 128 bit memory accesses:

ReturnPixel.m128_f32[ALPHA] = 1.0f;

The above generates code like this, which saves the whole pixel on the stack, overwrites the Alpha, then loads it back from the stack:

movaps  XMMWORD PTR ReturnPixel$1[rsp], xmm1
mov     DWORD PTR ReturnPixel$1[rsp+12], 1065353216 ; 3f800000H
movaps  xmm1, XMMWORD PTR ReturnPixel$1[rsp]

I'm a big fan of keeping the code as straightforward as possible for human maintainers to understand, but this particular routine is used a lot and needs to be made optimally fast.

Other things I've tried seem to lead the compiler to generate more instructions (and especially memory accesses) than should be necessary...

This attempts to shuffle the Alpha into the least significant element, replace it with 1.0f, then shuffle it back. It's pretty good, but it does fetch a single 32 bit 1.0f from a memory location.

ReturnPixel = _mm_shuffle_ps(ReturnPixel, ReturnPixel, _MM_SHUFFLE(0, 2, 1, 3));
ReturnPixel = _mm_move_ss(ReturnPixel, _mm_set_ss(1.0f));
ReturnPixel = _mm_shuffle_ps(ReturnPixel, ReturnPixel, _MM_SHUFFLE(0, 2, 1, 3));

With that I got these instructions:

movss   xmm0, DWORD PTR __real@3f800000
movaps  xmm1, xmm0
shufps  xmm2, xmm2, 39              ; 00000027H
movss   xmm2, xmm1
shufps  xmm2, xmm2, 39

Any ideas how to leave 1.0f in the A field (most significant element) with a minimum of instructions and ideally no additional memory accesses beyond what's fetched from the instruction stream? I even thought about dividing the vector by itself to achieve 1.0f in all positions, but I'm allergic to divides as they're inefficient to say the least...

Thanks in advance for your ideas. :-)

-Noel

NoelC

  • What if you do `&` with a bitmask e.g. `11111110000000` then `|` with whatever would represent a float of 1? – wally Apr 29 '16 at 16:23
  • Assuming SSE4.1 then take a look at [`_mm_insert_ps`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=insert_ps&expand=2936)? (Oops - just saw the SSE2 requirement - that's a bit harsh...) – Paul R Apr 29 '16 at 17:00
  • @PaulR: `_mm_blend_ps` is actually the most efficient way. `insertps` can only run on the shuffle port. But with SSE2, @flatmouse's suggestion appears to be the best. Two insns instead of one, and great as long as your compiler can hoist the setup of constants out of the loop. – Peter Cordes Apr 29 '16 at 20:56

2 Answers


The 1.0f constant has to come from somewhere, so it has to either be loaded or generated on the fly. There's no SSE equivalent of fld1, and compilers usually go for the fewest instructions, even at the risk of a D-cache miss, rather than something like mov eax, 0x3f800000 / movd xmm0, eax. (See Agner Fog's Optimizing Assembly, section 13.4, for a table of sequences; generating 1.0 takes 3 insns.)
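For reference, that 3-instruction sequence can be written with plain SSE2 intrinsics; a minimal sketch (the helper name is mine, the comments describe what compilers typically emit):

// Generate 1.0f in every element without touching memory
// (Agner Fog's pcmpeqd / pslld 25 / psrld 2 sequence).
static inline __m128 ones_1p0(void)
{
    __m128i x = _mm_set1_epi32(-1);   // compilers emit pcmpeqd xmm,xmm
    x = _mm_slli_epi32(x, 25);        // each element: 0xFE000000
    x = _mm_srli_epi32(x, 2);         // each element: 0x3F800000 = 1.0f
    return _mm_castsi128_ps(x);
}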


There is no single SSE/SSE2 instruction that can replace a 32b element of a vector (other than movss for the low element). SSE4.1 introduced insertps and pinsrd. Using two pinsrw instructions to set 16b at a time is unlikely to be the best option, especially if you want to feed that vector into an FP computation.
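For completeness, a sketch of that two-pinsrw route (SSE2 only, reusing the question's ReturnPixel; not recommended, just to show the shape):

// Set the alpha element to 1.0f via two 16-bit inserts.
// 1.0f = 0x3F800000: high word 0x3F80, low word 0x0000.
__m128i vi = _mm_castps_si128(ReturnPixel);
vi = _mm_insert_epi16(vi, 0x0000, 6);   // low 16b of element 3
vi = _mm_insert_epi16(vi, 0x3F80, 7);   // high 16b of element 3
ReturnPixel = _mm_castsi128_ps(vi);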

If you want to store it, then probably two overlapping stores are best: store the 16B vector with the wrong data, then store a 1.0. A smart compiler would in theory compile it to shufps-broadcast / mulps / movaps [mem], xmm1 / mov [mem+12], 0x3f800000. If you do a vector load right away from [mem], though, you'll cause a store-forwarding stall. (another ~10 cycles of latency above the normal ~5c for a store/reload round trip on typical uarches)
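A minimal sketch of that overlapping-store version (the function name and pointer argument are mine):

// Store the whole vector with the wrong alpha, then overwrite the
// alpha with a scalar store. dst must be 16-byte aligned for movaps.
void premul_store(float *dst, __m128 Pixel)
{
    __m128 Alpha = _mm_shuffle_ps(Pixel, Pixel, _MM_SHUFFLE(3, 3, 3, 3));
    _mm_store_ps(dst, _mm_mul_ps(Pixel, Alpha));  // 16B store, alpha wrong
    dst[3] = 1.0f;                                // 4B store fixes alpha
}
// Reloading dst as a 16B vector right away causes the stall described above.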


Dealing with constants

Since you're processing pixels, I assume that means this happens in a loop with many iterations. That means we're optimizing for efficiency in the loop, even if that means some extra setup outside the loop.

A good compiler will hoist constants out of loops after inlining, so it should be fine to factor the operation out into a function that uses _mm_set_ps or _mm_set1_ps for its constants. You should check the asm, though; MSVC doesn't always manage to do this, so you may have to inline and hoist manually.


In registers, in preparation for further FP ops

The overlapping-store option is not viable if we want to keep using the vector while we have it in regs. (Which we should: we can still do this cheaply enough that it doesn't justify a separate loop over the data to apply alphas).

The cheapest option to replace the high element is blendps (_mm_blend_ps). Blends with immediate control operands are extremely efficient on the SSE4.1 and later CPUs that support them: 1c latency, and they can run on multiple execution ports on SnB and later, so they don't tend to create bottlenecks on specific execution ports. (Variable blends are more expensive.) insertps (_mm_insert_ps) is more powerful (e.g. it can zero selected elements in the dest, and pick from any element in the src), but it requires the shuffle port.

Without SSE4.1, our best option is probably just two instructions: mask off the high element with an AND, then OR in the 1.0f from a vector of [ 1.0 0 0 0 ]. The IEEE representation of 0.0f is all-zeros, so the OR can't affect the low elements.

andps and orps both only run on port5 (which competes with shufps) on Intel Nehalem to Broadwell. Skylake runs them on p015, same as pand and por. If throughput turns out to be the bottleneck, not latency, consider using integer instructions instead (casting to __m128i). It's only an extra 1 cycle of bypass delay (Intel SnB-family) when using the output of por as an input to addps or something.

#include <immintrin.h>  // intrinsics; the SSE4.1 path needs it enabled (e.g. -msse4.1)

__m128 apply_alpha(__m128 Pixel) {
    __m128 Alpha = _mm_shuffle_ps(Pixel, Pixel, _MM_SHUFFLE(3, 3, 3, 3));
    __m128 Multiplied = _mm_mul_ps(Pixel, Alpha);
#ifdef __SSE4_1__
    // blendps imm8 is cheaper (runs on more ports) than insertps on Intel SnB-family
    __m128 Alpha_Reset = _mm_blend_ps(Multiplied, _mm_set1_ps(1.0), 1<<3);
#else
    // emulate the blend with AND/OR
    const __m128 zeroalpha_mask = _mm_castsi128_ps( _mm_set_epi32(0,~0,~0,~0) );  // could be generated with pcmpeqw / psrldq 4
    __m128 Alpha_Reset = _mm_and_ps(Multiplied, zeroalpha_mask);
    const __m128 alpha_one = _mm_set_ps(1.0, 0, 0, 0);
    Alpha_Reset = _mm_or_ps(Alpha_Reset, alpha_one);
#endif
    return Alpha_Reset;
}
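If that port-5 contention does turn out to matter, here's a sketch of the integer-domain variant mentioned above (same constants, just cast through __m128i so pand/por get used):

// Same AND/OR, but in the integer domain (sketch).
// Costs ~1c of bypass latency feeding later FP math on SnB-family.
__m128i masked = _mm_and_si128(_mm_castps_si128(Multiplied),
                               _mm_set_epi32(0, ~0, ~0, ~0));
__m128i withone = _mm_or_si128(masked,
                               _mm_castps_si128(_mm_set_ps(1.0f, 0, 0, 0)));
__m128 Alpha_Reset = _mm_castsi128_ps(withone);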

Calling apply_alpha in a loop works great with gcc: it sets up all its constants in registers outside the loop, so inside the loop there's just a load, some register ops, and a store.

See the source for my test loop on the Godbolt Compiler Explorer. You can also tack on -march=haswell to enable all the instruction sets it supports, including -msse4.1, and see that the blendps version compiles, too.
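In case the link rots, a hypothetical reconstruction consistent with the asm below (the 160000-byte bound and the extra addps match; the bias constant's value is a guess):

// Test loop sketch: 10000 pixels * 16B = 160000 bytes.
void loop(__m128 *px)
{
    const __m128 bias = _mm_set1_ps(0.5f);   // stand-in for .LC3
    for (int i = 0; i < 10000; i++)
        px[i] = _mm_add_ps(apply_alpha(px[i]), bias);
}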

loop(float __vector(4)*):
    movaps  xmm4, XMMWORD PTR .LC0[rip] # setup of constants hoisted out of the loop
    lea     rax, [rdi+160000]
    movaps  xmm3, XMMWORD PTR .LC1[rip]
    movaps  xmm2, XMMWORD PTR .LC3[rip]
.L3:
    movaps  xmm1, XMMWORD PTR [rdi]
    add     rdi, 16
    # apply_alpha inlined beginning here
    movaps  xmm0, xmm1                 # This is the insn you forgot to include in the question, for your shufps broadcast without AVX.  It's unavoidable, but still counts
    shufps  xmm0, xmm1, 255
    mulps   xmm0, xmm1
    andps   xmm0, xmm4
    orps    xmm0, xmm3
    # and ends here
    addps   xmm0, xmm2                 # extra add outside of apply_alpha, otherwise a scalar store to set alpha may be better
    movaps  XMMWORD PTR [rdi-16], xmm0
    cmp     rax, rdi
    jne     .L3
    ret

Extending this to 256b vectors is also easy: still use blendps with a constant twice as wide to do 2 pixels at once.
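A hedged AVX sketch of that (untested adaptation; vshufps and vblendps both operate within each 128b lane, which is exactly what we want here):

__m256 apply_alpha_x2(__m256 Pixels)   // two ABGR pixels per vector
{
    // broadcast each pixel's alpha within its own 128b lane
    __m256 Alpha = _mm256_shuffle_ps(Pixels, Pixels, _MM_SHUFFLE(3, 3, 3, 3));
    __m256 Multiplied = _mm256_mul_ps(Pixels, Alpha);
    // blend control 0x88: take elements 3 and 7 (the alphas) from the 1.0 vector
    return _mm256_blend_ps(Multiplied, _mm256_set1_ps(1.0f), 0x88);
}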

Peter Cordes

With thanks to all who responded, we settled on a solution that does just one 128 bit memory access, instead of the three memory accesses of the straightforward code I originally listed:

//  Ensures the result of the multiply leaves a 0 in Alpha.
__m128 ABGZ = _mm_move_ss(Pixel, _mm_setzero_ps());
__m128 ZAAA = _mm_shuffle_ps(ABGZ, ABGZ, _MM_SHUFFLE(0, 3, 3, 3));
__m128 ReturnPixel = _mm_mul_ps(Pixel, ZAAA);
ReturnPixel = _mm_or_ps(ReturnPixel, _mm_set_ps(1.0f, 0, 0, 0));

This generates the following code:

xorps   xmm1, xmm1
movss   xmm2, xmm1
shufps  xmm2, xmm2, 63              ; 0000003fH
mulps   xmm2, xmm0
orps    xmm2, XMMWORD PTR __xmm@3f800000000000000000000000000000

I had hoped for a solution that might generate 1.0f programmatically and keep this code all register work. Oh well. That 128 bit value will no doubt be cached.
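(For reference, Agner Fog's all-ones/shift trick from the first answer can be extended with a byte shift to build [ 1.0 0 0 0 ] entirely in registers; a sketch, four instructions, worthwhile only if they can be hoisted out of the loop:)

__m128i t = _mm_set1_epi32(-1);                  // pcmpeqd xmm,xmm
t = _mm_srli_epi32(_mm_slli_epi32(t, 25), 2);    // 1.0f in every element
__m128 alpha_one = _mm_castsi128_ps(_mm_slli_si128(t, 12)); // keep only element 3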

One day in the future we'll revisit this when we move the product up to a minimum support level of SSE4.1.

-Noel

NoelC

  • That's pretty good: you avoid the `andps` and its constant with a `movss xmm,xmm` from a constant that compilers are smart enough to gen on the fly. Keep in mind that `movss` between regs can only run on the shuffle port (port5 on Haswell onwards). If your code bottlenecks on shuffle throughput, then consider using `andps`. If your compiler can hoist the constant loading out of your loop, it's fine. – Peter Cordes May 01 '16 at 21:13