How to make premultiplied alpha function faster using SIMD instructions?

Question

I'm looking for some SSE/AVX advice to optimize a routine that premultiplies the RGB channels of each pixel by its alpha channel: RGB * alpha / 255 (the original alpha channel is kept).

    for (int i = 0, max = width * height * 4; i < max; i+=4) {
        data[i] = static_cast<uint16_t>(data[i] * data[i+3]) / 255;
        data[i+1] = static_cast<uint16_t>(data[i+1] * data[i+3]) / 255;
        data[i+2] = static_cast<uint16_t>(data[i+2] * data[i+3]) / 255;
    }

You will find my current implementation below, but I think it could be much faster and that I'm wasting precious CPU cycles. I tested it with quick-bench.com and the results are encouraging, but what should I change to make it blazing fast?

Thanks

-------- UPDATED 09/06/2019 --------

Based on the comments from @chtz and @Peter Cordes, I put together a repository to assess the different solutions. Here are the results. Do you think it can be better?

Run on (8 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)
Load Average: 1.24, 1.60, 1.68
-----------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
v1_plain_mean         1189884 ns      1189573 ns         1000 itr=840.865/s
v1_plain_median       1184059 ns      1183786 ns         1000 itr=844.747/s
v1_plain_stddev         20575 ns        20166 ns         1000 itr=13.4227/s

v1_simd_x86_mean       297866 ns       297784 ns         1000 itr=3.3616k/s
v1_simd_x86_median     294995 ns       294927 ns         1000 itr=3.39067k/s
v1_simd_x86_stddev       9863 ns         9794 ns         1000 itr=105.51/s

Thanks Dot and Beached (discord #include)
v2_plain_mean          323541 ns       323451 ns         1000 itr=3.09678k/s
v2_plain_median        318932 ns       318855 ns         1000 itr=3.13623k/s
v2_plain_stddev         13598 ns        13542 ns         1000 itr=122.588/s

Thanks Peter Cordes (stackoverflow)
v3_simd_x86_mean       264323 ns       264247 ns         1000 itr=3.79233k/s
v3_simd_x86_median     260641 ns       260560 ns         1000 itr=3.83788k/s
v3_simd_x86_stddev      12466 ns        12422 ns         1000 itr=170.36/s

Thanks chtz (stackoverflow)
v4_simd_x86_mean       266174 ns       266109 ns         1000 itr=3.76502k/s
v4_simd_x86_median     262940 ns       262916 ns         1000 itr=3.8035k/s
v4_simd_x86_stddev      11993 ns        11962 ns         1000 itr=159.906/s

-------- UPDATED 10/06/2019 --------

I added the AVX2 version and used chtz's tip. By using 255 as the alpha value in color_odd, I was able to remove _mm_blendv_epi8 and improve the benchmark.

Thanks Peter and chtz

v3_simd_x86_mean       246171 ns       246107 ns          100 itr=4.06517k/s
v3_simd_x86_median     245191 ns       245167 ns          100 itr=4.07885k/s
v3_simd_x86_stddev       5423 ns         5406 ns          100 itr=87.13/s

// AVX2
v5_simd_x86_mean       158456 ns       158409 ns          100 itr=6.31411k/s
v5_simd_x86_median     158248 ns       158165 ns          100 itr=6.3225k/s
v5_simd_x86_stddev       2340 ns         2329 ns          100 itr=92.1406/s

What compiler and compiler options did you use to compile the code? What CPU did you use to run the code? What is the minimum and maximum size of the input array? What do `width` and `height` mean? It would also be useful to provide an [MCVE](https://stackoverflow.com/help/minimal-reproducible-example). — Hadi Brais, Jun 03 '19 at 16:30
@HadiBrais this algorithm is used for images, so width and height are the dimensions of the image to transform. For the chart, I used the website http://quick-bench.com/ using clang7.0 with all default options: c++20, O3, libstdc++. — Mathieu Garaud, Jun 03 '19 at 16:49
Just to be clear, even though the godbolt link you provided uses Clang 8.0 with the `-O3 -march=haswell` options, you've actually used Clang 7.0 without `-march=haswell` to produce the results, right? I have no idea what CPU `quick-bench.com` uses to run experiments, what the time unit for the y-axis is, or what size of data array you used to produce the results. All of that information matters. — Hadi Brais, Jun 03 '19 at 17:01
See also this [related question](https://stackoverflow.com/q/35285324), and the answers below it. — wim, Jun 04 '19 at 14:04
With AVX2 available, you should of course port any implementation to AVX2 ... — chtz, Jun 10 '19 at 06:22
Your test cases are insufficient; make sure you have alpha and RGB values such that `r & a` != `r * a / 255U`, e.g. use randomly generated 32-bit pixel values, like from a SIMD `xorshift128+` generator to fill an array quickly. The operator precedence of `&` is quite low: if you look at the generated asm on Godbolt, you'll see that adding parens like `p >> 8 & (0xFFU * a / 255U);` doesn't change the asm for your `plain` test case. `0xFF * a / 255` = `a` so that part optimizes away, and you're doing `g = (p>>8) & a;` which is obviously much cheaper. — Peter Cordes, Jun 10 '19 at 12:05
Plus you're letting the compiler auto-vectorize with AVX2 256-bit vectors. You should be comparing against 256-bit manual vectorization if AVX2 is ok. If you parenthesize it properly, clang does some crazy shuffling to pack down to 16-bit and expand again with `vpmovzxwd` for each component, doing `vpmaddwd` for the alpha multiply and the division with `vpmulhuw` + shift. https://godbolt.org/z/2QeWBq — Peter Cordes, Jun 10 '19 at 12:09
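
To illustrate the precedence point from the comments above, here is a small sketch. The function names are hypothetical, the pixel layout (alpha in the top byte, green in bits 8..15) is an assumption, and the first expression is reconstructed from the comment rather than copied from the benchmarked code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Expression in the shape quoted in the comment. `*`, `/` and `>>` all bind
// tighter than `&`, so this parses as (p >> 8) & ((0xFFU * a) / 255U),
// which simplifies to (p >> 8) & a.
constexpr std::uint32_t green_as_parsed(std::uint32_t p, std::uint32_t a) {
    return p >> 8 & 0xFFU * a / 255U;
}

// What a premultiply needs: extract the channel first, then scale it.
constexpr std::uint32_t green_intended(std::uint32_t p, std::uint32_t a) {
    return ((p >> 8) & 0xFFU) * a / 255U;
}

// Counts pixels on which the two expressions disagree (assumed layout: alpha
// in the top byte). A useful test set should make this non-zero.
inline std::size_t count_mismatches(const std::uint32_t* pixels, std::size_t n) {
    assert(pixels != nullptr || n == 0);
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint32_t a = pixels[i] >> 24;
        if (green_as_parsed(pixels[i], a) != green_intended(pixels[i], a)) ++mismatches;
    }
    return mismatches;
}
```

Fully opaque (alpha 255) and fully transparent (alpha 0) pixels give the same result for both expressions, which is why a test image made only of such pixels would not catch the problem.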

Peter Cordes · Answer 1 · 2019-06-04T17:46:56.670

If you can use SSSE3, _mm_shuffle_epi8 lets you create the __m128i alpha vector, instead of AND/shift/OR.

pshufb will zero bytes where the high bit of the shuffle-control vector element is set. (Shuffle throughput is easily a bottleneck on Intel Haswell and later, so using immediate shifts or AND is still good for the other operations where you can get it done with one instruction.)
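
As a rough sketch of that idea (not the code from the question): assuming pixels are stored as R,G,B,A bytes, one shuffle can copy each pixel's alpha into the low byte of both of its 16-bit words while zeroing the high bytes. The function names and the exact output layout are assumptions chosen for illustration; the `target` attribute is only there so the snippet builds with GCC/clang without `-mssse3`.

```cpp
#include <cassert>
#include <cstdint>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("ssse3")))
#endif
inline __m128i broadcast_alpha_words(__m128i rgba) {
    // A control byte with the high bit set (0x80) produces a zero byte.
    const char Z = static_cast<char>(0x80);
    // For pixel k (bytes 4k..4k+3): output words are [alpha, 0] [alpha, 0].
    const __m128i ctrl = _mm_setr_epi8(3, Z, 3, Z, 7, Z, 7, Z,
                                       11, Z, 11, Z, 15, Z, 15, Z);
    return _mm_shuffle_epi8(rgba, ctrl);
}

// Small usage check (hypothetical helper): alpha bytes are 40, 80, 120, 160.
inline void check_broadcast_alpha_words() {
    alignas(16) std::uint8_t in[16];
    alignas(16) std::uint16_t out[8];
    for (int i = 0; i < 16; ++i) in[i] = static_cast<std::uint8_t>((i + 1) * 10);
    const __m128i px = _mm_load_si128(reinterpret_cast<const __m128i*>(in));
    _mm_store_si128(reinterpret_cast<__m128i*>(out), broadcast_alpha_words(px));
    for (int k = 0; k < 4; ++k) {
        assert(out[2 * k] == in[4 * k + 3]);
        assert(out[2 * k + 1] == in[4 * k + 3]);
    }
}
```

Whether this layout is the one you want depends on how the rest of the loop arranges the colour channels; the point is only that a single pshufb replaces the AND/shift/OR sequence.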

On Skylake and later, it's probably a win to use SSE4.1 pblendvb to merge alpha instead of AND/ANDN/OR. (On Haswell, the 2 uops of pblendvb can only run on port 5. That might actually be ok because there are enough other uops that this won't create a shuffle bottleneck.)

On Skylake, non-VEX pblendvb is a single-uop instruction that runs on any port. (The VEX version is 2 uops for any port, so it's still strictly better than AND/ANDN/OR, but not as good as the SSE version. Although the SSE version uses an implicit XMM0 input, so it costs an extra movdqa instruction unless your loop only ever uses pblendvb with the same blend mask. Or if you unroll then it can maybe amortize that movdqa to set XMM0.)

Also, _mm_srli_epi16 by 7 and _mm_slli_epi16(color_odd, 8) could be just a single shift, with maybe an AND. Or a pblendvb avoids the need to clear garbage like you do before an OR.

I'm not sure if you could use _mm_mulhrs_epi16 to mul-and-shift, but probably not. It isn't the right shift, and the +1 for "rounding" isn't what you want.

Obviously an AVX2 version of this could do twice as much work per instruction, giving a factor 2 speedup on Haswell / Skylake for the main loop. Probably somewhat neutral on Ryzen, where 256b instructions decode into 2 uops. (Or more for lane-crossing shuffles, but you don't have those.)

The worst case cleanup would have to run more times, but this should still be negligible.

Instead of blending in the alpha-channel, you could just set one factor to `255` (thus getting `255*alpha/255==alpha`) — chtz, Jun 04 '19 at 13:35
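
A minimal SSE2-only sketch of that suggestion, not the benchmarked implementation: the alpha factor is broadcast to every channel of a pixel, then the factor in the alpha lane is forced to 255 with an OR, so no final blend is needed. It assumes R,G,B,A byte order and a pixel count that is a multiple of 4; the function names are hypothetical. The division by 255 uses the multiply-by-0x8081 and shift-by-7 form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Scales two pixels held as eight 16-bit channels (R,G,B,A,R,G,B,A).
inline __m128i scale_two_pixels(__m128i c16, __m128i alpha_255, __m128i inv255) {
    // Broadcast the alpha word (index 3) within each 64-bit half.
    __m128i a = _mm_shufflelo_epi16(c16, _MM_SHUFFLE(3, 3, 3, 3));
    a = _mm_shufflehi_epi16(a, _MM_SHUFFLE(3, 3, 3, 3));
    // Force the factor in the alpha lane to 255, so alpha * 255 / 255 == alpha.
    a = _mm_or_si128(a, alpha_255);
    const __m128i prod = _mm_mullo_epi16(c16, a);  // <= 255 * 255, fits in 16 bits
    return _mm_srli_epi16(_mm_mulhi_epu16(prod, inv255), 7);  // prod / 255
}

// Premultiplies `pixels` RGBA pixels in place; `pixels` must be a multiple of 4.
inline void premultiply_sse2(std::uint8_t* data, std::size_t pixels) {
    assert(pixels % 4 == 0);
    const __m128i zero = _mm_setzero_si128();
    const __m128i inv255 = _mm_set1_epi16(static_cast<short>(0x8081));
    // 0x00FF in the alpha word of each pixel (words 3 and 7), zero elsewhere.
    const __m128i alpha_255 = _mm_set_epi16(0x00FF, 0, 0, 0, 0x00FF, 0, 0, 0);
    for (std::size_t i = 0; i < pixels; i += 4) {
        __m128i* p = reinterpret_cast<__m128i*>(data + 4 * i);
        const __m128i px = _mm_loadu_si128(p);
        const __m128i lo = scale_two_pixels(_mm_unpacklo_epi8(px, zero), alpha_255, inv255);
        const __m128i hi = scale_two_pixels(_mm_unpackhi_epi8(px, zero), alpha_255, inv255);
        _mm_storeu_si128(p, _mm_packus_epi16(lo, hi));
    }
}
```

This unpacks to 16 bits per channel, so it does more multiplies per pixel than the channel-splitting approach; it is only meant to show that the alpha lane can ride through the same arithmetic unchanged.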

chtz · Answer 2 · 2019-06-07T17:23:10.427

I was playing around a bit with this. I think the best solution is to split the input from two registers into channels of 16-bit integers (i.e., 8-bit integers interleaved with 0x00). Then do the actual scaling (taking only 6 multiplications + 3 shifts for 8 pixels, instead of 8 + 4 in your original approach), and then re-join the channels into pixels.

Proof of concept (untested), assuming the input is aligned and the number of pixels is a multiple of 8. Version 2.0 (see history for previous version):

    #include <smmintrin.h> // SSE4.1, needed for _mm_blend_epi16

    // length: number of __m128i vectors (4 pixels each); an odd trailing vector is skipped
    void alpha_premultiply(__m128i *input, int length)
    {
        for(__m128i* last = input + (length & ~1); input!=last; input+=2)
        {
            // load data and split channels:
            __m128i abgr = _mm_load_si128(input);
            __m128i ABGR = _mm_load_si128(input+1);
            __m128i __ab = _mm_srli_epi32(abgr,16);
            __m128i GR__ = _mm_slli_epi32(ABGR,16);
            __m128i ABab = _mm_blend_epi16(ABGR, __ab, 0x55);
            __m128i GRgr = _mm_blend_epi16(GR__, abgr, 0x55);
            // the casts only avoid narrowing warnings; the bit patterns are unchanged
            __m128i A_a_ = _mm_and_si128(ABab, _mm_set1_epi16(static_cast<short>(0xFF00)));
            __m128i G_g_ = _mm_and_si128(GRgr, _mm_set1_epi16(static_cast<short>(0xFF00)));
            __m128i R_r_ = _mm_slli_epi16(GRgr, 8);
            __m128i B_b_ = _mm_slli_epi16(ABab, 8);

            // actual alpha-scaling:
            __m128i inv = _mm_set1_epi16(static_cast<short>(0x8081)); // = ceil((1<<(16+7))/255.0)
            G_g_ = _mm_mulhi_epu16(_mm_mulhi_epu16(G_g_, A_a_), inv);
            // shift 7 to the right and 8 to the left, or shift 1 to the left and mask:
            G_g_ = _mm_and_si128(_mm_add_epi16(G_g_, G_g_), _mm_set1_epi16(static_cast<short>(0xFF00)));
            __m128i _R_r = _mm_mulhi_epu16(_mm_mulhi_epu16(R_r_, A_a_), inv);
            _R_r = _mm_srli_epi16(_R_r,7);
            __m128i _B_b = _mm_mulhi_epu16(_mm_mulhi_epu16(B_b_, A_a_), inv);
            _B_b = _mm_srli_epi16(_B_b,7);

            // re-assemble channels:
            GRgr = _mm_or_si128(_R_r, G_g_);
            ABab = _mm_or_si128(A_a_, _B_b);

            __m128i __GR = _mm_srli_epi32(GRgr, 16);
            __m128i ab__ = _mm_slli_epi32(ABab, 16);

            ABGR = _mm_blend_epi16(ABab, __GR, 0x55);
            abgr = _mm_blend_epi16(ab__, GRgr, 0x55);

            // store result
            _mm_store_si128(input, abgr);
            _mm_store_si128(input+1, ABGR);
        }
    }

Variable names use _ to mark a 0, and lowest address byte is on the right (to be less confusing with shift and bit-operations). Each register will hold 4 sequential pixels, or 4+4 interleaved channels. Lower and uppercase letters are from different input locations. (Godbolt: https://godbolt.org/z/OcxAfJ)
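
The division by 255 in the scaling step relies on the constant 0x8081 = ceil(2^23 / 255): a high multiply by 0x8081 followed by a right shift of 7 is the same as `(x * 0x8081) >> 23`. A quick scalar sketch to check that this is exact for every product of two 8-bit values (the function names are made up for this check):

```cpp
#include <cassert>
#include <cstdint>

// Scalar equivalent of _mm_mulhi_epu16(x, 0x8081) followed by a shift right by 7.
// Valid for 16-bit inputs; the 32-bit product cannot overflow in that range.
inline std::uint32_t div255_by_multiply(std::uint32_t x) {
    assert(x <= 0xFFFFu);
    return (x * 0x8081u) >> 23;
}

// Returns true if the trick matches x / 255 for every x = c * a with 8-bit c and a.
inline bool div255_trick_is_exact() {
    for (std::uint32_t c = 0; c <= 255; ++c) {
        for (std::uint32_t a = 0; a <= 255; ++a) {
            const std::uint32_t x = c * a;
            if (div255_by_multiply(x) != x / 255u) return false;
        }
    }
    return true;
}
```

The result is a truncating division, so it matches the integer `/ 255` of the scalar loop rather than a rounded result.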

On Haswell (or earlier), this would bottleneck on port 0 (shifts and multiplications), but with SSSE3 you could replace all shifts by 8 and 16 bits with _mm_alignr_epi8. And it would be better to leave _R_r and _B_b at the lower bytes (this uses a pand instead of a psllw, but requires shifting A_a_ to _A_a). Possible pitfall: clang replaces _mm_alignr_epi8 with the corresponding shift instructions: https://godbolt.org/z/BhEZoV (maybe there are flags to prevent clang from replacing these). GCC uses palignr: https://godbolt.org/z/lu-jNQ

On Skylake this might be optimal as it is (except for porting to AVX2, of course). There are 8 shifts, 6 multiplications and 1 addition, i.e. 15 operations on ports 0 and 1. Furthermore, 4 blends on port 5, and 5 and/or operations (4 on p5 and another on either p0 or p1), i.e., 8 uops per port for 8 pixels (or 16 pixels with AVX2).

Code should be very easy to port to AVX2 (and using AVX1 alone will save some register copies). Finally, to make the code SSE2 compatible, only the blend instructions need to be replaced by corresponding and+or operations.
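
For the SSE2 fallback, the `_mm_blend_epi16(a, b, 0x55)` calls could be replaced along these lines. This is a sketch with a hypothetical helper name, covering only the 0x55 mask used in the code above:

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 only

// SSE2 stand-in for _mm_blend_epi16(a, b, 0x55):
// even-numbered 16-bit words come from b, odd-numbered words from a.
inline __m128i blend_epi16_0x55_sse2(__m128i a, __m128i b) {
    const __m128i even_words = _mm_set1_epi32(0x0000FFFF);
    return _mm_or_si128(_mm_and_si128(even_words, b),      // keep even words of b
                        _mm_andnot_si128(even_words, a));  // keep odd words of a
}

// Small usage check (hypothetical helper).
inline void check_blend_sse2() {
    alignas(16) std::uint16_t w[8];
    const __m128i r = blend_epi16_0x55_sse2(_mm_set1_epi16(0x1111),
                                            _mm_set1_epi16(0x2222));
    _mm_store_si128(reinterpret_cast<__m128i*>(w), r);
    for (int i = 0; i < 8; ++i) assert(w[i] == (i % 2 == 0 ? 0x2222 : 0x1111));
}
```

This costs three logical operations per blend instead of one instruction, so it is a compatibility fallback rather than an optimization.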

You can get `__m128i GRgr` from left-shift + `pblendw`, then mask it with AND and ANDN for `_R_r` and `G_g_`. `pblendw` is 1 uop for port 5 on Intel, but you don't have any shuffles and only a few `add` so this looks good. (And on Ryzen it's a single uop for any of 3 ports.) It's too bad there isn't a per-element `punpcklbw` inside each DWORD; that would have been perfect for some of this. But we can't do that until AVX512VBMI `vpermt2b`; 2-input shuffles are rare. — Peter Cordes, Jun 07 '19 at 06:26
This works even better for splitting up `_B_b` and `_A_a` on input because once you have `ABab`, you mask it for `_B_b` and `_mm_srli_epi16(ABab, 8)` for `_A_a`, without having to mask to create `A_a_` first. — Peter Cordes, Jun 07 '19 at 06:37
@PeterCordes As always, thanks for the comments! I think the only point I don't fully agree with is that I still need `A_a_` (at least for re-joining at the end). I was also considering shifting the input `R_r_` and `B_b_` to the high bytes as well. This would save shifting `A_a_` to `_A_a` (and R, B would not need masking). I guess I'll provide a v2.0 later. — chtz, Jun 07 '19 at 07:51
Ah I missed that `A_a_` was used later. Maybe use `pblendvb` with `ABab` for that; the non-VEX version is single-uop on Skylake. (And we don't know about Ryzen; Agner Fog is missing it but `blendvps` is 1 uop). But it's 2 uops on earlier CPUs. — Peter Cordes, Jun 07 '19 at 07:54
