6

I have the function below:

void CopyImageBitsWithAlphaRGBA(unsigned char *dest, const unsigned char *src, int w, int stride, int h,
    unsigned char minredmask, unsigned char mingreenmask, unsigned char minbluemask, unsigned char maxredmask, unsigned char maxgreenmask, unsigned char maxbluemask)
{
    auto pend = src + w * h * 4;
    for (auto p = src; p < pend; p += 4, dest += 4)
    {
        dest[0] = p[0]; dest[1] = p[1]; dest[2] = p[2];
        if ((p[0] >= minredmask && p[0] <= maxredmask) || (p[1] >= mingreenmask && p[1] <= maxgreenmask) || (p[2] >= minbluemask && p[2] <= maxbluemask))
            dest[3] = 255;
        else
            dest[3] = 0;
    }
}

What it does is copy a 32-bit bitmap from one memory block to another, setting the alpha channel to fully transparent when the pixel color falls within a certain color range.

How do I make this use SSE/AVX in VC++ 2017? Right now it's not generating vectorized code. Failing an automatic way of doing it, what functions can I use to do this myself?

Because really, I'd imagine testing if bytes are in a range would be one of the most obviously useful operations possible, but I can't see any built in function to take care of it.

Blindy
  • Autovectorization can be triggered with a small change: https://godbolt.org/g/aMZJ5m (I don't like the result but it *is* vectorized) – harold Mar 27 '18 at 15:31
  • Interesting, but that doesn't seem to trigger auto-vectorization in VC++ either, only in clang. Any ideas? – Blindy Mar 27 '18 at 15:53
  • As a follow up, the message I'm getting is reason 1301, the stride is not 1. I mean it's obviously not 1, I want it to work in groups of 4. I even tried casting it to `uint32_t` and casting back to `uint8_t` for the array access and that also didn't work, same 1301. – Blindy Mar 27 '18 at 16:03
  • @user703016, which is fine, but do you have any ideas as to what functions I need to use to get my intended result? – Blindy Mar 27 '18 at 16:58
  • Can you leave alpha unmodified instead of setting to 255, in that branch of the `if`? In the text you only mention clearing alpha, not forcing to opaque in the other case. – Peter Cordes Mar 27 '18 at 17:36

3 Answers

7

I don't think you're going to get a compiler to auto-vectorize as well as you can do by hand with Intel's intrinsics. (err, as well as I can do by hand anyway :P).

Possibly once we manually vectorize it, we can see how to hand-hold a compiler with scalar code that works that way, but we really need packed-compare into a 0/0xFF with byte elements, and it's hard to write something in C that compilers will auto-vectorize well. The default integer promotions mean that most C expressions actually produce 32-bit results, even when you use uint8_t, and that often tricks compilers into unpacking 8-bit to 32-bit elements, costing a lot of shuffles on top of the automatic factor of 4 throughput loss (fewer elements per register), like in @harold's small tweak to your source.
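For reference, the kind of branchless scalar formulation meant here (a sketch, keeping the question's OR of the three per-component checks) would be something like the loop below; it's not guaranteed that any particular compiler will vectorize it the way we want, which is the whole problem:

// Branchless scalar sketch: compute alpha = 0x00 / 0xFF without an if.
// Compilers may still widen everything to 32-bit elements and shuffle a lot.
for (auto p = src; p < pend; p += 4, dest += 4)
{
    dest[0] = p[0]; dest[1] = p[1]; dest[2] = p[2];
    unsigned in_range = ((p[0] >= minredmask)   & (p[0] <= maxredmask))
                      | ((p[1] >= mingreenmask) & (p[1] <= maxgreenmask))
                      | ((p[2] >= minbluemask)  & (p[2] <= maxbluemask));
    dest[3] = (unsigned char)(0 - in_range);   // in_range is 0 or 1 -> alpha 0x00 or 0xFF
}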


SSE/AVX (before AVX512) has signed comparisons for SIMD integer, not unsigned. But you can range-shift things to signed -128..127 by subtracting 128. XOR (add-without-carry) is slightly more efficient on some CPUs, so you actually just XOR with 0x80 to flip the high bit. But mathematically you're subtracting 128 from a 0..255 unsigned value, giving a -128..127 signed value.
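With intrinsics, that range-shift is a single XOR per vector. A minimal sketch, with v being a vector of pixel bytes:

// flip the high bit of every byte: unsigned 0..255 becomes signed -128..127
__m256i shifted = _mm256_xor_si256(v, _mm256_set1_epi8((char)0x80));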

It's even still possible to implement the "unsigned compare trick" of (x-min) < (max-min). (For example, detecting alphabetic ASCII characters). As a bonus, we can bake the range-shift into that subtract. If x<min, it wraps around and becomes a large value greater than max-min. This obviously works for unsigned, but it does in fact work (with a range-shifted max-min) with SSE/AVX2 signed-compare instructions. (A previous version of this answer claimed this trick only worked if max-min < 128, but that's not the case. x-min can't wrap all the way around and become lower than max-min, or get into that range if it started above max).
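The scalar form of that trick is the classic single-compare range check, e.g. for a char c and lower-case ASCII letters:

// one unsigned compare checks 'a' <= c <= 'z': (c - 'a') wraps to a huge value if c < 'a'
bool is_lower = (unsigned)(c - 'a') <= (unsigned)('z' - 'a');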

An earlier version of this answer had code that made the range exclusive, i.e. not including the ends, so even redmin=0 / redmax=255 would exclude pixels with red=0 or red=255. But I solved that by comparing the other way (thanks to ideas from @Nejc's and @chtz's answers).

@chtz's idea of using a saturating add/sub instead of a compare is very cool. If you arrange things so saturation means in-range, it works for an inclusive range. (And you can set the Alpha component to a known value by choosing a min/max that makes all 256 possible inputs in-range). This lets us avoid range-shifting to signed, because unsigned saturation is available.

We can combine the sub/cmp range-check with the saturation trick to do sub (wraps on out-of-bounds low) / subs (only reaches zero if the first sub didn't wrap). Then we don't need an andnot or or to combine two separate checks on each component; we already have a 0 / non-zero result in one vector.

So it only takes two operations to give us a 32-bit value for the whole pixel that we can check. Iff all 3 RGB components are in-range, that element will have a specific value. (Because we've arranged for the Alpha component to already give a known value, too). If any of the 3 components are out-of-range, it will have some other value.

If you do this the other way, so saturation means out-of-range, then you have an exclusive range in that direction, because you can't choose a limit such that no value reaches 0 or reaches 255. You can always saturate the alpha component to give yourself a known value there, regardless of what it means for the RGB components. An exclusive range would let you abuse this function to be always-false by choosing a range that no pixel could ever match. (Or if there's a third condition, besides per-component min/max, then maybe you want an override).


The obvious thing would be to use a packed-compare instruction with 32-bit element size (_mm256_cmpeq_epi32 / vpcmpeqd) to generate a 0xFF or 0x00 (which we can apply / blend into the original RGB pixel value) for in/out of range.

// AVX2 core idea: wrapping-compare trick with saturation to achieve unsigned compare
__m256i tmp = _mm256_sub_epi8(src, min_values);       // wraps to high unsigned if below min
__m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min);  // unsigned saturation to 0 means in-range
__m256i new_alpha = _mm256_cmpeq_epi32(RGB_inrange, _mm256_setzero_si256());

// then blend the high byte of each element with RGB from the src vector
__m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, _mm256_set1_epi32(0x00FFFFFF));  // alpha from new_alpha, RGB from src

Note that a non-AVX SSE version would only need one MOVDQA instruction to copy src; after that, the same register is the destination of every instruction.

Also note that you could saturate the other direction: add then adds (with (255-max) and then (max-min)) to saturate to 0xFF for in-range. This could be useful with AVX512BW if you use zero-masking with a fixed mask (e.g. for alpha) or variable mask (for some other condition) to exclude a component based on some other condition. AVX512BW zero-masking for the sub/subs version would consider components in-range even when they aren't, which could also be useful.
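As a sketch of that other direction (not from the Godbolt link; max_compl and max_minus_min are per-component constant vectors holding 255-max and max-min respectively):

// saturate to 0xFF for in-range instead of 0x00
__m256i t              = _mm256_add_epi8(src, max_compl);       // wraps to a small value if component > max
__m256i RGB_inrange_ff = _mm256_adds_epu8(t, max_minus_min);    // saturates to 0xFF iff min <= component <= max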


But extending that to AVX512 requires a different approach: AVX512 compares produce a bit-mask (in a mask register), not a vector, so we can't turn around and use the high byte of each 32-bit compare result separately.

Instead of cmpeq_epi32, we can produce the value we want in the high byte of each pixel using carry/borrow from a subtract, which propagates from the low byte up into the high byte.

0x00000000 - 1 = 0xFFFFFFFF     # high byte = 0xFF = new alpha
0x00?????? - 1 = 0x00??????     # high byte = 0x00 = new alpha
Where ?????? has at least one non-zero bit, so it's a 32-bit number >= 1 and <= 0x00FFFFFF
Remember we choose an alpha range that makes the high byte always zero

i.e. _mm256_sub_epi32(RGB_inrange, _mm256_set1_epi32(1)). We only need the high byte of each 32-bit element to have the alpha value we want, because we use a byte-blend to merge it with the source RGB values. For AVX512, this avoids a VPMOVM2D zmm1, k1 instruction to convert a compare result back into a vector of 0/-1, or (much more expensive) to interleave each mask bit with 3 zeros to use it for a byte-blend.

This sub instead of cmp has a minor advantage even for AVX2: sub_epi32 runs on more ports on Skylake (p0/p1/p5 vs. p0/p1 for pcmpgt/pcmpeq). On all other CPUs, vector integer add/sub run on the same ports as vector integer compare. (Agner Fog's instruction tables).

Also, if you compile _mm256_cmpeq_epi32() with -march=native on a CPU with AVX512, or otherwise enable AVX512 and then compile normal AVX2 intrinsics, some compilers will stupidly use AVX512 compare-into-mask and then expand back to a vector instead of just using the VEX-coded vpcmpeqd. Thus, we use sub instead of cmp even for the _mm256 intrinsics version, because I already spent the time to figure it out and show that it's at least as efficient in the normal case of compiling for regular AVX2. (Although _mm256_setzero_si256() is cheaper than set1(1); vpxor can zero a register cheaply instead of loading a constant, but this setup happens outside the loop.)

#include <immintrin.h>

#ifdef __AVX2__
// inclusive min and max
__m256i  setAlphaFromRangeCheck_AVX2(__m256i src, __m256i mins, __m256i max_minus_min)
{
    __m256i tmp = _mm256_sub_epi8(src, mins);   // out-of-range (below min) wraps to a high unsigned value

    // (x-min) <= (max-min)  equivalent to:
    // (x-min) - (max-min) saturates to zero
    __m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min);
    // 0x00000000 for in-range pixels, 0x00?????? (some higher value) otherwise

    // this has minor advantages over compare against zero, see full comments on Godbolt    
    __m256i new_alpha = _mm256_sub_epi32(RGB_inrange, _mm256_set1_epi32(1));
    // 0x00000000 - 1  = 0xFFFFFFFF
    // 0x00?????? - 1  = 0x00??????    high byte = new alpha value

    const __m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF);  // blend mask
    // without AVX512, the only byte-granularity blend is a 2-uop variable-blend with a control register
    // On Ryzen, it's only 1c latency, so probably 1 uop that can only run on one port.  (1c throughput).
    // For 256-bit, that's 2 uops of course.
    __m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, RGB_mask);  // RGB from src, 0/FF from new_alpha

    return alpha_replaced;
}
#endif  // __AVX2__

Set up vector args for this function and loop over your array with _mm256_load_si256 / _mm256_store_si256. (Or loadu/storeu if you can't guarantee alignment.)
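A possible caller (a sketch, not the exact one from the Godbolt link; minR/maxR etc. stand for the question's per-component limits). It broadcasts the limits into vectors and loops 8 pixels (32 bytes) at a time, assumes pixels are laid out R,G,B,A like the question's code, that w*h is a multiple of 8, and ignores stride the same way the question's scalar loop does:

#ifdef __AVX2__
void CopyImageBitsWithAlphaRGBA_AVX2(unsigned char *dest, const unsigned char *src, int w, int h,
    unsigned char minR, unsigned char minG, unsigned char minB,
    unsigned char maxR, unsigned char maxG, unsigned char maxB)
{
    // little-endian dword built from the R,G,B,A bytes in memory: low byte = red
    const __m256i mins = _mm256_set1_epi32(
        (int)((unsigned)minR | ((unsigned)minG << 8) | ((unsigned)minB << 16)));          // alpha min = 0
    const __m256i max_minus_min = _mm256_set1_epi32((int)(
        (unsigned)(maxR - minR) | ((unsigned)(maxG - minG) << 8)
        | ((unsigned)(maxB - minB) << 16) | 0xFF000000u));   // alpha range 0..255, so alpha always ends up 0 after sub/subs

    size_t bytes = (size_t)w * h * 4;          // assumes w*h is a multiple of 8 pixels
    for (size_t i = 0; i < bytes; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dest + i), setAlphaFromRangeCheck_AVX2(v, mins, max_minus_min));
    }
}
#endif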

This compiles very efficiently (Godbolt Compiler explorer) with gcc, clang, and MSVC. (AVX2 version on Godbolt is good, AVX512 and SSE versions are still a mess, not all the tricks applied to them yet.)

;; MSVC's inner loop from a caller that loops over an array with it:
;; see the Godbolt link
$LL4@:
    vmovdqu ymm3, YMMWORD PTR [rdx+rax*4]
    vpsubb   ymm0, ymm3, ymm7
    vpsubusb ymm1, ymm0, ymm6
    vpsubd   ymm2, ymm1, ymm5
    vpblendvb ymm3, ymm2, ymm3, ymm4
    vmovdqu YMMWORD PTR [rcx+rax*4], ymm3
    add      eax, 8
    cmp      eax, r8d
    jb       SHORT $LL4@

So MSVC managed to hoist the constant setup after inlining. We get similar loops from gcc/clang.

The loop has 4 vector ALU instructions, one of which takes 2 uops. Total 5 vector ALU uops. But total fused-domain uops on Haswell/Skylake = 9 with no unrolling, so with luck this can run at 32 bytes (1 vector) per 2.25 clock cycles. It could come close to actually achieving that with data hot in L1d or L2 cache, but L3 or memory would be a bottleneck. With unrolling, it could maybe bottleneck on L2 cache bandwidth.

An AVX512 version (also included in the Godbolt link) only needs 1 uop to blend, and could run at more vectors per clock, so more than twice as fast using 512-bit vectors.
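Roughly, the AVX512BW version looks like this (a sketch of the approach described above, not the exact code from the Godbolt link): the same sub / subs / sub-1 trick, followed by a single-uop byte blend (vpblendmb) with a constant mask register that selects the alpha byte of each dword.

#ifdef __AVX512BW__
__m512i setAlphaFromRangeCheck_AVX512(__m512i src, __m512i mins, __m512i max_minus_min)
{
    __m512i tmp         = _mm512_sub_epi8(src, mins);
    __m512i RGB_inrange = _mm512_subs_epu8(tmp, max_minus_min);                 // 0 iff all components in-range
    __m512i new_alpha   = _mm512_sub_epi32(RGB_inrange, _mm512_set1_epi32(1));  // high byte = 0xFF or 0x00

    const __mmask64 alpha_bytes = 0x8888888888888888ULL;        // byte 3 of every dword
    return _mm512_mask_blend_epi8(alpha_bytes, src, new_alpha); // RGB from src, alpha from new_alpha
}
#endif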

Peter Cordes
  • FWIW, the new version doesn't change the alpha at all, it's fully solid. – Blindy Apr 02 '18 at 16:13
  • @Blindy: Did you copy the caller from my Godbolt link? I forgot to update it, just the function itself. So the min/max conditions were maybe including every pixel? When you said "doesn't change", did you actually mean it sets every pixel's alpha to 0xFF? So it's different from the input pixels, but doesn't depend on what it should? – Peter Cordes Apr 03 '18 at 19:31
3

This is one possible way to make this function work with SSE instructions. I used SSE instead of AVX because I wanted to keep the answer simple. Once you understand how the solution works, rewriting the function with AVX intrinsics should not be much of a problem though.

EDIT: please note that my approach is very similar to the one by @PeterCordes, but his code should be faster because he uses AVX2. If you want to rewrite the function below with AVX2 intrinsics, change the step value to 8.

#include <immintrin.h>
#include <cstdint>

void CopyImageBitsWithAlphaRGBA(
  unsigned char *dest,
  const unsigned char *src, int w, int stride, int h,
  unsigned char minred, unsigned char mingre, unsigned char minblu,
  unsigned char maxred, unsigned char maxgre, unsigned char maxblu)
{
  char low = (char)0x80; // -128
  char high = 0x7f;      // 127
  char mnr = *(char*)(&minred) - low;
  char mng = *(char*)(&mingre) - low;
  char mnb = *(char*)(&minblu) - low;
  // mask each byte before OR'ing so sign-extension of the (possibly negative) chars
  // doesn't pollute the higher bytes
  int32_t lowest = (int32_t)((uint32_t)(uint8_t)mnr | ((uint32_t)(uint8_t)mng << 8)
                           | ((uint32_t)(uint8_t)mnb << 16) | ((uint32_t)(uint8_t)low << 24));

  char mxr = *(char*)(&maxred) - low;
  char mxg = *(char*)(&maxgre) - low;
  char mxb = *(char*)(&maxblu) - low;
  int32_t highest = (int32_t)((uint32_t)(uint8_t)mxr | ((uint32_t)(uint8_t)mxg << 8)
                            | ((uint32_t)(uint8_t)mxb << 16) | ((uint32_t)(uint8_t)high << 24));

  // SSE
  int step = 4;                       // pixels per __m128i (16 bytes)
  int sse_width = (w / step) * step;

  for (int y = 0; y < h; ++y)
  {
    for (int x = 0; x < w; x += step)
    {
      if (x == sse_width)
      {
        x = w - step;                 // overlap the last vector with the previous one to handle the tail
      }

      int ptr_offset = y * stride + x * 4;   // x is in pixels, 4 bytes per pixel (stride assumed to be the row pitch in bytes)
      const unsigned char* src_ptr = src + ptr_offset;
      unsigned char* dst_ptr = dest + ptr_offset;

      __m128i loaded = _mm_loadu_si128((const __m128i*)src_ptr);

      // subtract 128 from every 8-bit int: range-shift unsigned 0..255 to signed -128..127
      __m128i subtracted = _mm_sub_epi8(loaded, _mm_set1_epi8(low));

      // greater than top limit?
      __m128i masks_hi = _mm_cmpgt_epi8(subtracted, _mm_set1_epi32(highest));

      // lower than bottom limit?
      __m128i masks_lo = _mm_cmplt_epi8(subtracted, _mm_set1_epi32(lowest));

      // perform OR operation on both masks: a byte is 0xFF iff that component is out of range
      __m128i combined = _mm_or_si128(masks_hi, masks_lo);

      // are the 32-bit integers equal to zero? (i.e. all components of that pixel in range)
      __m128i eqzer = _mm_cmpeq_epi32(combined, _mm_setzero_si128());

      // move the compare result into the alpha byte position
      __m128i shifted = _mm_slli_epi32(eqzer, 24);

      // EDIT: fixed a bug: clear the alpha byte of the source, then OR in the new alpha
      __m128i alpha_unmasked = _mm_and_si128(loaded, _mm_set1_epi32(0x00ffffff));

      __m128i result = _mm_or_si128(alpha_unmasked, shifted);

      _mm_storeu_si128((__m128i*)dst_ptr, result);
    }
  }
}

EDIT: as @PeterCordes stated in the comments, the code included a bug that is now fixed.

Nejc
  • The OP wants to *clear* the alpha channel on pixels outside, not leave it unmodified. You're ORing with 0 or 0xFF, not blending with 0 or 0xFF. When you need to do that, it doesn't save anything (until AVX512) to do the compares the other way and set up for OR. This way needs an extra vector constant (setzero). – Peter Cordes Mar 27 '18 at 22:13
  • And note that you definitely want to use ADD `-128` or XOR, not `sub_epi8` for the range shift to signed. If the compiler doesn't optimize it to ADD, it can't `vpsubb xmm0, [src], xmm7` because `vpsubb` only allows the 2nd source operand to be memory, not the first source. (Mostly matters with AVX; with SSE it's better to MOVDQU load then `psubb xmm0, xmm_constant`, instead of copying the constant *and* using a micro-fused `pxor xmm0, [mem]`). https://stackoverflow.com/questions/35443424/associativity-gives-us-parallelizability-but-what-does-commutativity-give – Peter Cordes Mar 27 '18 at 22:15
  • thanks for both comments. You're right about blending, I will correct my code. – Nejc Mar 27 '18 at 22:29
  • BTW, I don't understand the *I didn't want to deal with the lack of less-or-equal intrinsic* remark. AVX2 isn't missing anything that SSE2 / SSE4 has, unless it's just an issue of intrinsics that reverse their args for you. The hardware only has `pcmpeq` and `pcmpgt` for integer, until AVX512 fantastically expands the choice of predicates for integer compares to include everything possible for signed and unsigned. – Peter Cordes Mar 27 '18 at 23:03
  • oh, I did a typo there: I meant less-than intrinsic. I could use greater-than with reversed arguments, of course, but then I would also have to check whether the elements are equal. – Nejc Mar 27 '18 at 23:29
  • `x < y` is the same condition as `y > x`, not `y >= x`. That's why there can be a `cmplt` intrinsic, but not a `cmplte`. `x <= y` is `!(x > y)`. Combining the compare results with `andnot` can do that for free, to make one end of the range inclusive and the other non-inclusive. (But @chtz's saturating add/sub way is even better) – Peter Cordes Mar 28 '18 at 14:10
2

Based on @PeterCordes' solution, but replacing the range-shift + compare with a saturating add and subtract:

// mins_compl shall be [255-minR, 255-minG, 255-minB, 0]
// maxs       shall be [maxR, maxG, maxB, 0]
__m256i  setAlphaFromRangeCheck(__m256i src, __m256i mins_compl, __m256i maxs)
{
    __m256i in_lo = _mm256_adds_epu8(src, mins_compl); // is 255 iff src+mins_compl>=255, i.e. src>=mins
    __m256i in_hi = _mm256_subs_epu8(src, maxs);       // is 0 iff src - maxs <= 0, i.e., src <= maxs

    __m256i inbounds_components = _mm256_andnot_si256(in_hi, in_lo);
    // per-component mask: 0xff iff (mins<=src && src<=maxs).
    // alpha-channel is always (~src & src) == 0

    // Use a 32-bit element compare to check that all 3 components are in-range
    __m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF);
    __m256i inbounds = _mm256_cmpeq_epi32(inbounds_components, RGB_mask);

    __m256i new_alpha = _mm256_slli_epi32(inbounds, 24);
    // alternatively _mm256_andnot_si256(RGB_mask, inbounds) ?

    // byte blends (vpblendvb) are at least 2 uops, and Haswell requires port5
    // instead clear alpha and then OR in the new alpha (0 or 0xFF)
    __m256i alphacleared = _mm256_and_si256(src, RGB_mask);   // off the critical path
    __m256i new_alpha_applied = _mm256_or_si256(alphacleared, new_alpha);

    return new_alpha_applied;
}

This saves on vpxor (no modification of src required) and one vpand (the alpha-channel is automatically 0 -- I guess that would be possible with Peter's solution as well by choosing the boundaries accordingly).
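A possible setup for the two constant vectors plus a simple loop (a sketch, not taken from the linked Godbolt/wandbox code; minR/maxR etc. stand for the question's per-component limits, and w*h is assumed to be a multiple of 8 pixels):

// mins_compl: 255-min per color component, 0 in the alpha byte; maxs: max per component, 0 in alpha
const __m256i mins_compl = _mm256_set1_epi32((int)(
    (unsigned)(255 - minR) | ((unsigned)(255 - minG) << 8) | ((unsigned)(255 - minB) << 16)));
const __m256i maxs = _mm256_set1_epi32((int)(
    (unsigned)maxR | ((unsigned)maxG << 8) | ((unsigned)maxB << 16)));

for (size_t i = 0; i < (size_t)w * h * 4; i += 32) {
    __m256i v = _mm256_loadu_si256((const __m256i*)(src + i));
    _mm256_storeu_si256((__m256i*)(dest + i), setAlphaFromRangeCheck(v, mins_compl, maxs));
}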

Godbolt link: apparently, neither gcc nor clang thinks it is worthwhile to re-use RGB_mask for both usages ...

Simple testing with SSE2 variant: https://wandbox.org/permlink/eVzFHljxfTX5HDcq (you can play around with the source and the boundaries)

chtz
  • That's amusing that clang broadcasts the constant into a register once, then uses it as a memory operand the next time. Should all sort itself out after inlining into a loop that hoists that one constant. And yes, `_mm256_andnot_si256` with RGB_mask is nice, thought of that myself after posting, hadn't yet come back to edit. – Peter Cordes Mar 28 '18 at 13:25
  • `iff` is the usual [abbreviation for if-and-only-if](https://en.wikipedia.org/wiki/If_and_only_if), not `iif`. I think that's what you mean, right? – Peter Cordes Mar 28 '18 at 13:27
  • saturating add/sub is nice! Haswell only runs them on p1/p5, not p0/p1/p5 like booleans, but Skylake runs them on all 3 ports. My solution can avoid the XOR only if the range between min and max is less than 128, but this avoids it always. – Peter Cordes Mar 28 '18 at 13:39
  • @PeterCordes yes, I meant if-and-only-if. I copied the wrong shift instruction from your godbolt-link (if you want to fix that as well) – chtz Mar 28 '18 at 14:42
  • Yeah, oops :P. Working on an AVX512 version for an edit; it's interesting because you have to work around the compare-into-mask to do mixed element-size stuff. You can't just ask for 4 mask bits from a 32-bit compare. Also, with `_mm256_andnot_si256(RGB_mask, inbounds)`, those three andnot/and/or optimize into a byte blend. `vpblendv` isn't better than 2 booleans, but it's better than 3 (especially on Skylake where it runs on 2p015, not just 2p5 for Haswell/BDW.) – Peter Cordes Mar 28 '18 at 15:40
  • Also realized that mine has exclusive boundary conditions, so it's impossible to choose a `min` and `max` that include either `0` or `0xff` as in-range for any component. That's what makes it possible to get alpha_cmp_result=0 regardless of input, xD. Yours is neat; being inclusive but still getting alpha=0 for free. – Peter Cordes Mar 28 '18 at 15:43
  • No wait a minute, your version is exclusive, too. With `max=0xff`, an input of `0xff` will give a subs result of 0, which means out-of-range. So it's not exactly what the OP asked for. – Peter Cordes Mar 28 '18 at 18:02
  • @PeterCordes I think that part is correct. `in_hi==0` means inside range -- that's why it gets andnot'ed. And if `src>max` the result will be `>0`, which will make the result `!=0xff`. – chtz Mar 28 '18 at 20:04
  • Finally got around to finishing an edit on my answer. Down to 4 insns / 5 uops for the ALU work, not including load/store + loop overhead. Should run better than one vector per 2 clocks on Skylake. – Peter Cordes Mar 31 '18 at 23:13