2

Given a __m128i which stores 16 chars, the even-index lane refers to even lane (i.e., lanes at 0, 2, 4, ..., 14), and odd-index lane refers to odd lane (i.e, lanes at 1, 3, 5, ... 15).

In my application, even/odd lane must be in given ranges. For example, suppose even_min is 1, even_max is 7, odd_min is 5, and odd_max is 10:

# valid
vec1: [1, 5, 6, 10, 2, 6, 4, 6, 2, 7, 4, 9, 2, 7, 4, 8] 

# invalid because 0-th (even) is greater than even_max
vec2: [8, 5, 6, 10, 2, 6, 4, 6, 2, 7, 4, 9, 2, 7, 4, 8]

How to check whether it is valid more efficiently?

My current solution is very straightforward by checking the two comparison results respectively:

  __m128i even_min = _mm_set1_epi8(xxx);
  __m128i even_max = _mm_set1_epi8(xxx);
  __m128i even_mask =
      _mm_set_epi8(0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1);

  __m128i evenRange = _mm_and_si128(_mm_cmpge_epi8(vec, even_min),
                                    _mm_cmple_epi8(vec, even_max));
  bool isEvenOk = _mm_testc_si128(evenRange, even_mask);

// the code for checking odd bytes is similar

Note that to compare unsigned chars using inclusive condition, two macros are defined as following:

#define _mm_cmpge_epi8(a, b) _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)

#define _mm_cmple_epi8(a, b) _mm_cmpge_epi8(b, a)
chenzhongpu
  • 6,193
  • 8
  • 41
  • 79
  • 1
    Probably interleave the even/odd constants so you have vectors with the correct per-lane mins and maxes. Also note that `_mm_cmpge_epi8` doesn't exist; before AVX-512, you only have two integer compare instructions: equal, and signed-greater. IIRC, saturating subtract can be useful to set up for a compare. As in [How to vectorize range check during block copy?](https://stackoverflow.com/q/49516867) – Peter Cordes May 04 '23 at 15:29
  • 1
    [How to count the number of bytes which lies in some range using SSE?](https://stackoverflow.com/q/10609188) shows there is `_mm_cmpgt_epi8` and `_mm_cmplt_epi8` ( which simply uses `pcmpgtb` with the operands reversed). So you can adjust your ranges to be exclusive instead of inclusive, or reverse your conditions to want a false result, like `(x > max) | (x < min)`. – Peter Cordes May 04 '23 at 15:36
  • 1
    Do you have different min/max values for every `__m128i` you want to check? i.e. how important is it to optimize the creation of the interleaved per-element min/max vectors? Do they come from scalar computations, so probably start in integer registers (not memory), so it might make sense to shift them together like `_mm_set1_epi16((odd_max << 8) | even_max)`, rather than `_mm_unpacklo_epi8` – Peter Cordes May 04 '23 at 15:46
  • 2
    In case if you later want to clamp the vector elements to these min/max limits, it might be beneficial to clamp first using `_mm_min_epi8`+`_mm_max_epi8`. Then you can test whether clamping actually changed any elements by applying `_mm_cmpeq_epi8` between the original vector and the clamped one. – Andrey Semashev May 04 '23 at 17:51
  • 1
    Are all ranges given at compile time (or at least determined before anything performance critical happens)? Can the range be larger than 128? Do you want to check only one `__m128i` at the time? What do you use the boolean result for? I think you can solve this with one `paddb` one `paddusb` and one `pmovmskb`. – chtz May 04 '23 at 21:27
  • @PeterCordes `_mm_cmpge_epi8` is a self-defined `#define` for inclusive comparing. But either inclusive or exclusive here is not a major concern. I would like to know whether it is possible to optimize this task by occurring less instructions (and CPU cycles). For every `__m128i`, the even/odd min/max is fixed. – chenzhongpu May 05 '23 at 01:22
  • If performance is your goal, how is inclusive vs exclusive not a major concern? IDK how you defined `_mm_cmpge_epi8`, but there's no single instruction that does it. So optimizing in terms of operations that the machine provides directly can let you see the real costs of what you're doing. – Peter Cordes May 05 '23 at 01:27
  • @chtz Yes, all ranges are given at compile time. Its range can be larger than 128, but it can be solved easily. If the boolean result is false, then we can terminate the checking safely. – chenzhongpu May 05 '23 at 01:31
  • @PeterCordes To compare unsigned chars, here is the macro I used: `#define _mm_cmpge_epi8(a, b) _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)`. – chenzhongpu May 05 '23 at 01:41
  • Well there you go, you could have been doing `vec == min(max(vec, a), b))` to get the whole thing done in 3 instructions like Andrey suggested, instead of your 5. Or with exclusive ranges, `(vmin-1)` which has better ILP (instruction-level parallelism), but would need a `movdqa` register-copy if you can't compile with AVX enabled. (Either way, 2 more uops to get a condition out of it, either `pmovmskb`/`test` (which can macro-fuse with a JCC), or `ptest` (which can't macro-fuse). Or actually, the final `&` could be done with `ptest` on the two separate cmp results. – Peter Cordes May 05 '23 at 01:56
  • Since the ranges are compile-time constants, if either one is right at the limit, like min == INT8_MIN, then it's always true and we can optimize away that check. If you had runtime variables, then you could use inclusive ranges and be looking for false, like `(v>max) | (v` means without AVX you still definitely need a `movdqa`, since `pcmpgtb` with reversed operands will need to overwrite a copy of the min vector. So on CPUs with mov-elim, it's a tradeoff of front-end throughput vs. latency. – Peter Cordes May 05 '23 at 01:58
  • @AndreySemashev Thanks. Does the `clamp` mean `_mm_unpacklo_epi8` so that two comparisons are done in one operation? I tried this intrinsic, and the performance is improved by about 20%. – chenzhongpu May 05 '23 at 02:48
  • `_mm_unpacklo_epi8` could be used to combine the min/max limits for even/odd elements. Or you could use `_mm_set1_epi16` like Peter suggested. Regardless how you combine the limits into vectors, clamping only requires one `_mm_min_epi8` and one `_mm_max_epi8`, which will both operate on even and odd elements at the same time. – Andrey Semashev May 05 '23 at 15:17
  • Also, note that `_mm_cmpge_epi8` is the same as `!_mm_cmplt_epi8`, meaning you didn't have to use `_mm_max_epu8`+`_mm_cmpeq_epi8` to implement it. You can simply invert the conditions on the result of `_mm_cmplt_epi8` (which will get translated to `pcmpgtb` and an appropriate inverted flags check). – Andrey Semashev May 05 '23 at 15:24

1 Answers1

2

Make one vector of per-lane min values, and another of per-lane maxes. For example, _mm_set1_epi16((odd_min<<8) | (uint8_t)even_min). (Note the cast to avoid sign-extension).

Then you only need one range-check. Which you should do more efficiently, not by emulating cmpge and cmple with 2 instructions each. A simple way, as suggested by Andrey in comments, is v == min(max(v, a), b), which is the same idea as your v == min(v, a).

Since you're using the same mins/maxes over many inputs, some extra setup work is worth it to make each range-check cheaper. The normal range-check trick of c - min < max-min uses an unsigned compare, but we can do it with SSE signed compares by flipping the MSB of both sides, i.e. adding or subtracting 0x80. This is like range-shifting unsigned to signed. This can be part of the same subtraction, c - min - 0x80 < max - min - 0x80 (signed compare). (Thanks, @amonakov, for the reminder that this was possible.)

// unsigned compare-trick, range-shifted for use with  pcmpgtb

// loop-invariant constants, set these up once
  __m128i mins = _mm_set1_epi16( ((odd_min<<8) | (uint8_t)even_min) ^ 0x8080);
  // if (odd_max == 0x7F && even_max == 0x7F){ ... }  // TODO: just check vec > mins
  __m128i maxes = _mm_set1_epi16( ((odd_max<<8) | (uint8_t)even_max) );
  __m128i rangelen = _mm_sub_epi8(maxes, mins);   // includes the 0x80 top bit from mins
   // compilers will constant-propagate through this, except maybe MSVC.  If that's a problem, write it a different way.

// Work inside the loop
  __m128i vsub = _mm_sub_epi8(vec, mins);
  __m128i vout_of_range = _mm_cmpgt_epi8(vsub, rangelen);
  // TODO: check for off-by-one errors in case I got this wrong, or inclusive vs. exclusive.
   // consider mins = 0^0x80, maxes = 1, rangelen=1^0x80 = -127.  
   // vec = 2: vsub = 2^0x80 = -126.  -126 > -127 so it's out-of-range (by 2; this range is exclusive at the top).

  bool isOk = !_mm_movemask_epi8(vout_of_range);  // ok if no bits set

@chtz suggests one paddb + paddusb + pmovmskb can be possible if the size of your range is less than 128. So in-range values won't have the MSB set in each byte but out-of-range values will end up greater than 128. (And can't wrap around because of saturation.) pmovmskb grabs the MSB of each byte, so it works without needing a compare result. psubb / pcmpgtb should be equally good on most CPUs. (Checking for != 0 is as cheap as == 0 for the bitmap result.)


Other ways: worse than sub / cmpgt, better than min/max/cmpeq

Other possibilities include (v < mins) | (v > maxes) and checking that no elements are true. _mm_movemask_epi8(or_result) == 0. This has better critical-path latency than min/max/cmpeq since we have two independent compares, not a chain of 3 operations. Both ways need a copy of the original vector (unless you compile with AVX to allow a separate destination).

Or (v > min-1) & (v < max+1), which is viable for compile-time-constant min/max. If min is already INT8_MIN, it's always true so it optimizes to just needing the other condition. Except that's a problem when even_min is -128 but odd_min is something else: there's no value that will make pcmpgtb always true for all inputs in the even lanes while still checking the odd lanes. I had been thinking the AND could be done as part of a ptest (_mm_test_*), but actually there's no _mm_test_all_ones. ZF is cleared if there's any non-zero bit in the 128-bit AND result. (And same for CF, based on the ANDN result.)

Or use cmpgt both times and invert one of the results as part of combining them, e.g. with _mm_andnot_si128 (pandn)


ptest is not very efficient on compare results since it decodes to 2 uops on most CPUs; pmovmskb + scalar cmp or test is also 2 uops (https://uops.info), and the cmp or test can macro-fuse with a branch if you're branching on it. ptest does avoid needing a temporary register and might save a movdqa register-copy if you're testing a vector that you also want to use later (not a comparison result), but usually good only if you're actually using its ability to check only some elements (like with your odd/even masks).

In your case, even with your strategy of two separate compares, probably better strategies would be 2x _mm_movemask_epi8 and (evens & (odds>>1) & 0x5555 == 0x5555. (0x5555 is 0b0101...0101, just testing the even elements).

Or _mm_srli_epi16(odds, 8) / _mm_and_si128(evens, shifted_odds) to get a vector where the even elements have the results you care about. (And the odd elements are zero because the logical shift produced zeros there, so _mm_movemask_epi8(and_result) == 0x5555 without needing to mask away the elements we don't care about.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 2
    Surely `vec - min - 0x80 > max - min - 0x80` works as well and boils down to 2 uops? – amonakov May 05 '23 at 05:01
  • 1
    @amonakov: Oh yes, the old range-check trick normally done with an unsigned compare. That should work to bake the signed to unsigned range-shift into the sub / cmp; nice one. I overlooked that possibility when considering the sub/cmp range-check. – Peter Cordes May 05 '23 at 05:17
  • Indeed, range-shifting everything is way better than my idea with using `paddusb` (I was trying to work-around the missing unsigned comparison). – chtz May 05 '23 at 09:11