1

This question is an extension of How to check if even/odd lanes are in given ranges using SIMD?.

Given a __m128i which stores 16 chars, the even-index lane refers to even lane (i.e., lanes at 0, 2, 4, ..., 14), and odd-index lane refers to odd lane (i.e, lanes at 1, 3, 5, ... 15). In my application, each group is disjoint, consisting of two adjacent lanes, and thus there are 8 groups.

In my application, there are some extra restrictions (known at compiling time) on every group. To be specific, if the even lane equals to some value (e.g., a), then the next odd lane should be in a given range (e.g., [b, c]); otherwise, the next odd lane can be arbitrary.

Suppose there are two restrictions:

  • 2, [1, 5]
  • 3, [4, 6]
# valid
vec = [2, 4, 3, 5, 1, 10, ...]

# invalid, because when 0-th (even) is 2, 6 is not in range of [1, 5]
vec = [2, 6, 3, 5, 1, 10, ...]

Any idea how to use SIMD intrinsics to check whether the given __m128i satisfies those restrictions?

chenzhongpu
  • 6,193
  • 8
  • 41
  • 79
  • If you have multiple restrictions, is there some easy way to compute the range `[b, c]` from `a`? If not, how many different restrictions can you have? (More than 2, more than 16?) Can you influence how your inputs are created, i.e., could you have them non-interleaved without extra operations? – chtz May 05 '23 at 10:13
  • The pair of `(a, [b, c])` are predefined. In this specific case, there are about 10 restrictions. @chtz – chenzhongpu May 05 '23 at 12:14
  • If you have multiple such conditions, you probably want to combine boolean vectors with SIMD ops like `_mm_and_si128` / `_mm_andnot_si128` / `_mm_or_si128` before one final `_mm_movemask_epi8`, instead of having multiple scalar masks to booleanize and combine. (Or multiple FLAGS conditions to materialize into integer booleans from `_mm_testz_si128`.) – Peter Cordes May 05 '23 at 15:44
  • If you have some that are any-element-true vs. all-elements-true, note that any-true is the same as all-false, so `andnot` might help in combining them. If you have some tests on just the odd elements to be checked with `movemask() & 0x5555 == 0x5555`, you might keep those separate. (Or `movemask() & 0x5555 == 0` is cheaper, can be done with one scalar `test` instruction.) – Peter Cordes May 05 '23 at 15:47
  • Do you want this across multiple vectors, like could it be useful to produce a bitmap from a pair of input `__m128i` vectors, if `_mm_packs_epi16` could be useful in lining up odds with evens? – Peter Cordes May 05 '23 at 16:03
  • These are once again signed ranges? – Peter Cordes May 05 '23 at 16:06
  • If the relevant `a` values are close together (not further than 15 apart), you may want to find the correct min/max-borders using a `pshufb`-lookup. You still leave a lot unclear in your question ... – chtz May 05 '23 at 16:18

2 Answers2

0

Here’s one possible implementation. Tested very little. It requires SSSE3 and SSE 4.1.

#include <emmintrin.h>  // SSE2
#include <tmmintrin.h>  // SSSE3
#include <smmintrin.h>  // SSE 4.1
#include <vector>
#include <assert.h>

// Assemble integer from 4 bytes
inline int intFromBytes( uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3 )
{
    uint32_t res = b0;
    res |= ( (uint32_t)b1 ) << 8;
    res |= ( (uint32_t)b2 ) << 16;
    res |= ( (uint32_t)b3 ) << 24;
    return (int)res;
}

// Pack the range into a single integer
// Bytes 0 and 1 are group key, same value in both bytes
// Byte 2 is minimum of the range
// Byte 3 is ( 0xFF - ( maximum of the range ) )
inline int makeRangeEntry( uint8_t k, uint8_t first, uint8_t last )
{
    assert( first <= last );
    return intFromBytes( k, k, first, 0xFF - last );
}

// Duplicate even-indexed bytes
inline __m128i moveldup_epi8( __m128i vec )
{
    const __m128i perm = _mm_setr_epi8( 0, 0, 2, 2, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 14, 14 );
    return _mm_shuffle_epi8( vec, perm );
}

// Create vector with the following bytes:
// vec[1], 255-vec[1], vec[3], 255-vec[3], ..
inline __m128i unpackComparison( __m128i vec )
{
    vec = _mm_srli_epi16( vec, 8 );
    __m128i high = _mm_sub_epi16( _mm_set1_epi16( 0x00FF ), vec );
    high = _mm_slli_epi16( high, 8 );
    return _mm_or_si128( vec, high );
}

// Compare unsigned bytes for a >= b
inline __m128i cmpge_epu8( __m128i a, __m128i b )
{
    __m128i ax = _mm_max_epu8( a, b );
    return _mm_cmpeq_epi8( ax, a );
}

// Duplicate even-indexed int16 lanes
inline __m128i moveldup_epi16( __m128i v )
{
    __m128i shifted = _mm_slli_epi32( v, 16 );
    return _mm_blend_epi16( shifted, v, 0b01010101 );
}

// Duplicate odd-indexed int16 lanes
inline __m128i movehdup_epi16( __m128i v )
{
    __m128i shifted = _mm_srli_epi32( v, 16 );
    return _mm_blend_epi16( shifted, v, 0b10101010 );
}

// Returns true when valid
// Use makeRangeEntry() function to initialize the vector
bool checkRangesSse( __m128i vec, const std::vector<int>& tests )
{
    // Unpack bytes into 2 vectors: keys with duplicate even lanes,
    // and values from odd lanes, duplicated with flipped signs
    const __m128i keys = moveldup_epi8( vec );
    const __m128i vals = unpackComparison( vec );

    __m128i failed = _mm_setzero_si128();

    for( int e : tests )
    {
        // Broadcast the entry, and unpack into 2 vectors
        const __m128i ev = _mm_set1_epi32( e );
        const __m128i entryKey = moveldup_epi16( ev );
        const __m128i entryRange = movehdup_epi16( ev );

        // Compare keys for equality
        const __m128i eq = _mm_cmpeq_epi8( keys, entryKey );
        // Compare values against the range
        // Because we inverted signs with (255-b), one comparison checks both ends of the range
        const __m128i cmp = cmpge_epu8( vals, entryRange );

        // Passed check when keys are different: ( eq == 0 )
        // Passed check when range is good: ( eq == 0xFFFF && cmp == 0xFFFF )
        // The failed check: ( eq == 0xFFFF && cmp != 0xFFFF )
        __m128i res = _mm_xor_si128( eq, cmp );
        res = _mm_and_si128( res, eq );
        failed = _mm_or_si128( failed, res );
    }

    return (bool)_mm_testz_si128( failed, failed );
}

Update: I’ve just noticed your ranges are known at compile-time.

Ideally, remove the std::vector argument from that function, and instead use a global variable of type const std::array<int, 10> if you have 10 of these ranges.

Also, if you only have 10 ranges, that array gonna take 40 bytes of memory. If you decorate that array with alignas(64) the complete array will be in a single cache line.

Soonts
  • 20,079
  • 9
  • 57
  • 130
  • Style note: I'd write `_mm_set1_epi16( 0xFF )` as `_mm_set1_epi16( 0x00FF )`, writing out the whole pattern that repeats, including the leading zeros. That saves the reader the small amount of mental effort to notice that it's `epi16` and mentally tack on the zeros to see the real pattern. – Peter Cordes May 05 '23 at 15:39
  • @PeterCordes Fixed. Also removed small broadcasts. Without AVX2 they emit too many instructions: `movsx`, `movd`, `punpcklbw`, `punpcklwd`, `pshufd` for a single `_mm_set1_epi8` instrinsic. – Soonts May 05 '23 at 16:48
  • I posted a different answer optimized for repeating the check of the same ranges over different `__m128i` input vectors. (With signed ranges like they were using in their last question). Even if you rearrange yours to preprocess ranges into vectors outside the loop, I think your accumulation with xor/and/or is slightly more expensive than mine which is just `and (even_match, odd_out_of_range)`/`or`. Other than that, about the same amount of work. Can you use `andnot` to combine your checks somehow? – Peter Cordes May 05 '23 at 17:38
0

In your previous question, you said you were using the same ranges over multiple __m128i inputs. Assuming that's true again, it's probably best to shuffle pairs of __m128i inputs into even vs. odd halves, so from then on we only need vertical SIMD operations.

If we want them in order, _mm_packus_epi16 is useful (although you have to _mm_and_si128 both inputs). Otherwise interleave them:

// split to odds/evens once, reusing these across multiple even_key, odd_range pairs
evens = _mm_blendv_epi8(_mm_slli_epi16(v1, 8), v0, _mm_set1_epi16(0x00ff));
odds  = _mm_blendv_epi8(v1, _mm_srli_epi16(v0, 8), _mm_set1_epi16(0x00ff)); 
...
  tmp = movemask(compare_result);
  tmp & 0x5555  // checks elements from v0
  tmp & 0xaaaa  // checks elements from v1

// Or low/high, costing more setup work
__m128i evenmask = _mm_set1_epi16(0x00ff);
evens = _mm_packus_epi16(_mm_and_si128(v0, evenmask), _mm_and_si128(v1, evenmask));
odds  = _mm_packus_epi16(_mm_srli_epi16(v0, 8),       _mm_srli_epi16(v1, 8));
...
   tmp = movemask(compare_result);
   (uint8_t)tmp  // checks elements from v0, like  cmp al, 0xff to check all-set
   tmp>>8        // checks elements from v1, like  cmp ah, 0xff or test eax, 0xff00

If you only have 1 vector, you can use v1 = v0.

For the actual conditional range-checking, use the same setup as your previous question:

// loop-invariant constants, set these up once
  __m128i mins =  _mm_set1_epi8( min - 0x80);
  __m128i rangelen = _mm_set1_epi8( max - (min-0x80) );
  __m128i even_key = _mm_set1_epi8( key );


__m128i odds_out_of_range(__m128i evens, __m128i odds, __m128i even_key, __m128i mins, __m128i rangelen)
{
  __m128i vsub = _mm_sub_epi8(odds, mins);
  __m128i odd_outrange = _mm_cmpgt_epi8(vsub, rangelen);
  // or inrange = cmpgt(rangelen, vsub) I think, for an inclusive range?

  __m128i veven_match = _mm_cmpeq_epi8(evens, even_key);
    // bad only if even matched & odd of range
  __m128i bad = _mm_and_si128(veven_match, odd_outrange);
  return bad;
}

You might call it from a loop over arrays of range vectors. With AVX you'd have free broadcast-load of 32-bit elements; with just SSE4.2 you only have free broadcast-loads of 64-bit elements via SSE3 movddup. For simplicity I've just show using full vectors; that lets them work as memory source operands without even needing a separate load instruction.

#include <immintrin.h>
#include <stdint.h>

struct range {
  int8_t even_key, min, max;
};

struct vrange {
  __m128i even_key, mins, rangelen;
};

// static inline
struct vrange make_vrange(struct range r)
{
// preprocessing to go with our range-check
  struct vrange vr;
  vr.mins =  _mm_set1_epi8( r.min - 0x80);
  vr.rangelen = _mm_set1_epi8( r.max - (r.min-0x80) );
  vr.even_key = _mm_set1_epi8( r.even_key );
  return vr;
}

__m128i odds_out_of_range(__m128i evens, __m128i odds, struct vrange vr)
{
  __m128i vsub = _mm_sub_epi8(odds, vr.mins);
  __m128i odd_outrange = _mm_cmpgt_epi8(vsub, vr.rangelen);
  // or inrange = cmpgt(rangelen, vsub) I think, for an inclusive range?

  __m128i veven_match = _mm_cmpeq_epi8(evens, vr.even_key);
    // bad only if even matched & odd of range
  __m128i bad = _mm_and_si128(veven_match, odd_outrange);
  return bad;
}

__m128i check_ranges(__m128i evens, __m128i odds, struct vrange *vranges, int n)
{
    // peel at least the first iteration so some range vectors can stay in registers
    __m128i problems = odds_out_of_range(evens, odds, vranges[0]);
    // maybe peel a total of 2 or 3
    // If necessary, create dummy accept-everything ranges in case n < 3

    for (int i = 1 ; i<n ; i++){
       __m128i bad = odds_out_of_range(evens, odds, vranges[i]);
       problems = _mm_or_si128(problems, bad);  // accumulate across key/range
    }

    return problems;
}

int find_first_out_of_range(const int8_t *p, size_t len, const struct range *scalar_ranges, int nr)
{
    if (nr > 100) return -2;  // assert not too many ranges for a VLA
    struct vrange vranges[nr];
    for (int i = 0 ; i<nr ; i++) {
        vranges[i] = make_vrange(scalar_ranges[i]);
    }
    // perhaps hoist this preprocessing out of this function?

    for (size_t bytepos = 0 ; bytepos<len ; bytepos+=32){
        __m128i v0 = _mm_loadu_si128((const __m128i*)&p[bytepos+ 0]);
        __m128i v1 = _mm_loadu_si128((const __m128i*)&p[bytepos+16]);

        // split to odds/evens once, reusing these across multiple even_key, odd_range pairs
        __m128i evens = _mm_blendv_epi8(_mm_slli_epi16(v1, 8), v0, _mm_set1_epi16(0x00ff));
        __m128i odds  = _mm_blendv_epi8(v1, _mm_srli_epi16(v0, 8), _mm_set1_epi16(0x00ff));
        __m128i problems = check_ranges(evens, odds, vranges, nr);
        unsigned mask = _mm_movemask_epi8(problems);
        if (mask) {
            if (mask & 0x5555)   // v0
                return bytepos;
            if (mask & 0xaaaa)   // v1
                return bytepos + 16;
        }
    }
    return -1;
}

That full example compiles on Godbolt, but untested other than that. It assumes you have an even number of input vectors to process, although you can use v1 = v0 for a left over solo vector.

It does use memory-source operands for some things, but unfortunately with an indexed addressing mode instead of iterating a pointer. So it'll unlaminate to 2 uops on Sandybridge-family Intel CPUs.


SSE4.2 string instructions (like pcmpestrm) can do multiple range checks, but I don't think they can do it per-element-conditionally on another element. They cost multiple uops (especially the explicit-length versions which are usable with 0 bytes in the data), so probably aren't useful.

In a slightly different case, like if every pair of bytes had to match one of the ranges, you could maybe use pcmpestrm or pcmpestri in 16-bit-element mode, with multiple range-checks. e.g. checking for the range 0x0201-0x0205, or the range 0x0304-0x0306 for each 16-bit chunk, and check that every word was part of a range. (You'd need to byte-reverse within 16-bit elements, using pshufb).

@aqrit posted & deleted an answer attempting that. If they get their idea working and undelete, see their answer. I thought it was interesting enough to be worth a mention as a related case that may not quite work for this one. I think it doesn't work here because the input vector can have an even byte that doesn't match any of the "keys".

Perhaps another pcmpestrm can make a mask of which 8-bit elements match one of the keys, using it's EQUAL_ANY mode. And you can intersect that with another pcmpestrm 16-bit Ranges check which checks each element for being in one of the input ranges. So if an even byte matched any key, the 16-bit element it's part of must be in one of the ranges.

See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for some human-readable description of what the instructions can do, less opaque than the formal documentation.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • `pcmpistrm` could match bad ranges, since there are only 3 bad ranges in the question. However, it is a bit silly. – aqrit May 05 '23 at 21:50
  • @aqrit: The OP commented that "In this specific case, there are about 10 restrictions", so could well be worth it despite the 9 uop cost of `pcmpestrm` on Intel CPUs, with 5c throughput. Or only 7 uops on Zen with 3 cycle throughput. (`pcmpistrm` is much less bad, only 3 uops all for the same port on Intel, but it stops on `0` terminators which might occur in the data. Maybe worth checking that there aren't any and branching?) – Peter Cordes May 05 '23 at 22:13