In your previous question, you said you were using the same ranges over multiple __m128i
inputs. Assuming that's true again, it's probably best to shuffle pairs of __m128i
inputs into even vs. odd halves, so from then on we only need vertical SIMD operations.
If we want them in order, _mm_packus_epi16
is useful (although you have to _mm_and_si128
both inputs). Otherwise interleave them:
// split to odds/evens once, reusing these across multiple even_key, odd_range pairs
evens = _mm_blendv_epi8(_mm_slli_epi16(v1, 8), v0, _mm_set1_epi16(0x00ff));
odds = _mm_blendv_epi8(v1, _mm_srli_epi16(v0, 8), _mm_set1_epi16(0x00ff));
...
tmp = movemask(compare_result);
tmp & 0x5555 // checks elements from v0
tmp & 0xaaaa // checks elements from v1
// Or low/high, costing more setup work
__m128i evenmask = _mm_set1_epi16(0x00ff);
evens = _mm_packus_epi16(_mm_and_si128(v0, evenmask), _mm_and_si128(v1, evenmask));
odds = _mm_packus_epi16(_mm_srli_epi16(v0, 8), _mm_srli_epi16(v1, 8));
...
tmp = movemask(compare_result);
(uint8_t)tmp // checks elements from v0, like cmp al, 0xff to check all-set
tmp>>8 // checks elements from v1, like cmp ah, 0xff or test eax, 0xff00
If you only have 1 vector, you can use v1 = v0
.
For the actual conditional range-checking, use the same setup as your previous question:
// loop-invariant constants, set these up once
__m128i mins = _mm_set1_epi8( min - 0x80);
__m128i rangelen = _mm_set1_epi8( max - (min-0x80) );
__m128i even_key = _mm_set1_epi8( key );
__m128i odds_out_of_range(__m128i evens, __m128i odds, __m128i even_key, __m128i mins, __m128i rangelen)
{
__m128i vsub = _mm_sub_epi8(odds, mins);
__m128i odd_outrange = _mm_cmpgt_epi8(vsub, rangelen);
// or inrange = cmpgt(rangelen, vsub) I think, for an inclusive range?
__m128i veven_match = _mm_cmpeq_epi8(evens, even_key);
// bad only if even matched & odd of range
__m128i bad = _mm_and_si128(veven_match, odd_outrange);
return bad;
}
You might call it from a loop over arrays of range vectors. With AVX you'd have free broadcast-load of 32-bit elements; with just SSE4.2 you only have free broadcast-loads of 64-bit elements via SSE3 movddup
. For simplicity I've just show using full vectors; that lets them work as memory source operands without even needing a separate load instruction.
#include <immintrin.h>
#include <stdint.h>
struct range {
int8_t even_key, min, max;
};
struct vrange {
__m128i even_key, mins, rangelen;
};
// static inline
struct vrange make_vrange(struct range r)
{
// preprocessing to go with our range-check
struct vrange vr;
vr.mins = _mm_set1_epi8( r.min - 0x80);
vr.rangelen = _mm_set1_epi8( r.max - (r.min-0x80) );
vr.even_key = _mm_set1_epi8( r.even_key );
return vr;
}
__m128i odds_out_of_range(__m128i evens, __m128i odds, struct vrange vr)
{
__m128i vsub = _mm_sub_epi8(odds, vr.mins);
__m128i odd_outrange = _mm_cmpgt_epi8(vsub, vr.rangelen);
// or inrange = cmpgt(rangelen, vsub) I think, for an inclusive range?
__m128i veven_match = _mm_cmpeq_epi8(evens, vr.even_key);
// bad only if even matched & odd of range
__m128i bad = _mm_and_si128(veven_match, odd_outrange);
return bad;
}
__m128i check_ranges(__m128i evens, __m128i odds, struct vrange *vranges, int n)
{
// peel at least the first iteration so some range vectors can stay in registers
__m128i problems = odds_out_of_range(evens, odds, vranges[0]);
// maybe peel a total of 2 or 3
// If necessary, create dummy accept-everything ranges in case n < 3
for (int i = 1 ; i<n ; i++){
__m128i bad = odds_out_of_range(evens, odds, vranges[i]);
problems = _mm_or_si128(problems, bad); // accumulate across key/range
}
return problems;
}
int find_first_out_of_range(const int8_t *p, size_t len, const struct range *scalar_ranges, int nr)
{
if (nr > 100) return -2; // assert not too many ranges for a VLA
struct vrange vranges[nr];
for (int i = 0 ; i<nr ; i++) {
vranges[i] = make_vrange(scalar_ranges[i]);
}
// perhaps hoist this preprocessing out of this function?
for (size_t bytepos = 0 ; bytepos<len ; bytepos+=32){
__m128i v0 = _mm_loadu_si128((const __m128i*)&p[bytepos+ 0]);
__m128i v1 = _mm_loadu_si128((const __m128i*)&p[bytepos+16]);
// split to odds/evens once, reusing these across multiple even_key, odd_range pairs
__m128i evens = _mm_blendv_epi8(_mm_slli_epi16(v1, 8), v0, _mm_set1_epi16(0x00ff));
__m128i odds = _mm_blendv_epi8(v1, _mm_srli_epi16(v0, 8), _mm_set1_epi16(0x00ff));
__m128i problems = check_ranges(evens, odds, vranges, nr);
unsigned mask = _mm_movemask_epi8(problems);
if (mask) {
if (mask & 0x5555) // v0
return bytepos;
if (mask & 0xaaaa) // v1
return bytepos + 16;
}
}
return -1;
}
That full example compiles on Godbolt, but untested other than that. It assumes you have an even number of input vectors to process, although you can use v1 = v0
for a left over solo vector.
It does use memory-source operands for some things, but unfortunately with an indexed addressing mode instead of iterating a pointer. So it'll unlaminate to 2 uops on Sandybridge-family Intel CPUs.
SSE4.2 string instructions (like pcmpestrm
) can do multiple range checks, but I don't think they can do it per-element-conditionally on another element. They cost multiple uops (especially the explicit-length versions which are usable with 0
bytes in the data), so probably aren't useful.
In a slightly different case, like if every pair of bytes had to match one of the ranges, you could maybe use pcmpestrm
or pcmpestri
in 16-bit-element mode, with multiple range-checks. e.g. checking for the range 0x0201-0x0205, or the range 0x0304-0x0306 for each 16-bit chunk, and check that every word was part of a range. (You'd need to byte-reverse within 16-bit elements, using pshufb
).
@aqrit posted & deleted an answer attempting that. If they get their idea working and undelete, see their answer. I thought it was interesting enough to be worth a mention as a related case that may not quite work for this one. I think it doesn't work here because the input vector can have an even byte that doesn't match any of the "keys".
Perhaps another pcmpestrm
can make a mask of which 8-bit elements match one of the keys, using it's EQUAL_ANY mode. And you can intersect that with another pcmpestrm
16-bit Ranges check which checks each element for being in one of the input ranges. So if an even byte matched any key, the 16-bit element it's part of must be in one of the ranges.
See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for some human-readable description of what the instructions can do, less opaque than the formal documentation.