
PCMPGTQ doesn't exist on SSE2 and doesn't natively work on unsigned integers. Our goal here is to provide backward-compatible solutions for unsigned 64-bit comparisons so we can include them in the WebAssembly SIMD standard.

This is a sister question to this one for ARMv7+NEON: What is the most efficient way to do SIMD unsigned 64 bit comparison (CMHS) on ARMv7 with NEON?

It is also related to the questions about the signed comparison variants, already answered for SSE2 and NEON:

How to simulate pcmpgtq on sse2?

What is the most efficient way to support CMGT with 64bit signed comparisons on ARMv7a with Neon?

Dan Weber
  • If you found a solution for signed numbers: You might invert the upper bits (bit 63) of both unsigned values and then perform a signed comparison. – Martin Rosenau Dec 24 '20 at 20:23
  • Presumably you're going to want to design this primitive so it *can* use AVX512VL `VPCMPUQ k1 {k2}, xmm2, xmm3/m128/m64bcst, imm8` (https://www.felixcloutier.com/x86/vpcmpq:vpcmpuq) if available. That compares into a mask register, not a vector. But if SSE4.2 is available, you'd xor to flip the signs of both inputs for [`pcmpgtq`](https://www.felixcloutier.com/x86/pcmpgtq). It's obviously possible to implement with SSE2, and some HW has it fairly efficiently; does the level of inefficiency for SSE2 really determine whether you choose to include it or not in WebAssembly? – Peter Cordes Dec 24 '20 at 21:36
  • @PeterCordes Is flipping bit 63 with xor, or subtracting the numbers on both inputs, the only way? Or is there another comparison approach that works with unsigned numbers but is different from how we did it last time? – Dan Weber Dec 25 '20 at 00:39
  • @njuffa I'm looking to see if there is an approach for SSE2 that wasn't merely an XOR to the inputs. – Dan Weber Dec 25 '20 at 00:40
  • If you have `pcmpgtq`, then flipping the high bits of both is almost certainly best. With just SSE2, IDK, likely there is something more clever than building more steps to feed your existing pcmpgtq emulation. My point was that a naive starting-point like that is totally fine, as long as users of WebAssembly understand that it might not be efficient everywhere. (Unless there's zero hardware in the world that can do it in one instruction, not even AArch64 with or without SVE; Although even if your model would require AVX512VL to get the result back into a vector, that could be optimized out) – Peter Cordes Dec 25 '20 at 01:33
  • e.g. Agner Fog's VCL library of C++ wrappers provides an orthogonal set of operations for all vector types, including 64-bit unsigned elements. Even though that means some have to be emulated with current ISA extensions. e.g. for this problem: https://github.com/vectorclass/version2/blob/master/vectori128.h#L4536 shows the various implementations of `Vec2uq` `operator >` for SSE2, SSE4.2, AMD XOP, and AVX512VL. Spoiler alert: they do all involve flipping the high bit with XOR, except when single-instruction direct support exists. – Peter Cordes Dec 25 '20 at 01:34
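
As a reference point for the sign-flip approach discussed in the comments above, here is a minimal sketch assuming SSE4.2 is available (so pcmpgtq exists); the function name is mine:

#include <immintrin.h>  // SSE4.2 for _mm_cmpgt_epi64

// Unsigned 64-bit a > b via signed pcmpgtq: flipping bit 63 of both inputs
// maps the unsigned range onto the signed range while preserving order.
static __m128i sse42_cmpgt_epu64(__m128i a, __m128i b) {
    const __m128i topBit = _mm_set1_epi64x((long long)0x8000000000000000ull);
    return _mm_cmpgt_epi64(_mm_xor_si128(a, topBit), _mm_xor_si128(b, topBit));
}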

2 Answers


Translated from Hacker's Delight:

static
__m128i sse2_cmpgt_epu64(__m128i a, __m128i b) {
    // r = ~(a ^ b) & (b - a): where the MSBs match, MSB(b - a) is the borrow, i.e. a > b
    __m128i r = _mm_andnot_si128(_mm_xor_si128(b, a), _mm_sub_epi64(b, a));
    // r |= a & ~b: where the MSBs differ, a > b exactly when a has its MSB set
    r = _mm_or_si128(r, _mm_andnot_si128(b, a));
    // Broadcast the MSB of each 64-bit lane: arithmetic-shift the 32-bit halves by 31,
    // then copy the high dword of each qword over its low dword.
    return _mm_shuffle_epi32(_mm_srai_epi32(r, 31), _MM_SHUFFLE(3,3,1,1));
}

Concept: If mixed "signs" (unsigned MSBs) then return a else return b - a

(MSB(a) ^ MSB(b)) ? a : b - a; // result in MSB

This makes sense:

  • if a's MSB is set and b's isn't, a is unsigned above (so MSB(a) is our result)
  • if b's MSB is set and a's isn't, a is unsigned below (so MSB(a) is our result)
  • if their MSBs are the same, the values are in the same half of the unsigned range, so b-a is effectively a 63-bit subtraction. The MSBs will cancel and the MSB of b-a will be equal to the "borrow" output, which tells you if a is strictly above b. (Like the CF flag for scalar sub: jb is jc.) So MSB(b-a) is our result.

Note that the SIMD andnot/and/or is a bit-blend, but we only care about the MSB. We broadcast it with srai -> shuffle_epi32, discarding the garbage in lower bits. (Or with SSE3, movshdup as described in @Soonts's answer.)
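
To make that concrete, here is a scalar model of the same bit-blend (my illustration, not from Hacker's Delight verbatim; only the MSB of the blended value matters):

#include <stdint.h>

// All-ones if a > b (unsigned), else zero; same selection as the vector code above.
static inline uint64_t cmpgt_u64_scalar(uint64_t a, uint64_t b) {
    uint64_t blended = (~(a ^ b) & (b - a))   // MSBs equal: MSB(b - a) is the borrow, i.e. a > b
                     | (a & ~b);              // MSBs differ: a > b exactly when MSB(a) is set
    return (uint64_t)((int64_t)blended >> 63);   // broadcast the MSB, like srai + shuffle
}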


It differs from signed compare:

(MSB(a) ^ MSB(b)) ? ~a : b - a; // result in MSB

If signs are mixed then the sign of ~a is also the sign of b, of course.
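
For comparison, a sketch of the signed version built the same way (my translation of the formula above; the linked pcmpgtq-on-SSE2 question covers this case properly):

#include <emmintrin.h>

// Signed 64-bit a > b with SSE2 only: (MSB(a) ^ MSB(b)) ? ~a : b - a, result in the MSB.
static __m128i sse2_cmpgt_epi64_sketch(__m128i a, __m128i b) {
    __m128i r = _mm_andnot_si128(_mm_xor_si128(b, a), _mm_sub_epi64(b, a)); // ~(a^b) & (b-a)
    r = _mm_or_si128(r, _mm_andnot_si128(a, _mm_xor_si128(b, a)));          // | ((a^b) & ~a)
    return _mm_shuffle_epi32(_mm_srai_epi32(r, 31), _MM_SHUFFLE(3, 3, 1, 1));
}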

aqrit
  • `(a ^ b) ?` in C is a logical test of the whole result for being non-zero. Using an `_mm_xor` result as a blend control will instead bit-blend, not select between either of two whole results. Is that what this notation means in Hacker's Delight? Anyway, are you sure that works, and you don't need to `pcmpeqq` (or equivalent) against zero first? – Peter Cordes Dec 26 '20 at 22:11
  • Oh right, you're broadcasting the sign bit after selecting, so you only care about blending the top bit. And your `a^b` condition is actually `MSB(a) ^ MSB(b)`, not whether the whole numbers are equal at all bit-positions (which would make the whole 64-bit xor == 0). This is *not* what the expression would mean in C. – Peter Cordes Dec 26 '20 at 22:18
  • Edited to include a more detailed explanation of how/why it works. – Peter Cordes Dec 26 '20 at 22:27

Here you go.

__m128i cmpgt_epu64_sse2( __m128i a, __m128i b )
{
    // Compare uint32_t lanes for a > b and a < b
    const __m128i signBits = _mm_set1_epi32( 0x80000000 );
    a = _mm_xor_si128( a, signBits );
    b = _mm_xor_si128( b, signBits );
    __m128i gt = _mm_cmpgt_epi32( a, b );
    __m128i lt = _mm_cmpgt_epi32( b, a );

    // It's too long to explain why, but the result we're after is equal to ( gt > lt ) for uint64_t lanes of these vectors.
    // Unlike the source numbers, lt and gt vectors contain a single bit of information per 32-bit lane.
    // This way it's much easier to compare them with SSE2.

    // Clear the highest bit to avoid overflows of _mm_sub_epi64.
    // _mm_srli_epi32 by any number of bits in [ 1 .. 31 ] would work too, only slightly slower.
    gt = _mm_andnot_si128( signBits, gt );
    lt = _mm_andnot_si128( signBits, lt );

    // Subtract 64-bit integers; we're after the sign bit of the result.
    // ( gt > lt ) is equal to extractSignBit( lt - gt )
    // The above is only true when ( lt - gt ) does not overflow, that's why we can't use it on the source numbers.
    __m128i res = _mm_sub_epi64( lt, gt );

    // Arithmetic shift to broadcast the sign bit into higher halves of uint64_t lanes
    res = _mm_srai_epi32( res, 31 );
    // Broadcast higher 32-bit lanes into the final result.
    return _mm_shuffle_epi32( res, _MM_SHUFFLE( 3, 3, 1, 1 ) );
}

Here’s a test app.
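
If you only want a quick sanity check, a minimal sketch of mine (not the linked test app) that exercises cmpgt_epu64_sse2 on MSB boundary values against plain scalar comparison, on x86-64:

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

// Assumes cmpgt_epu64_sse2 from above is in scope.
int main( void )
{
    // Values that hit all the MSB cases: both clear, both set, and mixed.
    const uint64_t vals[] = { 0, 1, 0x7FFFFFFFFFFFFFFFull,
                              0x8000000000000000ull, 0xFFFFFFFFFFFFFFFFull };
    const int n = (int)( sizeof( vals ) / sizeof( vals[ 0 ] ) );
    for( int i = 0; i < n; i++ )
        for( int j = 0; j < n; j++ )
        {
            __m128i r = cmpgt_epu64_sse2( _mm_set1_epi64x( (long long)vals[ i ] ),
                                          _mm_set1_epi64x( (long long)vals[ j ] ) );
            uint64_t got = (uint64_t)_mm_cvtsi128_si64( r );   // low lane
            uint64_t expect = vals[ i ] > vals[ j ] ? ~0ull : 0;
            if( got != expect )
                printf( "mismatch at %d,%d\n", i, j );
        }
    puts( "done" );
    return 0;
}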

If SSE3 is available, movshdup is also a good option instead of the pshufd (_mm_shuffle_epi32) to copy the srai result to the low dword in each element. (Or optimize that away if the next use is movmskpd or something else which only depends on the high part of each qword).
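
Concretely, that swap would only change the final shuffle; a sketch (the casts are free reinterprets, and _mm_movehdup_ps is the SSE3 movshdup intrinsic):

#include <pmmintrin.h>  // SSE3

// Same dword pattern as _mm_shuffle_epi32( res, _MM_SHUFFLE( 3, 3, 1, 1 ) ):
// movshdup copies each odd (high) dword over the even dword below it.
static __m128i broadcast_high_dwords_sse3( __m128i res )
{
    return _mm_castps_si128( _mm_movehdup_ps( _mm_castsi128_ps( res ) ) );
}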

For example, on Conroe/Merom (first gen Core 2, SSSE3 and most SIMD execution units are 128-bit wide but the shuffle unit has limitations), pshufd is 2 uops, 3 cycle latency (flt->int domain). movshdup is only 1 uop, 1 cycle latency because its hard-wired shuffle is only within each 64-bit half of a register. movshdup runs in the "SIMD-int" domain so it doesn't cause any extra bypass latency between integer shift and whatever integer thing you do next, unlike pshufd. (https://agner.org/optimize/)

If you're JITing, you'd only ever be using this on CPUs without SSE4.2, which means Intel before Nehalem, AMD before Bulldozer. Note that psubq (_mm_sub_epi64) is somewhat slower than narrower psub on some of those CPUs, but it's still the best option.

For completeness, here’s an SSSE3 version (not quite the same thing as SSE3); it saves a few instructions at the cost of a constant load. The only way to find out whether it’s faster or slower is to test on old computers.

__m128i cmpgt_epu64_ssse3( __m128i a, __m128i b )
{
    // Compare uint32_t lanes for a > b and a < b
    const __m128i signBits = _mm_set1_epi32( 0x80000000 );
    a = _mm_xor_si128( a, signBits );
    b = _mm_xor_si128( b, signBits );
    __m128i gt = _mm_cmpgt_epi32( a, b );
    __m128i lt = _mm_cmpgt_epi32( b, a );

    // Shuffle bytes making two pairs of equal uint32_t values to compare.
    // Each uint32_t combines two bytes from lower and higher parts of the vectors.
    const __m128i shuffleIndices = _mm_setr_epi8(
        0, 4, -1, -1,
        0, 4, -1, -1,
        8, 12, -1, -1,
        8, 12, -1, -1 );
    gt = _mm_shuffle_epi8( gt, shuffleIndices );
    lt = _mm_shuffle_epi8( lt, shuffleIndices );
    // Make the result
    return _mm_cmpgt_epi32( gt, lt );
}
Soonts
  • Is this just [How to simulate pcmpgtq on sse2?](https://stackoverflow.com/q/65166174) with sign bits flipped? I guess not quite; you're flipping the sign bits of each *32*-bit chunk, not 64. (The 64-bit signed compare would have to range-shift the low halves of each qword from unsigned to signed.) So there is some saving vs. flipping just the high bits and feeding a signed compare, which compilers might or might not have optimized away for us. – Peter Cordes Dec 25 '20 at 19:40
  • psrai res,31 => pshufd might be better than srli / psubq. `pshufd` is a copy-and-shuffle and you wouldn't need a zero to sub from. (OTOH, some old CPUs without SSE4.2 have slow shuffle units as described in [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026) making pshufb slower. But some old CPUs also have slow psubq, IIRC; have to check on that.) – Peter Cordes Dec 25 '20 at 20:05
  • @PeterCordes Good idea, updated. `pshufd` is a 1-cycle latency instruction on all recent CPUs. That 32-bit arithmetic shift, `psrad`, is not; before Skylake and Ryzen it had 2 cycles latency. However, it still makes sense to do that because what I wrote about `psrad` equally applies to `psrlq`, and `psrad` saves the `pxor` instruction used by `_mm_setzero_si128` – Soonts Dec 25 '20 at 20:56
  • Remember the real goal of this question is a *JIT* fallback when SSE4.2 is not available, otherwise you do 2x pxor / pcmpgtq. Given that WebAssembly is JITed, there's no chance of this being used on Sandybridge-family or Bulldozer / Zen. (Unless someone else uses it for ahead-of-time compilation without dynamic dispatching, or someone configures a VM badly to not advertise SSE4 to a guest...) – Peter Cordes Dec 26 '20 at 01:30
  • Also, maybe you're looking at `psrad` with a vector count? `psrad (xmm, i8)` has 1c latency on Conroe and Sandybridge. (https://uops.info). Or if you're looking at K8/K10 and Bulldozer-family, *all* SIMD instructions have at least 2c latency. (https://agner.org/optimize/). `pshufd` isn't too bad on really old Intel (Conroe 2 uops) or AMD (K8), and psubq is 2 uops on Conroe. – Peter Cordes Dec 26 '20 at 01:37
  • @PeterCordes Not sure `movshdup` is good. These old CPUs have a couple cycles of cross-domain latency. One shouldn’t use float instructions to process integers or vice versa; the vectors need to pass between the int and FP EUs, and on old CPUs that takes time. – Soonts Dec 26 '20 at 03:13
  • Also, for completeness, here’s an SSSE3 version, might be slightly faster, I don’t know if that’s the case. https://gist.github.com/Const-me/5999328a743128a86f7f5a93d07f2463 – Soonts Dec 26 '20 at 03:13
  • Oh that's fun, yeah worth adding to your answer. Looks good for Wolfdale/Penryn if loading another constant doesn't cache-miss. The first constant probably already gets loaded, although JIT could choose to construct it on the fly in 2 insns. (Have to benchmark Conroe/Merom; 4-uop pshufb probably sucks, but can decode it along with others. K10 and K8 lack SSSE3. Bobcat has 6-uop pshufb vs. 2-uop movshdup and psubq (same as everything else with its half-width execution units). In-order Atom has slow pshufb, but also slow psubq, and it's a decode stall so any multi-uop insn is a big cost.) – Peter Cordes Dec 26 '20 at 03:54