SSE4.1 unsigned integer comparison with overflow

Question

Is there any way to perform a comparison like C >= (A + B) with SSE2/4.1 instructions, given that 16-bit unsigned addition (_mm_add_epi16()) can overflow?

The code looks like this:

#define _mm_cmpge_epu16(a, b) _mm_cmpeq_epi16(_mm_max_epu16(a, b), a)

__m128i *a = (__m128i *)&ptr1;
__m128i *b = (__m128i *)&ptr2;
__m128i *c = (__m128i *)&ptr3;
            
__m128i xa = _mm_lddqu_si128(a);
__m128i xb = _mm_lddqu_si128(b);
__m128i xc = _mm_lddqu_si128(c);

__m128i res = _mm_add_epi16(xa, xb);
__m128i xmm3 = _mm_cmpge_epu16(xc, res);

The issue is that when the 16-bit addition overflows (wraps around), the greater-than-or-equal comparison produces false positives. I can't use saturated addition for my purpose. I have looked at the mechanism for detecting overflow of unsigned addition described in "SSE2 integer overflow checking", but how do I use it for a greater-than comparison?
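
To make the false positive concrete, here is a minimal scalar sketch of what a single 16-bit lane does. The function names `ge_sum_exact` and `ge_sum_wrapping` are made up for illustration and do not come from the question; the first widens to 32 bits so the sum cannot wrap, the second mimics one lane of `_mm_add_epi16` followed by an unsigned compare.

```cpp
#include <cstdint>

// Reference result: widen to 32 bits so a + b cannot wrap.
bool ge_sum_exact(uint16_t a, uint16_t b, uint16_t c) {
    return static_cast<uint32_t>(c) >= static_cast<uint32_t>(a) + b;
}

// What one lane of the wrapping SIMD version computes.
bool ge_sum_wrapping(uint16_t a, uint16_t b, uint16_t c) {
    const uint16_t sum = static_cast<uint16_t>(a + b);  // wraps modulo 2^16
    return c >= sum;
}
```

For example, with a = 0xFFFF, b = 2 and c = 5 the wrapped sum is 1, so the wrapping version reports "true" even though the real sum (0x10001) is far larger than c.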

I think you should first check for overflow as per the question you linked. If you do detect an overflow, you know that `C > (A + B)` is false. Otherwise, check that next. Since you are doing vectors, you might have to perform both checks and merge them using bitwise operations. (Edited to fix reversed condition). — Jester, Dec 17 '20 at 15:43
Do you want to check `C > (A+B)` or `C >= (A+B)`? In the first case, I don't see how adding with saturation leads to false positives. — chtz, Dec 17 '20 at 15:53
I think `C-A >= B` (with saturated subtraction) should work (not tested). Edit: No it does not (need to think more about it) — chtz, Dec 17 '20 at 16:05
@chtz: if we know `C` can't be `0xffff`, does it help to do saturating `A+B` then range-shift both C and the sum to signed (by flipping their sign bits with `pxor`) for `pcmpgtw`? But if it has to work for C = 0xffff, same as the saturation result, I don't think that helps. — Peter Cordes, Dec 18 '20 at 06:32
@PeterCordes Yes, and the `C-A >= B` trick would work if one of `B>0` or `C>=A` is guaranteed. (Similar for `C-B >= A`, of course). One could check `C-min(A,B) >= max(A,B)` which would be 5 uops, if I count correctly. — chtz, Dec 18 '20 at 08:45
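
The `C-min(A,B) >= max(A,B)` idea from the last comment could be sketched roughly as follows. This is an untested-in-production illustration, not code from any commenter: the names `cmpge_sum_minmax` and `lane_ge_sum_minmax` are hypothetical, and the `target` attribute assumes GCC or Clang (alternatively, drop it and compile with -msse4.1). The reasoning is that if c < min(a,b) the saturating subtraction yields 0, and max(a,b) must then be nonzero, so the compare is correctly false.

```cpp
#include <cstdint>
#include <smmintrin.h>  // SSE4.1: _mm_min_epu16, _mm_max_epu16

// Per 16-bit unsigned lane: all-ones if c >= a + b (exact sum), else zero.
// Assumption: GCC/Clang; the attribute enables SSE4.1 for this function only.
__attribute__((target("sse4.1")))
__m128i cmpge_sum_minmax(__m128i a, __m128i b, __m128i c) {
    const __m128i lo = _mm_min_epu16(a, b);
    const __m128i hi = _mm_max_epu16(a, b);
    const __m128i diff = _mm_subs_epu16(c, lo);  // saturates to 0 when c < lo
    // diff >= hi  <=>  max(diff, hi) == diff
    return _mm_cmpeq_epi16(_mm_max_epu16(diff, hi), diff);
}

// Hypothetical helper: broadcast scalars to all lanes and read back lane 0.
__attribute__((target("sse4.1")))
bool lane_ge_sum_minmax(uint16_t a, uint16_t b, uint16_t c) {
    const __m128i r = cmpge_sum_minmax(_mm_set1_epi16(static_cast<short>(a)),
                                       _mm_set1_epi16(static_cast<short>(b)),
                                       _mm_set1_epi16(static_cast<short>(c)));
    return _mm_extract_epi16(r, 0) == 0xFFFF;
}
```

That is five vector instructions (min, max, saturating subtract, max, compare), which matches the count given in the comment.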

EOF · Answer 1 · 2020-12-17T16:52:14.340

Here are a few reasonable approaches:

#include <cstdint>
using v8u16 = uint16_t __attribute__((vector_size(16)));

v8u16 lthsum1(v8u16 a, v8u16 b, v8u16 c) {
    return (c >= a) & (c - a >= b);
}

v8u16 lthsum2(v8u16 a, v8u16 b, v8u16 c) {
    return (a + b >= a) & (a + b <= c);
}

You can see how this gets compiled on godbolt. Both approaches are broadly equivalent, and I'm not seeing large changes with -msse4.1 with gcc, but AVX2 and later do improve the code. clang also gets minor improvements with sse4.1 for the second variant. With AVX512BW, clang does pretty well for itself.
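
If you prefer intrinsics over vector extensions, the second variant can be written by hand with SSE2 only. This is a sketch under my own assumptions, not what either compiler emits: the names `cmpge_sum_sse2` and `lane_ge_sum_sse2` are made up, and it relies on the fact that for unsigned lanes x <= y exactly when the saturating difference x - y is zero.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Per 16-bit unsigned lane: all-ones if c >= a + b (exact sum), else zero.
__m128i cmpge_sum_sse2(__m128i a, __m128i b, __m128i c) {
    const __m128i sum = _mm_add_epi16(a, b);          // wraps modulo 2^16
    const __m128i wrapped = _mm_subs_epu16(a, sum);   // nonzero iff sum < a, i.e. overflow
    const __m128i too_big = _mm_subs_epu16(sum, c);   // nonzero iff sum > c
    // Both conditions hold only if both saturating differences are zero.
    return _mm_cmpeq_epi16(_mm_or_si128(wrapped, too_big), _mm_setzero_si128());
}

// Hypothetical helper: broadcast scalars to all lanes and read back lane 0.
bool lane_ge_sum_sse2(uint16_t a, uint16_t b, uint16_t c) {
    const __m128i r = cmpge_sum_sse2(_mm_set1_epi16(static_cast<short>(a)),
                                     _mm_set1_epi16(static_cast<short>(b)),
                                     _mm_set1_epi16(static_cast<short>(c)));
    return _mm_extract_epi16(r, 0) == 0xFFFF;
}
```

Merging the two checks with a single OR before comparing against zero saves one compare relative to evaluating `(a + b >= a)` and `(a + b <= c)` separately.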

score 2 · Accepted Answer · answered Dec 17 '20 at 17:11

You build the missing primitives from what you have available in the instruction set. Here’s one possible implementation, untested. Disassembly.

#include <smmintrin.h> // SSE4.1, needed for _mm_max_epu16

// Compare uint16_t lanes for a >= b
inline __m128i cmpge_epu16( __m128i a, __m128i b )
{
    const __m128i max = _mm_max_epu16( a, b );
    return _mm_cmpeq_epi16( max, a );
}

// Compare uint16_t lanes for c >= a + b, with overflow handling
__m128i cmpgeSum( __m128i a, __m128i b, __m128i c )
{
    // Compute c >= a + b, ignoring overflow issues
    const __m128i sum = _mm_add_epi16( a, b );
    const __m128i ge = cmpge_epu16( c, sum );

    // Detect overflow of a + b
    const __m128i sumSaturated = _mm_adds_epu16( a, b );
    const __m128i sumInRange = _mm_cmpeq_epi16( sum, sumSaturated );

    // Combine the two
    return _mm_and_si128( ge, sumInRange );
}

There is just one corner case to handle: when c is 0xFFFF and a+b overflows. — Kaustubh, Dec 18 '20 at 15:06
Thanks. It was an issue with my test. I'm accepting this answer. — Kaustubh, Dec 19 '20 at 12:56
