8

I need a way to compare values of type __m128i in C++ for a total order between any values of type __m128i. The type of order doesn't matter as long as it establishes a total order between all values of type __m128i. Hence the comparison might be less-than between 128-bit integers or something else entirely as long as is provides a total order.

I tried using the < operator, but that didn't return a bool, but instead seems to compare the vector components of __m128i (i.e. SIMD):

#include <emmintrin.h>

inline bool isLessThan(__m128i a, __m128i b) noexcept {
     // error: cannot convert '__vector(2) long int' to 'bool' in return
     return a < b;
}

Another possibility would be to use memcmp/strcmp or similar, but this would very likely not be optimal. Targeting modern Intel x86-64 CPUs with at least SSE4.2 and AVX2, are there any intrinsics / instructions I could use for such comparisons? How to do it?

PS: Similar questions have been asked for checking equality but not for ordering:

jotik
  • 17,044
  • 13
  • 58
  • 123
  • 3
    An easy solution (with gcc/clang) would be to convert to `__int128` and compare with `<`: https://godbolt.org/z/Kh0Yd5. Not necessarily the most efficient (but this also depends on whether your inputs typically come from registers or memory) – chtz May 28 '19 at 13:42

1 Answers1

5

Here you go.

inline bool isLessThan( __m128i a, __m128i b )
{
    /* Compare 8-bit lanes for ( a < b ), store the bits in the low 16 bits of the
       scalar value: */
    const int less = _mm_movemask_epi8( _mm_cmplt_epi8( a, b ) );

    /* Compare 8-bit lanes for ( a > b ), store the bits in the low 16 bits of the
       scalar value: */
    const int greater = _mm_movemask_epi8( _mm_cmpgt_epi8( a, b ) );

    /* It's counter-intuitive, but this scalar comparison does the right thing.
       Essentially, integer comparison searches for the most significant bit that
       differs... */
    return less > greater;
}

The order is less than ideal ’coz pcmpgtb treats these bytes as signed integers, but you said it’s not important for your use case.


Update: here’s a slightly slower version with uint128_t sort order.

// True if a < b, for unsigned 128 bit integers
inline bool cmplt_u128( __m128i a, __m128i b )
{
    // Flip the sign bits in both arguments.
    // Transforms 0 into -128 = minimum for signed bytes,
    // 0xFF into +127 = maximum for signed bytes
    const __m128i signBits = _mm_set1_epi8( (char)0x80 );
    a = _mm_xor_si128( a, signBits );
    b = _mm_xor_si128( b, signBits );

    // Now the signed byte comparisons will give the correct order
    const int less = _mm_movemask_epi8( _mm_cmplt_epi8( a, b ) );
    const int greater = _mm_movemask_epi8( _mm_cmpgt_epi8( a, b ) );
    return less > greater;
}

We build unsigned compares out of signed by range-shifting the unsigned inputs to signed (by flipping the high bit = subtract 128).

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Soonts
  • 20,079
  • 9
  • 57
  • 130
  • I tried this, but it generates god awful assembly doesn't it? Or is that to be expected? – Passer By May 28 '19 at 17:52
  • You can use `-Gv` with MSVC so it uses `__vectorcall`, so your source is still compilable with GCC. (And won't override clang's x86-64 System V calling convention.) – Peter Cordes May 28 '19 at 21:07
  • 1
    Can we do anything useful with `pcmpgtq` and then shuffle + mask the low element with the high element? That's probably more expensive though; `pmovmskb` or `movmskpd` are cheap, at worst 2 uops on bulldozer-family. (The high latency on Bulldozer family for xmm->int hurts equally either way, and both can happen in parallel.) – Peter Cordes May 28 '19 at 21:22