0

I have a case to compare two 128-bit unsigned long long a, b on my computer (i7-11700). I need to find out whether a is greater than or equal to b or not. (a >= b) I try to use AVX2 first. I divide the 128-bit value into four part. ( e.g. a -> a1 a2 a3 a4 , b -> b1 b2 b3 b4 ) Each part is 32-bit of 128-bit. With AVX2, I try with the following code.

maxVal = _mm256_set1_epi32(0xFFFFFFFF);
stop = _mm256_set1_epi32(0);
maskgt = _mm256_cmpgt_epu32_mask(a1, b1);
maskeq = _mm256_cmpeq_epi32_mask(a1, b1);
maskbl = _mm256_or_si256(maskeq, stop);
maskge = _mm256_blendv_epi8(maskgt, maskge, maskbl);
stop = _mm256_blendv_epi8(maxVal, stop, maskbl);
maskgt = _mm256_cmpgt_epu32(a2, b2);
maskeq = _mm256_cmpeq_epi32(a2, b2);
maskbl = _mm256_or_si256(maskeq, stop);
maskge = _mm256_blendv_epi8(maskgt, maskge, maskbl);
stop = _mm256_blendv_epi8(maxVal, stop, maskbl);
maskgt = _mm256_cmpgt_epu32(a3, b3);
maskeq = _mm256_cmpeq_epi32(a3, b3);
maskbl = _mm256_or_si256(maskeq, stop);
maskge = _mm256_blendv_epi8(maskgt, maskge, maskbl);
stop = _mm256_blendv_epi8(maxVal, stop, maskbl);
maskgt = _mm256_cmpgt_epu32(a4, b4);
maskeq = _mm256_cmpeq_epi32(a4, b4);
maskbl = _mm256_or_si256(maskeq, stop);
maskge = _mm256_blendv_epi8(maskgt, maskge, maskbl);

Is there a good way to do the comparison with AVX512? I'm confused with the mask registers. Thanks!

  • You know you can compare in 64-bit chunks, right? `pcmpgtq` (`epi64`) has existed since SSE4.2. Also, compare-into-mask (instead of into a `__m256i` vector) already is AVX-512, so is unsigned integer compares. With AVX2, you only had equal and signed-greater. Are you doing this for arrays of 128-bit integers, or a single 128-bit integer? – Peter Cordes Jul 28 '22 at 08:22
  • @PeterCordes I'm sorry to have confused you. I define unsigned-greater by unsigned-max and equal. And there are arrays of 128-bit intergers need to be compared. – chihovrflo Jul 28 '22 at 08:32
  • Do you need to create a mask or use the comparison result in a conventional branch? Also, note that for integers `(a >= b) == !(b > a)`, you may use this to avoid having to do `pcmpgt`+`pcmpeq`. – Andrey Semashev Jul 28 '22 at 11:42
  • @AndreySemashev I need to create a mask for the following code to know how to update the value in some registers. I compare the four 32-bit separately. The result of each part can't determine the whole 128-bit 's relationship. Therefore, I use stop as a mask to help me get the result. It seems that I still need the result of gt and eq. – chihovrflo Jul 28 '22 at 14:52
  • The code in your question says it's AVX2, but you use intrinsics like `_mm256_cmpgt_epu32_mask` which [require AVX512](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE,SSE2,SSE3,SSSE3,SSE4_2,AVX2,AVX_512,Other&text=_mm256_cmpgt_epu32_mask&ig_expand=1062). Oh, I see, but then the rest of your question uses `_mm256_cmpgt_epu32` which isn't a real intrinsic, something you're defining yourself so this actually costs even more instructions. Probably the `_mask` compares were not your real code, since the rest of it is all AVX2 that needs vectors not `__mmask8`. – Peter Cordes Jul 28 '22 at 16:29
  • See https://stackoverflow.com/q/29742844/634919 for a related question on ARM64, which may give you some ideas. In particular, depending on how many times this has to loop, SIMD may or may not be the right tool. In scalar code, a 128-bit comparison is just two instructions, e.g. `sub rax, [num] ; sbb rdx, [num+8]` and check the carry flag for the result. – Nate Eldredge Aug 05 '22 at 15:24

1 Answers1

2

If you only have two 16 bytes vectors to compare, you don’t need any AVX for that, and the comparison only takes 6 instructions. Here’s the function which returns a boolean, the conditional operator compiles into conditional move i.e. branchless. The code only requires SSE2 which is universally available on 64-bit PCs.

// Compare unsigned 128-bit numbers a >= b
inline bool cmpge_u128( __m128i a, __m128i b )
{
    // Compare unsigned bytes for both a <= b and a >= b
    __m128i i = _mm_min_epu8( a, b );
    __m128i le = _mm_cmpeq_epi8( i, a );
    __m128i ge = _mm_cmpeq_epi8( i, b );
    // Move bitmaps to scalar registers
    uint32_t m1 = (uint32_t)_mm_movemask_epi8( le );
    uint32_t m2 = (uint32_t)_mm_movemask_epi8( ge );
    // Compute the result
    return m2 >= m1;
}

If you have many vectors and want 16 bytes results, it’s a bit more complicated. Here’s AVX2 example, probably portable to wider 64 bytes vectors. It returns 32 bytes vector with 16 bytes lanes set according to the result of the comparison.

// Compare two unsigned 128-bit numbers a >= b
inline __m256i cmpge_epu128( __m256i a, __m256i b )
{
    // Compare uint32_t lanes for both a <= b and a >= b
    __m256i tmp = _mm256_min_epu32( a, b );
    __m256i le = _mm256_cmpeq_epi32( tmp, a );
    __m256i ge = _mm256_cmpeq_epi32( tmp, b );
    
    // Shuffle bytes to gather results for complete 16-byte pieces
    // The le/ge vectors then have identical values in uint32_t lanes
    __m256i perm = _mm256_set1_epi32( 0x0C080400 );
    le = _mm256_shuffle_epi8( le, perm );
    ge = _mm256_shuffle_epi8( ge, perm );

    // Compare uint32_t lanes for ge >= le
    tmp = _mm256_min_epu32( le, ge );
    return _mm256_cmpeq_epi32( tmp, le );
}

However, I don’t have any experience with AVX512. It’s possible someone else can add another answer with a better version leveraging AVX512 instructions not available in AVX2.

Soonts
  • 20,079
  • 9
  • 57
  • 130
  • Thx for ur reply. I take some time to understand ur code and the idea helps me think a lot. Actually, I have a batch of unsigned 128-bit numbers need to compare with a certain vector B. Therefore, I try to pack them as A then using AVX512 (or AVX2) to gain the advantage of parallelism. – chihovrflo Jul 28 '22 at 13:41
  • @chihovrflo The key question for you to consider, what do you want to do with the results. My parallelized AVX2 version outputs a vector of 256 bits, which only contains 2 bits of data: the other 254 bits in the vector are redundant. – Soonts Jul 28 '22 at 14:05
  • In the code I showed upon, I use blend with the final mask to determine another register to update the vector with value1 or value2. So it may looks like `if (A>=B) C=D else C=E ` – chihovrflo Jul 28 '22 at 14:36
  • @chihovrflo In this case, you do need a vector with all these redundant bits, like the one computed by the AVX2 function above. Use `_mm256_blendv_epi8` to select afterwards. – Soonts Jul 28 '22 at 15:13
  • @chihovrflo: AVX512 can blend on a bitmap; it would likely be able to do better than this. You just need a bitmap with 2 or 4 bits per int128 to use with qword or dword blends. If your question showed what you're actually trying to do (blend int128 arrays based on a compare of two other int128 arrays) (and preferably with valid AVX2), I might have a look at answering it. – Peter Cordes Jul 29 '22 at 20:08
  • As I mentioned, a batch of uint128 need to be compared with a certain uint128. Due to the result of comparison, each uint128 can get a different value. I try to use AVX2 first to optimize it. Because we don't have cmp instructions for 128bit, I devide one 128bit to two 64bit. But the unsigned cmp still bother me. I then try devide 1 128bit to 4 32bit (like the code showed upon). Although it costs more instrcutions, I deal with 8 128bit in each iteration. But the performance still not good enough, so I try to use AVX512 to do the similar things(I'm unfamiliar with it). Thx for u guys' sharing. – chihovrflo Jul 30 '22 at 07:57