
I have two __m256i vectors (each containing chars), and I want to find out if they are completely identical or not. All I need is true if all bits are equal, and false otherwise.

What's the most efficient way of doing that? Here's the code loading the arrays:

const char *a1 = "abcdefhgabcdefhgabcdefhgabcdefhg";
// String literals aren't guaranteed 32-byte aligned, so use the unaligned load:
__m256i r1 = _mm256_loadu_si256((const __m256i *) a1);

const char *a2 = "abcdefhgabcdefhgabcdefhgabcdefhg";
__m256i r2 = _mm256_loadu_si256((const __m256i *) a2);
byteSlayer
    This is probably a duplicate. I should probably have gone searching instead of answering this. – Peter Cordes Nov 12 '17 at 01:00
  • Just wondering, do you get significant performance improvements when using intrinsics (since several posts claim that the compiler can perform most vector optimizations)? – Cpp plus 1 Nov 12 '17 at 01:17
    @Cppplus1 sometimes if you manually change the algorithm to take advantage of these you can get meaningful improvements beyond what the compiler does – byteSlayer Nov 12 '17 at 02:28
    @Cppplus1: auto-vectorization sometimes works well when looping over big arrays, but usually doesn't work for more complicated cases, especially if any shuffling is required. Also, it tends to do a bad job in gcc/clang at least when shuffling to widen or narrow is required. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82356 is just one example of the kind of missed optimization that you often get with gcc auto-vectorization of anything non-trivial to vectorize. – Peter Cordes Nov 12 '17 at 03:05
    @Cppplus1: also, depends what you mean by "most vector optimizations". Good luck getting your compiler to auto-vectorize [parsing an IPv4 address with a lookup-table of `pshufb` shuffle-control vectors](https://stackoverflow.com/questions/31679341/fastest-way-to-get-ipv4-address-from-string). There's a lot of crazy stuff you can do with SIMD that's sometimes worth it, and worth thinking about for your use case, that the compiler is *not* going to do for you. Not even Intel's compiler (which is still better at auto-vectorizing than gcc/clang) – Peter Cordes Nov 12 '17 at 03:08

1 Answer


The most efficient way on current Intel and AMD CPUs is an element-wise comparison for equality, and then check that the comparison was true for all elements.

This compiles to multiple instructions, but they're all cheap and (if you branch on the result) the compare+branch even macro-fuses into a single uop.

#include <immintrin.h>
#include <stdbool.h>

bool vec_equal(__m256i a, __m256i b) {
    __m256i pcmp = _mm256_cmpeq_epi32(a, b);  // epi8 is fine too
    unsigned bitmask = _mm256_movemask_epi8(pcmp);
    return (bitmask == 0xffffffffU);
}

The resulting asm should be vpcmpeqd / vpmovmskb / cmp 0xffffffff / je, which is only 3 uops on Intel CPUs.

vptest is 2 uops and doesn't macro-fuse with jcc, so it's equal to or worse than movmsk / cmp for testing the result of a packed-compare. (See http://agner.org/optimize/.)

Peter Cordes
  • Peter, Since there is no `_mm256_cmpeq_ps()` how could one imitate it on AVX for `float`? – Royi Feb 18 '18 at 09:38
    [`_mm256_cmp_ps(a,b, _CMP_EQ_OQ)`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=1825,5214,433,720&text=cmpps), obviously. Strange that Intel's guide doesn't list any predicates baked in to the intrinsic names for AVX, only for SSE and AVX512. AVX did add a bunch of new predicates, so it would have been a lot of new names, though. I didn't test if any compilers provide a `_mm256_cmpeq_ps`, but some might. Or maybe `_mm256_cmpeq_oq_ps`. – Peter Cordes Feb 18 '18 at 09:46
  • @PeterCordes Is there any reason why _mm256_cmpeq_epi32 is used instead of _mm256_cmpeq_epi8/16/64 in your answer? – hungptit Oct 22 '18 at 04:11
    @hungptit: No, for exact equality it makes no difference what the element granularity is, if you don't care about *where* there was a difference. But `pcmpeqq` was only added in SSE4.1, and its AVX encoding always requires a 3-byte VEX prefix to encode those mandatory prefixes, so dword or smaller element size can give you a shorter instruction. I don't think there are any CPUs that support AVX2 where different element sizes have different performance other than code-size though; but dword element size is a good bet: it's probably never going to be worse than some other size, even on KNL. – Peter Cordes Oct 22 '18 at 04:15
  • @PeterCordes Thanks a lot for the info. – hungptit Oct 22 '18 at 04:22