
Consider a 256-bit register containing four 64-bit integers. Is it possible in AVX/AVX2 to efficiently test whether any two of these integers are equal?

E.g.:

a) {43, 17, 25, 8}: the result must be false because no 2 of the 4 numbers are equal.

b) {47, 17, 23, 17}: the result must be true because the number 17 occurs 2 times in the AVX vector register.

I'd like to do this in C++, if possible, but I can drop down to assembly if necessary.

Cody Gray - on strike
Serge Rogatch
  • What's the threshold for "efficiently"? It can be done, but it's quite annoying – harold Jun 15 '17 at 19:21
  • @harold, the threshold is that it must be faster than the same accomplished without SIMD (sequentially). – Serge Rogatch Jun 15 '17 at 19:24
  • 3
    Incidentally this is called "Conflict Detection" and the next thing after AVX2, called AVX-512, has an extension called CDI for just this purpose. Specifically, you want `vpconflictq`. – Iwillnotexist Idonotexist Jun 15 '17 at 20:20
  • @IwillnotexistIdonotexist , that's great news and I also need it for conflict detection: can't perform vector operation if some targets share array index. Instead I have to do them one after another. – Serge Rogatch Jun 15 '17 at 20:32

1 Answer


With AVX512 (AVX512VL + AVX512CD), you would use VPCONFLICTQ, which is designed for this purpose.


For AVX2:

Shaved off a couple of operations by doing fewer redundant comparisons:

int test1(__m256i x)
{
    __m256i x0 = _mm256_permute4x64_epi64(x, 0x4B);
    // shuffled: 1 0 2 3   (element indices, high to low)
    // original: 3 2 1 0   -> compares pairs {0,3},{1,2},{0,2},{1,3}
    __m256i e0 = _mm256_cmpeq_epi64(x0, x);
    __m256i x1 = _mm256_shuffle_epi32(x, 0x4E);
    // shuffled: 2 3 0 1
    // original: 3 2 1 0   -> compares pairs {2,3},{0,1}
    __m256i e1 = _mm256_cmpeq_epi64(x1, x);
    // together e0 and e1 cover all six distinct pairs
    __m256i t = _mm256_or_si256(e0, e1);
    return !_mm256_testz_si256(t, _mm256_set1_epi32(-1));
}

Previously:

A simple "compare everything with everything" approach can be used with some shuffles, something like this (not tested):

int hasDupe(__m256i x)
{
    __m256i x1 = _mm256_shuffle_epi32(x, 0x4E);     // swap 64-bit halves within each 128-bit lane
    __m256i x2 = _mm256_permute4x64_epi64(x, 0x4E); // swap the two 128-bit lanes
    __m256i x3 = _mm256_shuffle_epi32(x2, 0x4E);    // both swaps combined
    // shuffled: 2 3 0 1   (element indices, high to low)
    // original: 3 2 1 0   -> compares pairs {2,3},{0,1}
    __m256i e0 = _mm256_cmpeq_epi64(x1, x);
    // shuffled: 1 0 3 2
    // original: 3 2 1 0   -> compares pairs {1,3},{0,2}
    __m256i e1 = _mm256_cmpeq_epi64(x2, x);
    // shuffled: 0 1 2 3
    // original: 3 2 1 0   -> compares pairs {0,3},{1,2}
    __m256i e2 = _mm256_cmpeq_epi64(x3, x);
    __m256i t0 = _mm256_or_si256(_mm256_or_si256(e0, e1), e2);
    return !_mm256_testz_si256(t0, _mm256_set1_epi32(-1));
}

GCC 7 compiles this to reasonable code, but Clang does really strange things: it seems to think `vpor` has no 256-bit version (which it totally does). Changing the ORs to additions does the same job in this case (each compare mask is 0 or -1 per lane, and a sum of up to three such masks is zero only if all of them are zero) and doesn't cause trouble with Clang (also not tested):

int hasDupe(__m256i x)
{
    __m256i x1 = _mm256_shuffle_epi32(x, 0x4E);     // swap 64-bit halves within each 128-bit lane
    __m256i x2 = _mm256_permute4x64_epi64(x, 0x4E); // swap the two 128-bit lanes
    __m256i x3 = _mm256_shuffle_epi32(x2, 0x4E);    // both swaps combined
    // shuffled: 2 3 0 1   (element indices, high to low)
    // original: 3 2 1 0   -> compares pairs {2,3},{0,1}
    __m256i e0 = _mm256_cmpeq_epi64(x1, x);
    // shuffled: 1 0 3 2
    // original: 3 2 1 0   -> compares pairs {1,3},{0,2}
    __m256i e1 = _mm256_cmpeq_epi64(x2, x);
    // shuffled: 0 1 2 3
    // original: 3 2 1 0   -> compares pairs {0,3},{1,2}
    __m256i e2 = _mm256_cmpeq_epi64(x3, x);
    // "OR" the results via addition, workaround for Clang being weird
    __m256i t0 = _mm256_add_epi64(_mm256_add_epi64(e0, e1), e2);
    return !_mm256_testz_si256(t0, _mm256_set1_epi32(-1));
}
harold
  • So it's 8 vector operations, while naive solution for pairwise comparison would require 4*3/2 = 6 comparisons... But the number of operations in the latter may be substantially more than 6 in assembly because of array offsetting. While vector operations almost map to the same in assembly, I guess. So it needs measurements. I've moved your AVX512 solution to the top, if you don't mind. – Serge Rogatch Jun 16 '17 at 09:35
  • @SergeRogatch probably a more significant difference is that the results of the comparisons have to be used/merged somehow - for example `setcc` them and `or` them all together (and then branch on it I guess?), which would add 6 `setcc`'s and either 5 or 2 `or`'s (2 if cheating with high-byte registers, so it also costs 3 extra recombination µops, no free lunch there). That about doubles the instruction count for a scalar version – harold Jun 17 '17 at 14:49