3

Is there a fast way to count the number of unique elements in a simd vector (AVX and any SSE) without converting to array? I want to use it in a specific bruteforcer as an optimization so I want it ideally to be as fast as possible.

Currently Iam doing:

// count the number of unique elements
int uniqueCount(v16n a) {
    alignas(16) unsigned char v[16];
    _mm_store_si128((v16n*)v, a);

    int count = 1;
    for(int i = 1; i < 16; i++) {
        int j;
        for(j = 0; j < i; j++)
            if(v[i] == v[j])
                break;

        if(i == j) count++;
    }

    return count;
}
FireTner
  • 65
  • 3
  • For `char` or `short` elements specifically, [SSE4.2 string instructions](https://www.strchr.com/strcmp_and_strlen_using_sse_4.2) like `pcmpestrm` might be a useful building block for comparing each element against all others, to see if there are any matches for it. (But you'd have to exclude it from matching itself, otherwise every element would have a match. Hmm. Between halves of a 32-byte vector?) – Peter Cordes Sep 17 '22 at 20:36
  • Other than that, it's a similar problem to emulating AVX-512 `vpconflictd`: [Fallback implementation for conflict detection in AVX2](https://stackoverflow.com/q/44843518) shuffles and compares each against each, just to find if there are *any* duplicates, without counting them. For 8 elements it takes 8 choose 2 = 28 total compares, so 4 vectors of 8 compares each get that done with some redundancy. `16 choose 2` is 120, or 7.5 vectors of 16 compares each, so that's a lot of work even without trying to count, just detect any. – Peter Cordes Sep 17 '22 at 20:41
  • Perhaps 8 or 15 steps or rotating the vector by 1 (with `palignr`) could get every pairwise comparison done? So you keep the outer loop, but replace the body with vector shuffles / compares. `_mm_sub_epi8` can accumulate the 0 / -1 results from `_mm_cmpeq_epi8` into a count at that element position. – Peter Cordes Sep 17 '22 at 20:44
  • An improvement over that: get two vectors to compare against: original and rotated by 1. Then compare both of those against the original rotated by steps of 2. So it's the same compares, but fewer shuffles. (You do need to rotate back one of the count accumulators to re-align its buckets). – Peter Cordes Sep 18 '22 at 04:38
  • But when there are a set of duplicates, this will count a `n-1` matches in the buckets for each of the duplicates. Clearly the count array requires some processing to get a count of unique elements, and there might be even be ambiguity. e.g. 8 pairs of duplicates for 8 unique elements vs. one set of 4 of the same element (and the other 12 singles) for 13 unique elements? – Peter Cordes Sep 18 '22 at 04:43

1 Answers1

3

Here’s one possible implementation. The code requires SSSE3, SSE 4.1, and slightly benefits from AVX2 when available.

// Count unique bytes in the vector
size_t countUniqueBytes( __m128i vec )
{
    size_t result = 1;
    // Accumulator for the bytes encountered so far, initialize with broadcasted first byte
#ifdef __AVX2__
    __m128i partial = _mm_broadcastb_epi8( vec );
#else
    __m128i partial = _mm_shuffle_epi8( vec, _mm_setzero_si128() );
#endif
    // Permutation vector to broadcast these bytes
    const __m128i one = _mm_set1_epi8( 1 );
    __m128i perm = one;

    // If you use GCC, uncomment following line and benchmark, may or may not help:
    // #pragma GCC unroll 1
    for( int i = 1; i < 16; i++ )
    {
        // Broadcast i-th byte from the source vector
        __m128i bc = _mm_shuffle_epi8( vec, perm );
        perm = _mm_add_epi8( perm, one );
        // Compare bytes with the partial vector
        __m128i eq = _mm_cmpeq_epi8( bc, partial );
        // Append current byte to the partial vector
        partial = _mm_alignr_epi8( bc, partial, 1 );
        // Increment result if the byte was not yet in the partial vector
        // Compilers are smart enough to do that with `sete` instruction, no branches
        int isUnique = _mm_testz_si128( eq, eq );
        result += ( isUnique ? (size_t)1 : (size_t)0 );
    }
    return result;
}
Soonts
  • 20,079
  • 9
  • 57
  • 130
  • To broadcast each byte one at a time, perhaps start with `_mm_setzero_si128()`, and `shuf = _mm_add_epi8(shuf, _mm_set_epi8(1))`. So instead of shift and broadcast the low byte, we add and broadcast a different position. Reduces contention for the shuffle port (especially on Intel CPUs where all these shuffles compete for port 5), and makes two parallel dependency chains for that part of the algo. – Peter Cordes Sep 18 '22 at 16:37
  • @PeterCordes Yeah, probably a good idea even on AMD, because throughput of integer vector addition is 3 (Zen2) or even 4 (Zen3) instructions per clock. Updated. – Soonts Sep 18 '22 at 19:28
  • @PeterCordes BTW, it’s strange how both gcc and clang decided to unroll the complete loop, unless stopped with that `#pragma`. Not sure that was a good choice, IMO too many instructions to decode: https://godbolt.org/z/x3dGd5rd4 – Soonts Sep 18 '22 at 19:55
  • Yeah, I've noticed GCC -O3 being really aggressive with fully-unrolling with `-march=znver*` and also sometimes `-march=skylake`. `-O2` is less aggressive. Interestingly, `clang` (https://godbolt.org/z/M9z7zrvv4) fully unrolls and constant-propagates the shuffle vectors to constants in memory, but only 8 constants like 2,3,2,3,..., and the second 8 steps reuse them. It's doing something with `vpunpcklbw` and `vpunpckhbw` (once each, not per iteration) to make that work. Looks interesting; doing it manually could allow an unroll by 2 and an add with `set1(2)`, for use with the unpackl/h vecs – Peter Cordes Sep 18 '22 at 20:19