
Please help me out, I can't figure this out myself:

I have a __m128i SIMD vector in which each of the 16 bytes contains one of the following values:

1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1

Is it possible to somehow transform this vector so that all the ones are removed, and each zero is replaced by the index of that element within the vector? That is, like this:

index:  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
value:  1   0   1   1   0   1   0   1   1   1   0   1   0   1   0   1
zeros:      1           4       6               10      12      14

And finally get a vector with only these values:

1  4  6  10  12  14

What logic could be used to obtain such a result? Which SIMD instructions should be used?

PS: I'm just starting to learn SIMD, so there's a lot I don't know or understand yet.

Optimus1
  • It's possible. Is AVX512VBMI2 available? That would make it very easy. Without that it's also possible but not as easy. – harold May 03 '22 at 10:53
  • @harold, Unfortunately AVX512 is not available. – Optimus1 May 03 '22 at 10:55
  • So if I get this correctly, you want to zero out the entire vector, but before that, use the position numbers of the initial zeros to form a new vector with these position numbers? – Arkoudinos May 03 '22 at 10:56
  • What SIMD instruction-set do you have available? x86-64 with AVX2? I guess some kind of x86 as opposed to AArch64 ASIMD, since you said `__m128i`. – Peter Cordes May 03 '22 at 10:56
  • @Arkoudinos, form a new vector with the positions of these zeros. – Optimus1 May 03 '22 at 10:58
  • @Peter Cordes, AVX2 is available. Intel i7 processor. – Optimus1 May 03 '22 at 10:59
  • Can you assume fast BMI2 `pdep` / `pext`? (Intel Haswell, or AMD Zen3). You might want BMI2 `pext` for the left-pack operation 8 bytes at a time, similar to [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) - without AVX512, you don't have a SIMD left-pack, and a table of 2^16 `__m128i` shuffle masks for `pshufb` would obviously be terrible. (Or I guess just a `__m128i` of final results after constant-propagation). But anyway, 64K x 16-bytes would be a huge lookup table that would miss in cache pretty much every time. – Peter Cordes May 03 '22 at 11:02
  • If it's just for your own use on your own CPU, then yeah an Intel i7 with AVX2 will have fast `pext`, too, so you can left-pack 8 bytes or 16 nibbles. – Peter Cordes May 03 '22 at 11:03
  • Is the `1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1` vector known at compile-time (or used multiple times)? Or how do you generate that vector? I.e., is it possible to compute it in a different format? Do you need the intermediate result of the masked vector or just the end result? – chtz May 03 '22 at 11:13
  • @chtz, the data in the vector is unknown at compile time - this data appears later after some SIMD operations. – Optimus1 May 03 '22 at 12:22
  • It's a bit odd to have `0` and `1`, rather than the usual `0` / `0xff` from compare results. Note that Soonts's answer starts by turning your `1`s into `0xff`s with a compare. If your previous code naturally produces 0/1 instead of 0/-1, that's fine, but if you're doing extra work to get that, don't. – Peter Cordes May 04 '22 at 03:06

3 Answers


If you have BMI2, use the following version.

#include <immintrin.h>
#include <bit>       // std::popcount (C++20)
#include <cstdint>

__m128i compressZeroIndices_bmi2( __m128i v )
{
    const __m128i zero = _mm_setzero_si128();
    // Replace zeros with 0xFF
    v = _mm_cmpeq_epi8( v, zero );

    // Extract low/high pieces into scalar registers for PEXT instruction
    uint64_t low = (uint64_t)_mm_cvtsi128_si64( v );
    uint64_t high = (uint64_t)_mm_extract_epi64( v, 1 );

    // Count payload bytes in the complete vector
    v = _mm_sub_epi8( zero, v );
    v = _mm_sad_epu8( v, zero );
    v = _mm_add_epi64( v, _mm_srli_si128( v, 8 ) );
    v = _mm_shuffle_epi8( v, zero );
    // Make a mask vector filled with 0 for payload bytes, 0xFF for padding
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    v = _mm_max_epu8( v, identity );
    __m128i mask = _mm_cmpeq_epi8( v, identity );

    // The following line requires C++20 for std::popcount.
    // If you don't have it, use #ifdef _MSC_VER to switch between __popcnt64() and _popcnt64() intrinsics.
    uint64_t lowBits = std::popcount( low );
    // Use BMI2 to gather these indices
    low = _pext_u64( 0x0706050403020100ull, low );
    high = _pext_u64( 0x0F0E0D0C0B0A0908ull, high );

    // Merge payload into a vector
    v = _mm_cvtsi64_si128( low | ( high << lowBits ) );
    v = _mm_insert_epi64( v, high >> ( 64 - lowBits ), 1 );

    // Apply the mask to set unused elements to -1, enables pmovmskb + tzcnt to find the length
    return _mm_or_si128( v, mask );
}

Here’s another version without BMI2. Probably slower on most CPUs, but the code is way simpler, and doesn’t use any scalar instructions.

inline __m128i sortStep( __m128i a, __m128i perm, __m128i blend )
{
    // The min/max are independent and their throughput is 0.33-0.5 cycles,
    // so this whole function only takes 3 (AMD) or 4 (Intel) cycles to complete
    __m128i b = _mm_shuffle_epi8( a, perm );
    __m128i i = _mm_min_epu8( a, b );
    __m128i ax = _mm_max_epu8( a, b );
    return _mm_blendv_epi8( i, ax, blend );
}

__m128i compressZeroIndices( __m128i v )
{
    // Replace zeros with 0-based indices, ones with 0xFF
    v = _mm_cmpgt_epi8( v, _mm_setzero_si128() );
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    v = _mm_or_si128( v, identity );

    // Sort bytes in the vector with a network
    // https://demonstrations.wolfram.com/SortingNetworks/
    // Click the "transposition" algorithm on that demo
    const __m128i perm1 = _mm_setr_epi8( 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14 );
    const __m128i blend1 = _mm_set1_epi16( (short)0xFF00 );
    const __m128i perm2 = _mm_setr_epi8( 0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 15 );
    const __m128i blend2 = _mm_setr_epi8( 0, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0 );
    for( size_t i = 0; i < 8; i++ )
    {
        v = sortStep( v, perm1, blend1 );
        v = sortStep( v, perm2, blend2 );
    }
    return v;
}

P.S. If you want the length of the output vector, use this function:

uint32_t vectorLength( __m128i v )
{
    uint32_t mask = (uint32_t)_mm_movemask_epi8( v );
    mask |= 0x10000;
    return _tzcnt_u32( mask );
}
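
As a quick sanity check, here's a minimal usage sketch of my own (not part of the answer above) that runs the question's example vector through the BMI2 version and then uses vectorLength; compile with e.g. -O2 -march=haswell:

#include <cstdio>

int main()
{
    // Zeros are at positions 1, 4, 6, 10, 12, 14
    const __m128i input = _mm_setr_epi8( 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1 );
    __m128i packed = compressZeroIndices_bmi2( input );
    uint32_t len = vectorLength( packed );      // 6 for this input

    alignas(16) uint8_t out[ 16 ];
    _mm_store_si128( (__m128i*)out, packed );
    for( uint32_t i = 0; i < len; i++ )
        printf( "%d ", (int)out[ i ] );         // expected: 1 4 6 10 12 14
    return 0;
}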
Soonts
  • 8x 2x pshufb+pmin/maxub + pblendvb seems really expensive, like way worse than the scalar LUT in my answer even with a store-forwarding stall. (I'm kind of surprised this answer is accepted, since it's basically just a demonstration of one way it's possible, but might not even be faster than naive 1-at-a-time scalar on average. Maybe not *that* bad, but still.) – Peter Cordes May 04 '22 at 07:12
  • Isn't there something like 15x `psrldq` and `pminub` we can do, so the low element is the hmin of all 16. Does that work for higher elements? Element #1 is the hmin of the high 15, which may be 0xFF. Element #2 can't ever be below 2, but whether it's the lowest of those elements or the 2nd-lowest depends on whether 1 or both of the lowest 2 elements were 0. (e.g. `0 1 2 5 7 -1 -1 ...` case vs. `2 5 7 9 -1 -1 ...` case) Can we do anything with carry-propagation of `psubq` or `paddq` through qword elements to have a low zero affect higher ones? – Peter Cordes May 04 '22 at 07:16
  • Does your sort loop handle the full general case of elements needing to move in either direction? If so, perhaps it can be simplified, since we only care about the low count_zeros elements, and elements are only moving towards the left? Except for the 0xFF elements; those need to move to the right, but they're all identical. – Peter Cordes May 04 '22 at 07:29
  • Is there anything we can do with `vpsravd` arithmetic variable-right-shift? Maybe no. BTW, `vpblendvb` is *not* single-uop / 1c latency on Intel CPUs; it's 2 uops with 2c latency. If one of the elements in every position was `0xFF`, you could replace it with `vpand`, but that won't be the case most obviously for the all-zero input case where the input is already sorted. Can we usefully branch on popcount(movemask(v)) to see if there are 8 or fewer valid indices, maybe allowing fewer shuffles? But we wouldn't know where they are, or how to get them all into the low 8. – Peter Cordes May 04 '22 at 07:33
  • @PeterCordes Good catch on the latencies, was looking at the SSE4 version. Fixed. – Soonts May 04 '22 at 11:15
  • @PeterCordes About sorting, indeed, that's the fully general case of sorting 16 bytes. It's relatively fast to compute a prefix or postfix sum over that vector, but it doesn't really help moving these bytes to the lower positions. – Soonts May 04 '22 at 11:20
  • @PeterCordes Here's a pretty efficient prefix sum https://godbolt.org/z/v5vvWTY4j which allows computing the destination positions of these bytes: flip 0/1, compute the prefix sum, and for each 0 the sum contains the destination index. The problem is the lack of an "inverse" `vpshufb` instruction which would take destination indices instead of source indices. – Soonts May 04 '22 at 12:14
  • Interesting idea, and good thought with the prefix-sum, but yeah it does kind of solve a different problem, and I also don't see a way to transform it to something we can use here. It sets up for a (byte) scatter store :/ I guess this kind of shows just how fundamental a building block `vpcompress[b/w/d/q]` is, unless there's some way to do this that we're missing. (And BTW, I like this answer as a proof-of-concept, but I un-upvoted because I don't think it's practically useful for performance compared to mine; I want future readers to see the more efficient way at the top.) – Peter Cordes May 04 '22 at 12:28
  • @Soonts here's your code with a most likely faster sorting-network impl: https://godbolt.org/z/4hE6Gx8x9 – Noah May 04 '22 at 14:54
  • @Noah Thanks, but I think Peter is actually correct about BMI2 being faster than sorting networks, at least for this particular use case. – Soonts May 04 '22 at 15:05
  • Your BMI2 way spends a lot of extra work to get the upper elements to -1 to get the length encoded into the vector itself instead of having that as a separate return value. That could just use popcnt(movemask). (Also, you basically count set elements twice, once scalar for the low half, once with both halves with `psadbw`. Maybe it's faster or at least lower latency to do that separately instead of a `movd` + shift somewhere to get the scalar popcount out of the vector, or vice versa getting a vector from the scalar popcount after shifting. It just seems like it should be avoidable.) – Peter Cordes May 04 '22 at 15:29
  • @PeterCordes I think that work is borderline free due to ILP. They are 7 fast vector instructions (1 cycle/each), independent from the scalar parts. The extract/pext/insert instructions have a few cycles of latency each. The only expensive one is loading the identity vector, but if the OP will call this function in a loop, the optimizer will hopefully inline the function, pre-loading that vector. – Soonts May 04 '22 at 15:57
  • ILP doesn't make stuff free, unless you're doing this between LFENCE instructions, or you're assuming a latency bottleneck. The usual use-case for SIMD is, as you say, in loops where independent work can overlap. If this is left-packing a whole array, not just one vector, hopefully the loop-carried dependency is just a pointer increment, with this independent vector-packing work feeding independent `movdqu` stores. – Peter Cordes May 04 '22 at 16:48

Horizontal data-dependent stuff is hard. This is not something traditional SIMD building blocks are good at. This is a tricky problem to start learning SIMD with.

If you had AVX512VBMI2 (Ice Lake), vpcompressb could do this in one instruction on a constant. (Well two, counting the test-into-mask of the input.)
Or with AVX-512BW (Skylake-avx512), you could use vpcompressd on a constant vector of 16x uint32_t, then pack that __m512i back down to bytes with vpmovdb after compressing. (After the same test-into-mask of the byte vector.)
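
A minimal sketch of that vpcompressb approach, assuming AVX512VBMI2 + AVX512VL are available (the function name is just illustrative):

#include <immintrin.h>

// Requires AVX512VBMI2 + AVX512VL (Ice Lake and later)
__m128i zidx_avx512vbmi2( __m128i v )
{
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    __mmask16 zeros = _mm_cmpeq_epi8_mask( v, _mm_setzero_si128() );   // test-into-mask
    return _mm_maskz_compress_epi8( zeros, identity );                 // left-pack indices of zero bytes; upper bytes zeroed
}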


16 separate elements means a single table-lookup is not viable; 2^16 x __m128i would be 64K x 16-bytes = 1 MiB, most accesses would miss in cache. (The code would be simple though; just _mm_cmpeq_epi8 against zero or _mm_slli_epi32(v, 7) / _mm_movemask_epi8 / use that 16-bit bitmask as an array index).
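
For illustration only, that (deliberately rejected) single-lookup version would look something like this; the table name is hypothetical, and the 1 MiB table is the problem, not the code:

// Not recommended: the 64K x 16-byte table would miss cache on almost every access.
__m128i zidx_hugeLUT( __m128i v, const __m128i *table /* hypothetical 65536 precomputed result vectors */ )
{
    unsigned zmask = (unsigned)_mm_movemask_epi8( _mm_cmpeq_epi8( v, _mm_setzero_si128() ) );
    return table[ zmask ];
}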

Possibly 4 lookups of 4-byte chunks using 4 mask bits at a time could work. (With a SWAR add of 0x04040404 / 0x08080808 / 0x0c0c0c0c to offset the result). Your table could also store offset values, or you could _lzcnt_u32 or something to figure out how much to increment a pointer until the next store, or _popcnt_u32(zpos&0xf).

#include <stdint.h>
#include <immintrin.h>
#include <stdalign.h>
#include <string.h>

// untested but compiles ok
char *zidx_SSE2(char *outbuf, __m128i v)
{
   alignas(64) static struct __attribute__((packed)) {
       uint32_t idx;
       uint8_t count;  // or make this also uint32_t, but still won't allow a memory-source add unless it's uintptr_t.  Indexing could be cheaper in a PIE though, *8 instead of *5 which needs both base and idx
   }lut[] = { // 16x 5-byte entries
      /*[0b0000]=*/ {0, 0}, /* [0b0001]= */ {0x00000000, 1}, /* [0b0010]= */ {0x00000001, 1 },
      //...  left-packed indices, count of non-zero bits
              /* [0b1111]=*/ {0x03020100, 4}
    };
    // Maybe pack the length into the high 4 bits, and mask?  Maybe not, it's a small LUT

   unsigned zpos = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
   for (int i=0 ; i<16 ; i+=4){
       uint32_t idx = lut[zpos & 0xf].idx;
       idx += (0x01010101 * i);  // this strength-reduces or is a constant after fully unrolling.  GCC -O2 even realizes it can use add reg, 0x04040404 *as* the loop counter; clang -fno-unroll-loops doesn't
       // idxs from bits 0..3, bits 4..7, bits 8..11, or bits 12..15
       memcpy(outbuf, &idx, sizeof(idx));   // x86 is little-endian.  Aliasing safe unaligned store.
       outbuf += lut[zpos & 0xf].count;  // or popcount(zpos&0xf)
       zpos >>= 4;
   }
   return outbuf;  // pointer to next byte after the last valid idx
}

https://godbolt.org/z/59Ev1Tz37 shows GCC and clang without loop unrolling. gcc -O3 does fully unroll it, as does clang at -O2 by default.

It will never store more than 16 bytes into outbuf, but stores fewer than that for inputs with fewer zero bytes. (But every store to outbuf is 4 bytes wide, even if there were zero actual indices in this chunk.) If all the input vector bytes are 0, the 4 stores won't overlap at all, otherwise they will (partially or fully) overlap. This is fine; cache and store buffers can absorb this easily.

SIMD vectors are fixed width, so IDK exactly what you meant about your output only having those values. The upper bytes have to be something; if you want zeros, then you could zero the outbuf first. Note that reloading from it into a __m128i vector would cause a store-forwarding stall (extra latency) if done right away after being written by 4x 32-bit stores. That's not a disaster, but it's not great. Best to do this into whatever actual output you want to write in the first place.


BMI2 pext is a horizontal pack operation

You said in comments you want this for an i7 with AVX2.
That also implies you have fast BMI2 pext / pdep (Intel since Haswell, AMD since Zen 3). Earlier AMD CPUs support those instructions, but not fast. Those do the bitwise equivalent of vpcompressb / vpexpandb on a uint64_t in an integer register.

This could allow a similar trick to [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240)
After turning your vector into a mask of 0 / 0xf nibbles, we can extract the corresponding nibbles with values 0..15 into the bottom of an integer register with one pext instruction.

Or maybe keep things packed to bytes at the smallest to avoid having to unpack nibbles back to bytes, so then you'd need two separate 8-byte left-pack operations and need a popcnt or lzcnt to figure out how they should overlap.

Your pext operands would be the 0 / 0xff bytes from a `_mm_cmpeq_epi8(v, _mm_setzero_si128())`, extracted in two uint64_t halves with `lo = _mm_cvtsi128_si64(cmp)` and `hi = _mm_extract_epi64(cmp, 1)`.

Use memcpy as an unaligned aliasing-safe store, as in the LUT version.
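
Here's a rough sketch of that pext left-pack (my own, untested), following the steps described above: compare, extract the two halves, pext the identity bytes, then two overlapping 8-byte stores. Needs BMI2 and POPCNT (e.g. compile with -march=haswell). Like the LUT version, it may store up to 16 bytes into outbuf regardless of how many indices are valid.

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

char *zidx_pext(char *outbuf, __m128i v)
{
    __m128i cmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());     // 0xFF where the input byte was 0
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(cmp);
    uint64_t hi = (uint64_t)_mm_extract_epi64(cmp, 1);

    // pext left-packs the identity bytes 0..7 / 8..15 selected by the 0xFF mask bytes
    uint64_t lo_idx = _pext_u64(0x0706050403020100ull, lo);
    uint64_t hi_idx = _pext_u64(0x0F0E0D0C0B0A0908ull, hi);
    unsigned lo_cnt = (unsigned)_mm_popcnt_u64(lo) / 8;        // number of zero bytes in the low half
    unsigned hi_cnt = (unsigned)_mm_popcnt_u64(hi) / 8;

    memcpy(outbuf, &lo_idx, 8);              // aliasing-safe unaligned stores; they overlap,
    memcpy(outbuf + lo_cnt, &hi_idx, 8);     // which cache and the store buffer absorb fine
    return outbuf + lo_cnt + hi_cnt;         // pointer to the next byte after the last valid idx
}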

Peter Cordes

Slightly customized from here.

This SSSE3 strategy processes the two 64-bit halves separately, then recombines the result into a 128-bit word. Merging the 64-bit halves in an xmm register is more expensive than a compress-store to memory using overlapping writes.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

/// `v` input bytes are either a 1 or 0
/// `v` output bytes are the "compressed" indices of the zeros' locations in the input;
///     unused trailing bytes in the output are filled with garbage.
/// returns the number of used bytes in `v`
static inline
size_t zidx_SSSE3 (__m128i &v) {

    static const uint64_t table[27] = { /* 216 bytes */
        0x0000000000000706, 0x0000000000070600, 0x0000000007060100, 0x0000000000070602,
        0x0000000007060200, 0x0000000706020100, 0x0000000007060302, 0x0000000706030200,
        0x0000070603020100, 0x0000000000070604, 0x0000000007060400, 0x0000000706040100,
        0x0000000007060402, 0x0000000706040200, 0x0000070604020100, 0x0000000706040302,
        0x0000070604030200, 0x0007060403020100, 0x0000000007060504, 0x0000000706050400,
        0x0000070605040100, 0x0000000706050402, 0x0000070605040200, 0x0007060504020100,
        0x0000070605040302, 0x0007060504030200, 0x0706050403020100
    };

    const __m128i id = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

    // adding 8 to each shuffle index is cheaper than extracting the high qword
    const __m128i offset = _mm_cvtsi64_si128(0x0808080808080808);

    // bits[4:0] = index -> ((trit_d * 0) + (trit_c * 9) + (trit_b * 3) + (trit_a * 1))
    // bits[15:7] = popcnt
    const __m128i sadmask = _mm_set1_epi64x(0x8080898983838181);

    // detect 1's (spaces)
    __m128i mask = _mm_sub_epi8(_mm_setzero_si128(), v);
    
    // manually process 16-bit lanes to reduce possible combinations
    v = _mm_add_epi8(v, id);

    // extract bitfields describing each qword: index, popcnt
    __m128i desc = _mm_sad_epu8(_mm_and_si128(mask, sadmask), sadmask);
    size_t lo_desc = (size_t)_mm_cvtsi128_si32(desc);
    size_t hi_desc = (size_t)_mm_extract_epi16(desc, 4);

    // load shuffle control indices from pre-computed table
    __m128i lo_shuf = _mm_loadl_epi64((__m128i*)&table[lo_desc & 0x1F]);
    __m128i hi_shuf = _mm_or_si128(_mm_loadl_epi64((__m128i*)&table[hi_desc & 0x1F]), offset);

    //// recombine shuffle control qwords ////
    // emulates a variable `_mm_bslli_si128(hi_shuf, lo_popcnt)` operation
    desc = _mm_srli_epi16(desc, 7); // isolate popcnts
    __m128i shift = _mm_shuffle_epi8(desc, _mm_setzero_si128()); // broadcast popcnt of low qword
    hi_shuf = _mm_shuffle_epi8(hi_shuf, _mm_sub_epi8(id, shift)); // byte shift left
    __m128i shuf = _mm_max_epu8(lo_shuf, hi_shuf); // merge

    v = _mm_shuffle_epi8(v, shuf);
    return (hi_desc + lo_desc) >> 7; // popcnt
}
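
For reference, a minimal usage sketch of my own (not from the answer), using the question's example input; the indices come back packed into the low bytes of v:

#include <cstdio>

int main()
{
    __m128i v = _mm_setr_epi8( 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1 );
    size_t n = zidx_SSSE3( v );              // 6; the low bytes of v are now 1,4,6,10,12,14

    alignas(16) uint8_t out[16];
    _mm_store_si128( (__m128i*)out, v );
    for( size_t i = 0; i < n; i++ )
        printf( "%d ", (int)out[i] );        // expected: 1 4 6 10 12 14
    return 0;
}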

If we're extracting these indices just for future scalar processing then we might wish to consider using pmovmskb then peeling off each index as needed.

// compare_mask: 0xFF where the input byte was 0, e.g. from _mm_cmpeq_epi8(v, _mm_setzero_si128())
x = (unsigned)_mm_movemask_epi8(compare_mask);
while (x) {
    idx = count_trailing_zeros(x); // e.g. _tzcnt_u32(x) or __builtin_ctz(x)
    x &= x - 1; // clear lowest set bit
    DoSomethingTM(idx);
}
aqrit