
I am trying to achieve a performance improvement and have had some good experience with SIMD so far. Until now I have been using OMP and would like to improve my skills further using intrinsics.

In the following scenario, I failed to improve (or even vectorize) anything because of a data dependency: last_value is required for the test of element n+1.

The environment is x64 with AVX2, so I want to find a way to vectorize and SIMDify a function like this.

inline static size_t get_indices_branched(size_t* _vResultIndices, size_t _size, const int8_t* _data) {
    size_t index = 0;
    int8_t last_value = 0;
    for (size_t i = 0; i < _size; ++i) {
        if ((_data[i] != 0) && (_data[i] != last_value)) {
            // add to _vResultIndices
            _vResultIndices[index] = i;
            last_value = _data[i];
            ++index;
        }
    }
    return index;
}

Input is an array of signed 1-byte values; each element is one of <0, 1, -1>. Output is an array of indices into the input (or pointers), signalling a change to 1 or -1.

example in/output

in: { 0,0,1,0,1,1,-1,1, 0,-1,-1,1,0,0,1,1, 1,0,-1,0,0,0,0,0, 0,1,1,1,-1,-1,0,0, ... }
out: { 2,6,7,9,11,18,25,28, ... }
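
To make the expected behaviour concrete, a minimal scalar test harness (my addition, using only the 32 elements shown above) reproduces this output with the reference function:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// runs get_indices_branched() on the example input and prints the indices
int main() {
    const int8_t data[] = { 0,0,1,0,1,1,-1,1, 0,-1,-1,1,0,0,1,1,
                            1,0,-1,0,0,0,0,0, 0,1,1,1,-1,-1,0,0 };
    size_t indices[sizeof(data)];
    size_t count = get_indices_branched(indices, sizeof(data), data);
    for (size_t i = 0; i < count; ++i)
        printf("%zu ", indices[i]); // prints: 2 6 7 9 11 18 25 28
    printf("\n");
}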

My first attempt was to play with various branchless versions and see, by comparing assembly outputs, whether auto-vectorization or OMP was able to translate them into SIMD-ish code.

example attempt

int8_t* rgLast = (int8_t*)alloca((_size + 1) * sizeof(int8_t));
rgLast[0] = 0;

#pragma omp simd safelen(1)
for (size_t i = 0; i < _size; ++i) {
    bool b = (_data[i] != 0) & (_data[i] != rgLast[i]);
    _vResultIndices[index] = i;
    rgLast[i + 1] = (b * _data[i]) + (!b * rgLast[i]);
    index += b;
}
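
As pointed out in the comments, the multiply-based select can just as well be written as a ternary; a sketch of the same attempt (the serial dependency through rgLast is what actually prevents vectorization, not how the select is expressed):

for (size_t i = 0; i < _size; ++i) {
    bool b = (_data[i] != 0) & (_data[i] != rgLast[i]);
    _vResultIndices[index] = i;                  // unconditional store; only kept when index advances
    rgLast[i + 1] = b ? _data[i] : rgLast[i];    // carry the last non-zero value forward
    index += b;
}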

Since no experiment resulted in SIMD output, I started to experiment with intrinsics, with the goal of translating the conditional part into a mask.

For the != 0 part that's pretty straightforward:

const __m256i* vData = (const __m256i*)_data;
__m256i vHasSignal = _mm256_cmpeq_epi8(vData[i], _mm256_set1_epi8(0)); // marks the 0 elements, to be eliminated
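
A movemask on that comparison result then collapses it into one bit per element (a small sketch of a possible next step, not the final approach):

uint32_t zeroMask    = (uint32_t)_mm256_movemask_epi8(vHasSignal); // bit i set => _data[i] == 0
uint32_t nonZeroMask = ~zeroMask;                                  // bit i set => _data[i] != 0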

For the conditional part, the test against the "last flip", I have not yet found a way.

To solve the subsequent output-packing step, I assume the approach from "AVX2 what is the most efficient way to pack left based on a mask?" could work.

Update 1

Digging deeper into this topic reveals that it is beneficial to separate the 1s and -1s and get rid of the 0s. Luckily, in my case I can grab this directly from pre-processing and skip the conversion to <1,0,-1> (using _mm256_xor_si256, for example), having 2 input vectors separated as gt0 (all 1s) and lt0 (all -1s). This also allows 4 times tighter packing of the data.

I might want to end up with a process like the one sketched in the "Process" diagram. The challenge now is how to create the transition mask from the gt0 and lt0 masks.

Update 2

Apparently, splitting the 1s and -1s into 2 streams (see the answer below for how) introduces a dependency when accessing the elements by scanning them alternately: How to efficiently scan 2 bit masks alternating each iteration

Creating a transition mask, as @aqrit worked out, using transition mask = ((~lt + gt) & lt) | ((~gt + lt) & gt) is possible. Even though this adds quite a few instructions, it appears to be a beneficial tradeoff for eliminating the data dependency. I assume the gain grows with register width (which might be chip dependent).
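
To make the bit-twiddling concrete, here is a small scalar sketch (my addition) of the formula on 64-bit masks, applied to the first 12 elements of the example input:

#include <cstdint>
#include <cstdio>
#include <immintrin.h>

// Each set bit in gt launches a carry in (~lt + gt) that ripples upward through
// the non-lt positions and lands on the next lt bit above it, so "& lt" keeps exactly
// the -1 elements whose previous non-zero element was a +1; the mirrored term handles
// -1 -> +1. The very first non-zero run has no preceding opposite bit, which is the
// missing "initial state" discussed below.
static inline uint64_t transition_mask(uint64_t gt, uint64_t lt) {
    return ((~lt + gt) & lt) | ((~gt + lt) & gt);
}

int main() {
    // bitmasks of the first 12 example elements { 0,0,1,0,1,1,-1,1, 0,-1,-1,1 }, LSB = index 0
    uint64_t gt = 0b100010110100; // +1 at indices 2, 4, 5, 7, 11
    uint64_t lt = 0b011001000000; // -1 at indices 6, 9, 10
    for (uint64_t m = transition_mask(gt, lt); m; m = _blsr_u64(m))
        printf("%u ", (unsigned)_tzcnt_u64(m)); // prints: 6 7 9 11 (index 2 is missing without carry-in)
    printf("\n");
}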

Update 3

By vectorizing transition mask = ((~lt + gt) & lt) | ((~gt + lt) & gt) I got the following compiled output:

vmovdqu     ymm5,ymmword ptr transition_mask[rax]  
vmovdqu     ymm4,ymm5  
vpandn      ymm0,ymm5,ymm6  
vpaddb      ymm1,ymm0,ymm5  
vpand       ymm3,ymm1,ymm5  
vpandn      ymm2,ymm5,ymm6  
vpaddb      ymm0,ymm2,ymm5  
vpand       ymm1,ymm0,ymm5  
vpor        ymm3,ymm1,ymm3  
vmovdqu     ymmword ptr transition_mask[rax],ymm3

At first glance this looks efficient compared to the potential condition-related pitfalls of post-processing (vertical scan + append to output), even though dealing with 2 streams instead of 1 still appears right and logical.

This still lacks the ability to generate the initial state per chunk (the transition from 0 to either 1 or -1). I am not sure whether there is a way to enhance the transition_mask generation bit-twiddling, or whether to determine the initial state with _tzcnt_u32(mask0) > _tzcnt_u32(mask1) as Soonts uses here: https://stackoverflow.com/a/70890642/18030502, which appears to include a branch.

Conclusion

The approach @aqrit shared, using an improved bit-twiddling solution per loaded chunk to find the transitions, turns out to be the most performant at runtime. The hot inner loop is just 9 asm instructions long (per 2 found items, for comparison with other approaches), using tzcnt and blsr like this:

tzcnt       rax,rcx  
mov         qword ptr [rbx+rdx*8],rax  
blsr        rcx,rcx  
tzcnt       rax,rcx  
mov         qword ptr [rbx+rdx*8+8],rax  
blsr        rcx,rcx  
add         rdx,2  
cmp         rdx,r8  
jl          main+2580h (...)  
Chris G.
  • You have to show us what you have tried so far. This problem is close to finding the maximum index so maybe [this](https://stackoverflow.com/questions/23590610/find-index-of-maximum-element-in-x86-simd-vector) can help. – Shahriar Jan 25 '22 at 18:22
  • Thanks @Shahriar for pointing that out. I have added more input and checked your link. It's interesting since it's handling horizontal aggregation operations. I don't think my scenario is a sorting-related issue, since the order remains. To my understanding it looks like an n-times copy_if over vectors (without a branch), or did I get you wrong? – Chris G. Jan 25 '22 at 19:36
  • `rgLast[i + 1] = (b * _data[i]) + (!b * rgLast[i]);` is nasty; just use a ternary. Modern GCC/clang understand it at least as well; if anything this might mislead them into doing multiplies. Lack of vectorization is due to the serial dependency of each element on the previous, not because of any problem expressing the select logic. – Peter Cordes Jan 25 '22 at 20:49
  • This is hard especially without AVX-512 for stuff like `vpconflictd` and `vpcompressd` (or `b`). Although even `vpconflictd` isn't the right horizontal operation. Discarding an element or not depends on some previous element an unbounded distance away, so you can't even just `xor(load(data+i), load(data+i+1)) == -2` or something, although that may be the best way to detect a transition / edge. – Peter Cordes Jan 25 '22 at 20:53
  • Perhaps a prefix-sum type of shuffle pattern plus blends could propagate -1 or +1 through any zeros, or even find transitions. That would still leave the problem of left-packing indices once you found them. You might not be gaining anything over scalar at that point. Whatever you do, you almost certainly don't want a tmp array of the full size of the input. (Hopefully that was just part of an experiment to see if you could get any part of it to vectorize.) – Peter Cordes Jan 25 '22 at 20:55
  • Thanks @PeterCordes, `rgLast[i + 1] = (b * _data[i]) + (!b * rgLast[i]);` serves just as an example, one of many variations to visualize, in a C-ish way, the attempt to get rid of the branch at least. In my case MSVC/Win (I could not yet get LLVM compatible with the entire project) translates the ternary to `movzx`, which helps prevent pipeline stalls but does not increase performance. I am happy you shared ideas to start from, since I am inexperienced with SIMD ops. I have read about the prefix-sum approach in one of your posts already and will experiment with that. – Chris G. Jan 25 '22 at 21:07
  • Introducing a tmp is indeed just part of experimenting (including checking asm output / profiling), trying to express my intent to the compiler in a more vector-like style. – Chris G. Jan 25 '22 at 21:27
  • Multi-threading should be easy for this part. The transitions self-synchronize *after* the first potential transition in each arbitrarily sized memory chunk. The serial sync point is when the results from each thread are copied back together: just conditionally skip the first transition index, depending on the state of the preceding chunk(s). – aqrit Feb 07 '22 at 21:55

2 Answers


Complete vectorization is suboptimal for your case. It’s technically possible, but I think the overhead of producing that array of uint64_t values (I assume you’re compiling for 64 bit CPUs) will eat all the profit.

Instead, you should load chunks of 32 bytes, and immediately convert them to bit masks. Here’s how:

inline void loadBits( const int8_t* rsi, uint32_t& lt, uint32_t& gt )
{
    const __m256i vec = _mm256_loadu_si256( ( const __m256i* )rsi );
    lt = (uint32_t)_mm256_movemask_epi8( vec );
    const __m256i cmp = _mm256_cmpgt_epi8( vec, _mm256_setzero_si256() );
    gt = (uint32_t)_mm256_movemask_epi8( cmp );
}

The rest of your code should deal with these bitmaps. To find the first non-zero element (you only need to do that at the start of your data), scan for the least significant set bit in the (lt | gt) integer. To find a -1, scan for the least significant set bit in the lt integer; to find a +1, scan for the least significant set bit in the gt integer. Once found and handled, you can either clear the low portion of both integers with a bitwise AND, or shift them both to the right.

CPUs have the BSF instruction, which scans for the lowest set bit in an integer and returns two things at once: a flag telling whether the integer was zero, and the index of that set bit. If you're using VC++ there's the _BitScanForward intrinsic; otherwise use inline asm, as that instruction is only exposed in VC++. GCC's __builtin_ctz is not quite the same thing: it only returns a single value instead of two.

However, on AMD CPUs the TZCNT instruction from the BMI1 set is somewhat faster than the old-school BSF (on Intel they're equal), so TZCNT will probably be slightly faster there, despite the extra instruction needed to compare the input with 0.
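
Putting that scan into code, a rough sketch (my addition, not part of the answer; the name scan_chunk and the carried last sign are illustrative) of consuming one 32-element chunk could look like this:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// consume one 32-element chunk; `last` is the sign of the last recorded element
// (0 before anything was recorded), `base` is the chunk's offset into the data
static size_t scan_chunk(uint32_t lt, uint32_t gt, size_t base,
                         size_t* out, size_t index, int8_t* last)
{
    for (;;) {
        // which mask to scan depends on the sign we are waiting for
        uint32_t next = (*last > 0) ? lt : (*last < 0) ? gt : (lt | gt);
        if (next == 0) break;                   // no further transition in this chunk
        uint32_t bit = _tzcnt_u32(next);
        out[index++] = base + bit;
        *last = ((lt >> bit) & 1) ? -1 : 1;     // sign of the element just recorded
        uint32_t keep = (~0u << bit) << 1;      // clears bits 0..bit; two shifts keep bit == 31 safe
        lt &= keep;                             // drop everything up to and including the recorded bit
        gt &= keep;
    }
    return index;
}

Fed from loadBits, base would advance by 32 per chunk and last would be carried across chunks.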

Soonts
  • Thanks a bunch @Soonts, this perfectly answers my original question! You made me curious about the full SIMD approach to solving this, under the assumption that the masks `lt` and `gt` do not have to be produced and are already present as input. – Chris G. Jan 26 '22 at 20:53
  • If you have AVX2, you (always?) also have BMI1, so you can use `tzcnt` which is more efficient on AMD (1 uop vs. 7 IIRC for `bsf`). Its FLAGS result is based on the output, not the input, so you'd need a separate test for zero input, but since test/jz can macro-fuse on Intel and AMD that's no more expensive than a `jz` on FLAGS from `bsf`. If you have BMI2 as well as BMI1, shifting both right can use an efficient `shrx`. Without that on Intel, 2x 3 uop variable-count shifts might be worse than constructing a mask. Or if you can clear the lowest set in each, that's just BMI1 `blsr` `x&=x-1` – Peter Cordes Jan 27 '22 at 00:23
  • `transition mask = ((~lt + gt) & lt) | ((~gt + lt) & gt)` – aqrit Jan 27 '22 at 01:09
  • @ChrisG. About full vectorization, `tzcnt` can be emulated somehow. For instance, `x & ( x - 1 )` to isolate the bit, `_mm_cvtepi32_ps` to convert the number to FP32, `_mm_srli_epi32` by 23 bits to grab the exponent, then subtract the 127 bias. Too much overhead, I think scalar bit scan is better. Speaking of which, if you really have a lot of data, consider 64-bit bitmaps instead of 32-bit ones. The performance of both `BSF` and `TZCNT` bit scanning instructions is the same for 32- and 64-bit numbers. – Soonts Jan 27 '22 at 07:12
  • @PeterCordes Good point, updated – Soonts Jan 27 '22 at 07:21
  • @aqrit that's a neat way of generating the transition mask. Even though it needs a rotation and special treatment for the initial transition/decision, it appears to beat the efficiency of horizontal scanning approaches like `BSF` or `tzcnt`. It also looks applicable to large registers while keeping constant performance. – Chris G. Jan 27 '22 at 07:47
  • @ChrisG. uint64_t bit scan instructions only take 2-3 cycles of latency. That expression uses 7 instructions, and takes 4 cycles of latency in the best case. – Soonts Jan 27 '22 at 07:55
  • @Soonts good point! I have not yet tested. My first thought reading the comment was about 256 (AVX2) checks using few instructions. Indeed this requires the packing step using `BSF` / `tzcnt` anyhow as I see no alternative currently. – Chris G. Jan 27 '22 at 08:15
  • Working over `lt` and `gt` sequentially gives results for n and n+1 and requires a sort to decide in which order the two results go into the result vector. I'm not yet sure whether an intermediate step, filling a 0/1 pair permutation mask with one pair sort per AVX register, might improve performance over a comparison per iteration for sequencing/packing. – Chris G. Jan 27 '22 at 08:22
  • Somewhat simpler: `transition mask = (lt ^ (lt - gt)) & (gt ^ (gt - lt))` – Falk Hüffner Jan 28 '22 at 10:20

Carrying state serially between 64-bit SIMD lanes is more expensive than carrying state serially between 64-bit general purpose registers (gpr).

In practice, lookup tables (or SIMD left-packing) are limited to processing 8 elements at a time. If the data averages about 6 kept elements per 64 elements, then left-packing wastes a lot of processing (especially if we're collecting offsets and not doing a gather operation). If the bitset is dense, then consider moving to lookup tables.

As @Soonts suggests, use SIMD to create 64-bit bitsets and use bitscan intrinsics to find the indices of the wanted set bits.

Branch misprediction:

Squash the greater-than (gt) and less-than (lt) bitsets into a single bitset using transition_mask = ((~lt + gt) & lt) | ((~gt + lt) & gt) or this simplification from @FalkHüffner: transition_mask = (lt ^ (lt - gt)) & (gt ^ (gt - lt)).

The state is the carry-in/carry-out of one of the arithmetic ops. I'd be careful using _subborrow_u64 as it is a fairly uncommon intrinsic (and is buggy on old compilers).

That leaves, as the only remaining branch, the loop over the bitscan operation. All set bits have to be extracted, but we can unroll the loop and overshoot to make the branch more predictable. The amount of overshoot needs to be tuned to the expected data set.

Not tested. Asm not inspected.

#include <immintrin.h>
#include <stdint.h>

static inline
uint64_t get_mask (int8_t* src, unsigned char* state) {
    __m256i src0 = _mm256_loadu_si256((__m256i*)(void*)src);
    __m256i src1 = _mm256_loadu_si256((__m256i*)(void*)&src[32]);

    uint64_t lt = (uint32_t)_mm256_movemask_epi8(src0) |
                    (((uint64_t)(uint32_t)_mm256_movemask_epi8(src1)) << 32);

    src0 = _mm256_cmpgt_epi8(src0, _mm256_setzero_si256());
    src1 = _mm256_cmpgt_epi8(src1, _mm256_setzero_si256());

    uint64_t gt = (uint32_t)_mm256_movemask_epi8(src0) |
                    (((uint64_t)(uint32_t)_mm256_movemask_epi8(src1)) << 32);

    // if borrow then greater-than span extends past the msb
    uint64_t m;
    unsigned char s = *state;
    *state = _subborrow_u64(s, lt, gt, (unsigned long long*)&m); // sbb
    return (m ^ lt) & ((gt - (lt + !s)) ^ gt);
}

static inline
size_t bitset_to_index (uint64_t* dst, uint64_t base, uint64_t mask) {
    int64_t cnt = _mm_popcnt_u64(mask);
    int64_t i = 0;
    do { // unroll to taste...
        dst[i + 0] = base + _tzcnt_u64(mask); mask = _blsr_u64(mask);
        dst[i + 1] = base + _tzcnt_u64(mask); mask = _blsr_u64(mask);
        dst[i + 2] = base + _tzcnt_u64(mask); mask = _blsr_u64(mask);
        dst[i + 3] = base + _tzcnt_u64(mask); mask = _blsr_u64(mask);
        i += 4;
    } while (i < cnt);
    return (size_t)cnt;
}

static
uint64_t* get_transition_indices (uint64_t* dst, int8_t* src, size_t len) {
    unsigned char state = 0; // in less-than span
    uint64_t base = 0; // offset into src array
    size_t end = len / 64;
    for (size_t i = 0; i < end; i++) {
        uint64_t mask = get_mask(src, &state);
        src += 64;
        dst += bitset_to_index(dst, base, mask);
        base += 64;
    }
    if (len % 64) {
        ; // todo: tail loop
    }
    return dst;
}
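
A hypothetical usage sketch (my addition): len is assumed to be a multiple of 64 here, since the tail handling above is still a todo. Note that the output buffer needs a few spare slots beyond the worst case, because bitset_to_index intentionally overshoots by up to one unrolled group of 4 stores.

#include <vector>

size_t find_transitions(std::vector<uint64_t>& out, int8_t* src, size_t len) {
    out.resize(len + 4);  // worst case: every element is a transition, plus overshoot slack
    uint64_t* end = get_transition_indices(out.data(), src, len);
    size_t count = (size_t)(end - out.data());
    out.resize(count);    // keep only the real indices
    return count;
}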
aqrit
  • This is a really clever solution! Runtime performance (on Zen 1 at least) is >2 times faster (cached and not cached) than any other approach, and it solves the *initial state* issue as well, with just a minimum of branches (loops). MSVC and LLVM (clang) generate almost identical output. – Chris G. Jan 29 '22 at 19:33
  • Interesting idea to `popcnt` ahead of time, to enable overshoot, and OoO exec of the loop condition instead of tying it to the `blsr` dep chain. Given 64-bit chunks and an efficient way to chain between 64-bit chunks, this is likely better than [anything I was thinking](https://stackoverflow.com/questions/70886369/how-to-efficiently-scan-2-bit-masks-alternating-each-iteration/70890642#comment125323274_70887400) with `blsmsk` / `andn` between the pair of bitmasks inside the loop to clear the other one up to the lowest set in this one. (Or OR the blsmsk result to clear both up to the same?) – Peter Cordes Jan 31 '22 at 02:24