3

Given are two bitmasks that should be accessed alternately (0, 1, 0, 1, ...). I am trying to find a runtime-efficient solution, but I can't come up with anything better than the following example.

uint32_t mask[2] { ... };
uint8_t mask_index = 0;
uint32_t f = _tzcnt_u32(mask[mask_index]);
while (f < 32) {
    // adding the element to the result vector removed, since it's not relevant to the question itself
    mask[0] >>= f + 1;
    mask[1] >>= f + 1;
    mask_index ^= 1;
    f = _tzcnt_u32(mask[mask_index]);
}

The ASM output (MSVC, x64) looks pretty bloated.

inc         r9  
add         r9,rcx  
mov         eax,esi  
mov         qword ptr [rdi+rax*8],r9  
inc         esi  
lea         rax,[rcx+1]  
shrx        r11d,r11d,eax  
mov         dword ptr [rbp],r11d  
shrx        r8d,r8d,eax  
mov         dword ptr [rbp+4],r8d  
xor         r10b,1  
movsx       rax,r10b  
tzcnt       ecx,dword ptr [rbp+rax*4]  
mov         ecx,ecx  
cmp         rcx,20h  
jb          main+240h (07FF632862FD0h)  
cmp         r9,20h  
jb          main+230h (07FF632862FC0h) 

Does anyone have any advice?

(This is a follow-up to Solve loop data dependency with SIMD - finding transitions between -1 and +1 in an int8_t array of sgn values, which uses SIMD to create the bitmasks.)

Update

I wonder if a potential solution could make use of SIMD by loading chunks of both bit streams into a register (AVX2 in my case) like this:

|m0[0]|m1[0]|m0[1]|m1[1]|m0[2]|m1[2]|m0[n+1]|m1[n+1]|

or

one register per stream, each holding several chunks:

|m0[0]|m0[1]|m0[2]|m0[n+1]|

|m1[0]|m1[1]|m1[2]|m1[n+1]|

or split the stream into chunks of the same size and deal with as many lanes as fit into the register at once. Let's assume we have 256*10 elements, which might end up in 10 iterations like this: |m0[0]|m0[256]|m0[512]|...| |m1[0]|m1[256]|m1[512]|...| and deal with the joins between chunks separately.

I'm not sure whether this could be a way to achieve more iterations per cycle, limit the need for horizontal bit scans and shift/clear ops, and avoid branches.
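
As a very rough illustration (my own sketch, untested; the chunk arrays m0[] and m1[] and the function names are assumptions), the interleaved layout from the first diagram could be built with unpack instructions, and the per-stream layout with two plain 128-bit loads. This only shows the packing of the bit streams into an AVX2 register, not the scan itself:

#include <immintrin.h>
#include <cstdint>

// |m0[0]|m1[0]|m0[1]|m1[1]|m0[2]|m1[2]|m0[3]|m1[3]|
__m256i load_interleaved(const uint32_t* m0, const uint32_t* m1)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(m0)); // m0[0..3]
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(m1)); // m1[0..3]
    __m128i lo = _mm_unpacklo_epi32(a, b); // m0[0] m1[0] m0[1] m1[1]
    __m128i hi = _mm_unpackhi_epi32(a, b); // m0[2] m1[2] m0[3] m1[3]
    return _mm256_set_m128i(hi, lo);
}

// low 128 bits: |m0[0]|m0[1]|m0[2]|m0[3]|, high 128 bits: |m1[0]|m1[1]|m1[2]|m1[3]|
__m256i load_per_stream(const uint32_t* m0, const uint32_t* m1)
{
    return _mm256_set_m128i(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(m1)),
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(m0)));
}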

Chris G.
  • 816
  • 2
  • 15
  • Remove the while loop if possible, or try unrolling the loop. Maybe use an int64 instead of 2x int32 and do bit shifts to get the correct int. Benchmark. How fast is tzcnt? You would probably get the best performance if you can process the values sequentially without jumps. – Charles Jan 27 '22 at 22:33
  • Have you tried writing it with `std::swap(mask[0], mask[1])` instead of tricking the compiler into actually keeping the masks in memory and using variable indexing on them? Or like Jerome said, unroll. (You're also tricking MSVC into zero-extending `uint8_t mask_index` to pointer width every time. For some reason it chooses sign-extension where mov-elimination doesn't work even on Intel CPUs... Anyway, don't use an index at all, but if you did, make it at least `unsigned`, not 8-bit, since it'll live in registers.) – Peter Cordes Jan 28 '22 at 03:03
  • I try to avoid using std for computation in hot spots to have full control (and learn). As @PeterCordes pointed out, this is a follow-up to another question handling an isolated problem. To me, the answer worked out by Jérôme Richard perfectly explains and solves this question. My conclusion is to avoid the implied branching at an earlier stage already, by using 1 mask instead of 2. – Chris G. Jan 28 '22 at 07:11
  • `std::swap` is trivial, but sure, I might just manually use a temporary here, too. It doesn't matter how you swap, just that you do, so the compiler will sort that out for you and at worst use `xchg`, at best unroll. My main point is that there are more compiler-friendly ways to express things. – Peter Cordes Jan 28 '22 at 07:21
  • Hmm, what if you *did* have one mask instead of two? Then you use `not` between `tzcnt` and shifts? Or `blsi` / `add` to clear runs of contiguous ones? – Peter Cordes Jan 28 '22 at 07:22
  • @PeterCordes I got to a similar conclusion and updated https://stackoverflow.com/questions/70853685/solve-loop-data-dependency-with-simd-finding-transitions-between-1-and-1-in accordingly. I wonder if any flipping/alternation would be necessary at all if `1 mask` is in place from the beginning (a rough sketch of such a single-mask scan follows these comments). I think the 7 added instructions for the "bit twiddling" once per register load/init might avoid such a situation. I think this would fit better in the linked/related question, as it appears off-topic here. – Chris G. Jan 28 '22 at 07:29
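
For reference, here is a minimal sketch of the single-mask idea from the comments above (my illustration, untested; the function and variable names are assumptions): if every transition is a set bit of one mask, the alternation is implicit in the bit order, and the scan needs no mask index and no branching between two masks, just tzcnt plus blsr to clear the consumed bit.

#include <immintrin.h>
#include <cstdint>
#include <vector>

// Scan all set bits of a single 32-bit mask chunk in ascending order.
// 'base' is the bit offset of this chunk within the whole stream.
void scan_single_mask(uint32_t m, uint32_t base, std::vector<uint32_t>& out)
{
    while (m != 0) {
        out.push_back(base + _tzcnt_u32(m)); // absolute position of this transition
        m = _blsr_u32(m);                    // clear the lowest set bit (BMI1)
    }
}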

2 Answers

5

It is quite hard to optimize this loop. The main issue is that each iteration of the loop depends on the previous one, and even the instructions within an iteration depend on each other. This creates a long, nearly sequential chain of instructions to execute. As a result, the processor cannot execute this efficiently. In addition, some instructions in this chain have quite a high latency: tzcnt has a 3-cycle latency on Intel processors, and L1 loads/stores have a 3-cycle latency.

One solution is to work directly with registers instead of an array with indirect accesses, so as to shorten the chain and especially to remove the instructions with the highest latency. This can be done by unrolling the loop twice and splitting the problem into two different ones:

uint32_t m0 = mask[0];
uint32_t m1 = mask[1];
uint8_t mask_index = 0;

if(mask_index == 0) {
    uint32_t f = _tzcnt_u32(m0);

    while (f < 32) {
        m1 >>= f + 1;
        m0 >>= f + 1;
        f = _tzcnt_u32(m1);

        if(f >= 32)
            break;

        m0 >>= f + 1;
        m1 >>= f + 1;
        f = _tzcnt_u32(m0);
    }
}
else {
    uint32_t f = _tzcnt_u32(m1);

    while (f < 32) {
        m0 >>= f + 1;
        m1 >>= f + 1;
        f = _tzcnt_u32(m0); // next transition must come from the other mask

        if(f >= 32)
            break;

        m0 >>= f + 1;
        m1 >>= f + 1;
        f = _tzcnt_u32(m1);
    }
}

// If mask is needed, m0 and m1 need to be stored back in mask.

This should be a bit faster, mainly because of the smaller critical path, but also because the two shifts can be executed in parallel. Here is the resulting assembly code:

$loop:
        inc     ecx
        shr     edx, cl
        shr     eax, cl
        tzcnt   ecx, edx

        cmp     ecx, 32
        jae     SHORT $end_loop

        inc     ecx
        shr     eax, cl
        shr     edx, cl
        tzcnt   ecx, eax

        cmp     ecx, 32
        jb      SHORT $loop

Note that modern x86 processors can fuse the instruction pairs cmp+jae and cmp+jb, and the branch predictor can assume the loop will continue, so it only mispredicts the last conditional jump. On Intel processors, the critical path is composed of a 1-cycle-latency inc, a 1-cycle-latency shr and a 3-cycle-latency tzcnt, resulting in 5 cycles per round (1 round = 1 iteration of the initial loop). On AMD Zen-like processors, it is 1+1+2 = 4 cycles, which is very good. Optimizing this further appears to be very challenging.

One possible optimization could be to use a lookup table so as to consume the low bits of m0 and m1 in bigger steps. However, a lookup-table fetch has a 3-cycle latency, may cause expensive cache misses in practice, takes more memory and makes the code significantly more complex, since the number of trailing zero bits can be quite large (e.g. 28 bits). Thus, I am not sure this is a good idea, although it is certainly worth trying.

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
  • I wonder if we can use `blsmsk(m0)` and then apply it to the other mask with `andn` (and to itself with just `blsr`)? Then the critical path becomes two 1-cycle chains, one dependent on the other, with independent `tzcnt` + `add` coming off that to get the actual index values to store (a rough sketch of this idea follows these comments). Will think about that and maybe write it up after sleeping if nobody else does first. Hmm, Zen has 2-uop `blsmsk` / `blsr`; only Intel runs it as a single uop with 1c latency. (Even the Alder Lake E-core has 3c latency, 1 uop.) – Peter Cordes Jan 28 '22 at 08:35
  • 1
    Too bad we don't have SIMD tzcnt, or we could even emulate those bithacks with SIMD add/subtract/boolean ops to get a whole bunch done in parallel and put them in order later, trading latency for throughput. Maybe still do that, but store vectors and then reload them in a pattern suited to scalar tzcnt. AMD has 2/clock tzcnt, although it's 2 uops (probably bit-reverse -> lzcnt). – Peter Cordes Jan 28 '22 at 08:40
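
For what it's worth, here is a rough, untested sketch of the `blsmsk` idea from the comments above (my illustration, not part of the answer): instead of shifting the masks, it records absolute bit positions and clears the consumed low bits with `blsmsk`/`blsr`; the `&= ~low` steps can be compiled to `andn` when BMI1 is enabled. The names `consume_round`, `base` and `out` are assumptions.

#include <immintrin.h>
#include <cstdint>
#include <vector>

// Consume the lowest transition of m0, clear everything up to and including it
// in both masks, then do the same with m1. 'base' is the bit offset of this
// 32-bit chunk within the whole stream.
inline void consume_round(uint32_t& m0, uint32_t& m1, uint32_t base,
                          std::vector<uint32_t>& out)
{
    uint32_t idx = _tzcnt_u32(m0);     // 32 if m0 has no set bit left
    if (idx >= 32) return;
    out.push_back(base + idx);
    uint32_t low = _blsmsk_u32(m0);    // bits [0, idx] are set
    m0 = _blsr_u32(m0);                // clears the same bits in m0 itself
    m1 &= ~low;                        // clear bits [0, idx] in m1 (andn)

    idx = _tzcnt_u32(m1);              // the other mask's turn
    if (idx >= 32) return;
    out.push_back(base + idx);
    low = _blsmsk_u32(m1);
    m1 = _blsr_u32(m1);
    m0 &= ~low;
}

The dependent chain per half-round is then blsmsk followed by an and, with the tzcnt and the store hanging off to the side.
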
2

Here’s another way, untested. People all over the internet recommend against using goto, but sometimes, as in your use case, the feature does help.

// Grab 2 more of these masks, or if you don't have any, return false
bool loadMasks( uint32_t& mask0, uint32_t& mask1 );
// Consume the found value
void consumeIndex( size_t index );

void processMasks()
{
    size_t sourceOffset = 0;
    uint32_t mask0, mask1;
    // Skip initial zeros
    while( true )
    {
        if( !loadMasks( mask0, mask1 ) )
            return;
        if( 0 != ( mask0 | mask1 ) )
            break;
        sourceOffset += 32;
    }

    constexpr uint32_t minusOne = ~(uint32_t)0;
    uint32_t idx;

    // Figure out the initial state, and jump
    if( _tzcnt_u32( mask0 ) > _tzcnt_u32( mask1 ) )
        goto testMask1;

    // Main loop below
testMask0:
    idx = _tzcnt_u32( mask0 );
    if( idx >= 32 )
    {
        sourceOffset += 32;
        if( !loadMasks( mask0, mask1 ) )
            return;
        goto testMask0;
    }
    consumeIndex( sourceOffset + idx );
    mask1 &= minusOne << ( idx + 1 );

testMask1:
    idx = _tzcnt_u32( mask1 );
    if( idx >= 32 )
    {
        sourceOffset += 32;
        if( !loadMasks( mask0, mask1 ) )
            return;
        goto testMask1;
    }
    consumeIndex( sourceOffset + idx );
    mask0 &= minusOne << ( idx + 1 );
    goto testMask0;
}
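
To try the sketch above, here is a hypothetical pair of definitions for loadMasks and consumeIndex (mine, purely for testing, not part of the answer), assuming the function above is compiled in the same translation unit with <immintrin.h> included; it feeds a single pair of masks and prints each consumed index:

#include <cstddef>
#include <cstdint>
#include <cstdio>

static bool fed = false;

// Feed one pair of masks, then report exhaustion.
bool loadMasks( uint32_t& mask0, uint32_t& mask1 )
{
    if( fed )
        return false;
    fed = true;
    mask0 = 0b01000100;  // bits 2 and 6
    mask1 = 0b00101000;  // bits 3 and 5
    return true;
}

void consumeIndex( size_t index )
{
    std::printf( "transition at bit %zu\n", index );
}

int main()
{
    // Expected output: bits 2, 3 and 6. Bit 5 of mask1 is skipped because it is
    // below the bit consumed from mask0 (bit 6), matching the alternating rule.
    processMasks();
}
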
Soonts
  • 20,079
  • 9
  • 57
  • 130