Given are 2 bitmasks that should be accessed alternately (0, 1, 0, 1, ...). I am trying to find a runtime-efficient solution, but have found no better way than the following example.
uint32_t mask[2] { ... };
uint8_t mask_index = 0;
uint32_t f = _tzcnt_u32(mask[mask_index]);
while (f < 32) {
    // element adding to result vector removed, since not relevant for the question itself
    mask[0] >>= f + 1;   // note: f + 1 can reach 32 here, which is UB for a 32-bit shift in C++
    mask[1] >>= f + 1;
    mask_index ^= 1;
    f = _tzcnt_u32(mask[mask_index]);
}
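For reference, here is a portable, self-contained sketch of what the loop computes (using `__builtin_ctz` as a stand-in for `_tzcnt_u32`, and with the zero case handled explicitly since `__builtin_ctz(0)` is undefined). It also shows one way to drop the two variable shifts: instead of shifting both masks, clear the consumed low bits in place via `x ^ (x - 1)` (the `_blsmsk_u32` pattern), which keeps bit positions absolute and avoids the shift-by-32 hazard. The function name and test values are mine, not from the original post.

```cpp
#include <cstdint>
#include <vector>

// Alternating scan over two 32-bit masks without shifting both masks
// each iteration. x ^ (x - 1) (== _blsmsk_u32(x)) sets all bits up to
// and including the lowest set bit of x; clearing that range in BOTH
// masks replaces the two variable right shifts, so the found positions
// stay absolute and no accumulation is needed.
std::vector<uint32_t> alternating_positions(uint32_t m0, uint32_t m1) {
    uint32_t m[2] = {m0, m1};
    std::vector<uint32_t> out;
    int idx = 0;
    for (;;) {
        uint32_t x = m[idx];
        if (x == 0) break;                        // _tzcnt_u32 would return 32 here
        uint32_t f = (uint32_t)__builtin_ctz(x);  // stand-in for _tzcnt_u32
        out.push_back(f);                         // absolute bit position
        uint32_t low = x ^ (x - 1);               // bits 0..f set
        m[0] &= ~low;
        m[1] &= ~low;
        idx ^= 1;                                 // alternate streams
    }
    return out;
}
```

With `m0` holding bits {1, 5, 9} and `m1` holding bits {3, 7}, this yields the interleaved positions {1, 3, 5, 7, 9}.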
The ASM output (MSVC, x64) looks fairly bloated:
inc r9
add r9,rcx
mov eax,esi
mov qword ptr [rdi+rax*8],r9
inc esi
lea rax,[rcx+1]
shrx r11d,r11d,eax
mov dword ptr [rbp],r11d
shrx r8d,r8d,eax
mov dword ptr [rbp+4],r8d
xor r10b,1
movsx rax,r10b
tzcnt ecx,dword ptr [rbp+rax*4]
mov ecx,ecx
cmp rcx,20h
jb main+240h (07FF632862FD0h)
cmp r9,20h
jb main+230h (07FF632862FC0h)
Does anyone have any advice?
(This is a follow-up to Solve loop data dependency with SIMD - finding transitions between -1 and +1 in an int8_t array of sgn values using SIMD to create the bitmasks.)
Update
I wonder whether a potential solution could make use of SIMD by loading chunks of both bit streams into a register (AVX2 in my case), like this:
|m0[0]|m1[0]|m0[1]|m1[1]|m0[2]|m1[2]|m0[n+1]|m1[n+1]|
or
1 register with chunks per stream
|m0[0]|m0[1]|m0[2]|m0[n+1]|
|m1[0]|m1[1]|m1[2]|m1[n+1]|
or split the stream into chunks of the same size and process as many lanes as fit into the register at once. Let's assume we have 256*10 elements, which might end up as 10 iterations like this: |m0[0]|m0[256]|m0[512]|...| |m1[0]|m1[256]|m1[512]|...| with the join between chunks handled separately.
I am not sure whether this might be a way to achieve more iterations per cycle while reducing the need for horizontal bit scans and shift/clear ops, and avoiding branches.