4

Is there an efficient way to get 0x00000001 or 0xFFFFFFFF for a non-zero unsigned integer value, and 0 for zero, without branching?

I want to test several masks and create another mask based on that. Basically, I want to optimize the following code:

unsigned getMask(unsigned x, unsigned masks[4])
{
    return (x & masks[0] ? 1 : 0) | (x & masks[1] ? 2 : 0) |
           (x & masks[2] ? 4 : 0) | (x & masks[3] ? 8 : 0);
}

I know that some optimizing compilers can handle this, but even if that's the case, how exactly do they do it? I looked through the Bit Twiddling Hacks page, but found only a description of conditionally setting or clearing a mask based on a boolean condition, so the int-to-bool conversion would have to happen outside that method.

If there is no generic way to solve this, how can I do that efficiently using x86 assembler code?

phuclv
bkxp

3 Answers

2

x86 SSE2 can do this in a few instructions, the most important being movmskps, which extracts the top bit of each 4-byte element of a SIMD vector into an integer bitmap.

Intel's intrinsics guide is pretty good; see also the SSE tag wiki.

#include <immintrin.h>

static inline
unsigned getMask(unsigned x, unsigned masks[4])
{
    __m128i vx = _mm_set1_epi32(x);
    __m128i vm = _mm_load_si128((const __m128i*)masks);  // or _mm_loadu_si128 if this can inline where masks[] isn't 16-byte aligned

    __m128i anded = _mm_and_si128(vx, vm);   // 'and' itself is a reserved alternative token in C++

    __m128i eqzero = _mm_cmpeq_epi32(anded, _mm_setzero_si128());   // vector of 0 or -1 elems
    unsigned zeromask = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(eqzero));
    return zeromask ^ 0xf;  // flip the low 4 bits
}

Until AVX512, there's no SIMD cmpneq, so the best option is scalar XOR after extracting a mask. (We want to just flip the low 4 bits, not all of them with a NOT.)

Peter Cordes
1

The usual way to do this in x86 is:

test eax, eax
setne al
simonzack
  • But this is hardly branchless. –  Nov 24 '13 at 08:55
  • 4
    I'm interpreting branchless here as having no jmp instructions. Shifting and xor hacks are probably going to be slower. – simonzack Nov 24 '13 at 08:56
  • 1
    Thanks! BTW why do you believe it is hardly branchless? Is 'setne' a time-consuming operation? – bkxp Nov 24 '13 at 08:58
  • ZF is just a bit in the eflags register, which is a register like any other. setne is probably faster than any shift, add or xor sequence. A lot of C++ compilers actually emit this sort of code. – simonzack Nov 24 '13 at 08:59
1

You can use !! to coerce a value to 0 or 1 and rewrite the expression like this:

return !!(x & masks[0]) | (!!(x & masks[1]) << 1) |
       (!!(x & masks[2]) << 2) | (!!(x & masks[3]) << 3);
phuclv