4

Is there an efficient way to get 0x00000001 or 0xFFFFFFFF for a non-zero unsigned integer value, and 0 for zero, without branching?

I want to test several masks and create another mask based on that. Basically, I want to optimize the following code:

unsigned getMask(unsigned x, unsigned masks[4])
{
    return (x & masks[0] ? 1 : 0) | (x & masks[1] ? 2 : 0) |
           (x & masks[2] ? 4 : 0) | (x & masks[3] ? 8 : 0);
}

I know that some optimizing compilers can handle this, but even if that's the case, how exactly do they do it? I looked through the Bit Twiddling Hacks page, but found only a description of conditionally setting or clearing a mask based on a boolean condition, so the int-to-bool conversion would have to happen outside that method.

If there is no generic way to solve this, how can I do that efficiently using x86 assembler code?

phuclv
bkxp

3 Answers

2

x86 SSE2 can do this in a few instructions, the most important being movmskps, which extracts the top bit of each 4-byte element of a SIMD vector into an integer bitmap.

Intel's intrinsics guide is pretty good; see also the SSE tag wiki.

#include <immintrin.h>

static inline
unsigned getMask(unsigned x, unsigned masks[4])
{
    __m128i vx = _mm_set1_epi32(x);
    __m128i vm = _mm_load_si128((const __m128i*)masks);  // or _mm_loadu_si128 if this can inline where masks[] isn't 16-byte aligned

    __m128i anded = _mm_and_si128(vx, vm);   // 'and' itself is a reserved alternative token in C++

    __m128i eqzero = _mm_cmpeq_epi32(anded, _mm_setzero_si128());   // vector of 0 or -1 elems
    unsigned zeromask = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(eqzero));
    return zeromask ^ 0xf;  // flip the low 4 bits
}

Until AVX512, there's no SIMD cmpneq, so the best option is scalar XOR after extracting a mask. (We want to just flip the low 4 bits, not all of them with a NOT.)

Peter Cordes
1

The usual way to do this in x86 is:

test eax, eax
setne al
simonzack
  • But this is hardly branchless. –  Nov 24 '13 at 08:55
  • 4
    I'm interpreting branchless here as having no jmp instructions. Shifting and xor hacks are probably going to be slower. – simonzack Nov 24 '13 at 08:56
  • 1
    Thanks! BTW why do you believe it is hardly branchless? Is 'setne' a time-consuming operation? – bkxp Nov 24 '13 at 08:58
  • ZF is just a bit in the eflags register, which is a register like any other. setne is probably faster than any shift, add or xor sequence. A lot of C++ compilers actually emit this sort of code. – simonzack Nov 24 '13 at 08:59
1

You can use !! to coerce a value to 0 or 1 and rewrite the expression like this:

return !!(x & masks[0]) | (!!(x & masks[1]) << 1) |
       (!!(x & masks[2]) << 2) | (!!(x & masks[3]) << 3);
phuclv