    ALIGNTO(16) uint8_t noise_frame_flags[16] = { 0 };

    // Code that detects noise and sets noise_frame_flags is omitted

    __m128i xmm0            = _mm_load_si128((__m128i*)noise_frame_flags);
    bool    isNoiseToCancel = _mm_extract_epi64(xmm0, 0) | _mm_extract_epi64(xmm0, 1);

    if (isNoiseToCancel)
        cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);

This is a code snippet from my AV Capture tool on Linux. Here, noise_frame_flags is an array of flags for 16-channel audio. For each channel, the corresponding byte can be either 0 or 1; 1 indicates that the channel has some noise to cancel. For example, noise_frame_flags[0] == 1 means the first channel's noise flag was set (by the omitted code).

If even a single "flag" is set, I need to call cancelNoises, and this code seems to work fine in that regard. As you can see, I used _mm_load_si128 to load the whole, correctly aligned array of flags, and then two _mm_extract_epi64 calls to extract the "flags". My question: is there a better way to do this (using popcount, maybe)?

Note: ALIGNTO(16) is a macro that expands to the correct GCC equivalent, but is nicer looking.

the kamilz
  • [std::popcount from C++ 20](https://en.cppreference.com/w/cpp/numeric/popcount). – PaulMcKenzie Jun 08 '22 at 07:56
  • I use GCC (g++) 8.4.0 and my concern is performance. – the kamilz Jun 08 '22 at 07:58
  • Well, the performance part should be taken care of by the implementers of `popcount`. They are not going to write half-baked solutions. – PaulMcKenzie Jun 08 '22 at 07:58
  • Ok, but I think I'm stuck with the C++11 standard for a while. You mentioned C++ 20. Can I still use it? – the kamilz Jun 08 '22 at 08:00
  • @PaulMcKenzie: How do you propose using `std::popcount` on a `uint8_t [16]` anyway? Even if it was possible, that wrapper doesn't solve the strict-aliasing problem any more than `_popcnt64` or `_mm_popcnt_u64` (which we know are available because the code is using other intrinsics from `immintrin.h`), and it's not obvious what you'd do with 16 bytes of data to make popcnt useful. – Peter Cordes Jun 08 '22 at 08:19
  • @thekamilz: Updated my answer to save a uop. `add rax, [mem]` can macro-fuse with a JCC, with some addressing modes. – Peter Cordes Jun 08 '22 at 09:08
  • Ok, I think I'll update the code again. Thank you. – the kamilz Jun 08 '22 at 10:41
  • Have you considered replacing the array `noise_frame_flags` with a `uint16_t` bitmask? Then the test is trivial. – Nate Eldredge Jun 08 '22 at 13:24
  • Yes, and I didn't want to fiddle with the bits in both functions. Thank you for the suggestion anyway. – the kamilz Jun 08 '22 at 14:00

2 Answers


Yes, you eventually want a 64-bit OR to look for any non-zero bits in either half, but it's not efficient to get those uint64_t values from a 128-bit load and then extract.

In asm you just want a `mov` load and a memory-source `or` or `add`, which will set ZF just like you're doing now. Two loads from the same cache line are very cheap; current CPUs have at least 2/clock load throughput. The extra ALU work to extract from a single 128-bit load is just not worth it, even if you did shuffle / `por` to set up for a single `movq`.

In C++, use memcpy to do strict-aliasing safe loads of uint64_t tmp vars, then if(a | b). This is still SIMD, just SWAR (SIMD Within A Register).

`add` is even better than `or`: it can macro-fuse with most `jcc` instructions on Intel Sandybridge-family (but not on AMD); `or` can't fuse with branch instructions on any CPUs. Since your flag bytes are only ever 0 or 1, two non-zero inputs can't sum to zero here; that possibility of wrapping to zero is why you'd normally use `or` in the general case.

(Some addressing modes may defeat micro or macro-fusion on Intel. Or maybe it always works since there's no immediate involved. It really is possible for add rax, [mem] / jnz to go through the front-end and ROB as a single uop, and execute in the back-end as only 2 (load + add/sub-and-branch). Assuming it's about the same as cmp on my Skylake, except it does write the destination so Haswell and later can maybe keep it micro-fused even for indexed addressing modes.)

    #include <cstring>    // memcpy
    #include <cstdint>    // uint64_t

    uint64_t a, b;
    memcpy(&a, noise_frame_flags+0, sizeof(a));   // strict-aliasing-safe loads
    memcpy(&b, noise_frame_flags+8, sizeof(b));   // which optimize to MOV qword
    bool  isNoiseToCancel = a + b;   // equivalent to a | b  for 0/1 flag bytes

This should compile to 3 asm instructions which will decode to 2 uops total, or 3 on AMD CPUs where JCC can only fuse with cmp or test.

union { alignas(16) uint8_t flags[16]; uint64_t chunks[2];}; would be safe in C99, but not ISO C++. Most but not all C++ compilers that support Intel intrinsics define the behaviour of union type-punning. (I think @jww has said SunCC doesn't.)
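
For illustration, a minimal sketch of that union approach, assuming a compiler (e.g. GCC) that defines union type-punning; the NoiseFlags name is invented here:

    // Well-defined in C99; technically UB in ISO C++, but GNU C++ and most
    // compilers that support Intel intrinsics define union type-punning.
    union NoiseFlags {                      // hypothetical wrapper type
        alignas(16) uint8_t flags[16];
        uint64_t chunks[2];
    };

    NoiseFlags nf = {};                     // zero-initializes all 16 flags
    // ... omitted noise-detection code sets nf.flags[ch] = 1 ...
    bool isNoiseToCancel = nf.chunks[0] | nf.chunks[1];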

In C++11, you don't need a custom macro for ALIGNTO(16), just use alignas(16). Also supported in C11 if you #include <stdalign.h>.
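
For example, the question's array declaration then needs no macro at all:

    alignas(16) uint8_t noise_frame_flags[16] = { 0 };   // C++11 / C11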


Alternatives:

movdqa 16-byte load / SSE4.1 ptest xmm0, xmm0 / jnz - 4 uops on Intel CPUs, 3 on AMD.
Intel runs ptest as 2 uops, and it can't macro-fuse with jcc.
AMD CPUs run ptest as 1 uop, but it still can't fuse.
If you had an all-ones or all-zeros constant in a register, ptest xmm0, [mem] would work to save a uop on Intel (depending on addressing mode), but that's still 3 total.
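
In intrinsics, that ptest version could look like the following sketch (requires SSE4.1; the function name is invented for illustration):

    #include <immintrin.h>
    #include <cstdint>

    // ptest sets ZF iff (v AND v) == 0, i.e. iff all 16 flag bytes are zero.
    static inline bool any_noise_ptest(const uint8_t flags[16])
    {
        __m128i v = _mm_load_si128((const __m128i*)flags);  // movdqa: needs 16-byte alignment
        return !_mm_testz_si128(v, v);                      // ptest xmm, xmm + branch/setcc
    }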

PTEST is only good for checking a 32-byte array with AVX1 or AVX2. (Surprisingly, vptest ymm only requires AVX1). Then it's about break-even with AVX2 vmovdqa / vpslld ymm0, 7 / vpmovmskb eax,ymm0 / test+jnz. See TrentP's answer for portable GNU C native vector source code that should compile to vptest on x86 with AVX available, and maybe to something clunky on other ISAs like ARM depending on how good their horizontal OR support is.
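
For completeness, a hedged sketch of the 32-byte check (AVX1 is enough for vptest ymm; again the function name is invented):

    #include <immintrin.h>
    #include <cstdint>

    static inline bool any_noise_avx(const uint8_t flags[32])
    {
        __m256i v = _mm256_load_si256((const __m256i*)flags);  // 32-byte aligned load
        return !_mm256_testz_si256(v, v);                      // vptest ymm, ymm
    }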


popcnt wouldn't be useful unless you want to break down the work depending on how many bits are set.
In that case, yes, sure, you can turn the bool array into a bitmap that you can scan easily, probably more efficient than _mm_sad_epu8 against a zeroed register to sum into two 8-byte halves.

    __m128i vflags = _mm_load_si128((__m128i*)noise_frame_flags);
    vflags = _mm_slli_epi32(vflags, 7);              // bytes are 0 or 1, so this moves each byte's low bit to its sign bit
    unsigned flagmask = _mm_movemask_epi8(vflags);   // pack the 16 sign bits into an integer bitmap
    if (flagmask) {
        unsigned flagcount = __builtin_popcount(flagmask);  // popcnt with -march=nehalem or higher
        unsigned first_setflag = __builtin_ctz(flagmask);   // tzcnt if available, else BSF
        flagmask &= flagmask - 1;   // clear lowest set bit.  blsr if compiled with -march=haswell or bdver2 or newer.
        ...
    }
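
If you want to visit every flagged channel, a hypothetical scan loop over that bitmap could look like this (cancelNoiseForChannel is an invented per-channel helper, not part of the question's API):

    while (flagmask) {
        unsigned ch = __builtin_ctz(flagmask);     // index of the lowest flagged channel
        cancelNoiseForChannel(audiobuffer, ch);    // hypothetical per-channel worker
        flagmask &= flagmask - 1;                  // clear that bit (BLSR) and continue
    }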

(Don't actually use -march=bdver2 or -march=nehalem unless you want to set that ISA baseline on purpose; in that case, also pass -mtune=haswell or something more modern. There are individual options like -mpopcnt and -mbmi, but it's generally better to enable all the ISA extensions some real CPU supports, so you don't miss out on useful stuff the compiler could use.)

Peter Cordes
  • N.B.: `ptest xmm0, [mem]` also works with an all-zero register, if you check `CF` instead of `ZF` (I assume there is no other advantage, though). – chtz Jun 08 '22 at 11:21
  • @chtz: Oh right, yeah. Saves a back-end uop on Intel, but not a front-end uop. (xor-zeroing is as cheap as a NOP on Intel, but on AMD it and the all-ones idiom both need a back-end uop to write zeros or ones to a vector.) – Peter Cordes Jun 08 '22 at 17:41

Here's what I came up with for doing this:

#define VLEN 8
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));

// Constants for 128 or 256 bit registers
#if VLEN == 8
#define V(a,b,c,d,e,f,g,h) a,b,c,d,e,f,g,h
#else
#define V(a,b,c,d,e,f,g,h) a,b,c,d
#endif
#define SWAP128 V(4,5,6,7, 0,1,2,3)
#define SWAP64 V(2,3, 0,1,  6,7, 4,5)
#define SWAP32 V(1, 0,  3, 2,  5, 4,  7, 6)

static bool any(vNb x) {
    if (VLEN >= 8)
        x |= __builtin_shufflevector(x,x, SWAP128);
    x |= __builtin_shufflevector(x,x, SWAP64);
    x |= __builtin_shufflevector(x,x, SWAP32);
    return x[0];
}

With VLEN = 8, this will use 256-bit registers if the arch supports it. Change it to 4 to use 128-bit registers.

This should compile to a single vptest instruction.
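
A hypothetical way to hook this up to the question's 16-byte flag array (assumes VLEN == 4 so that sizeof(vNb) == 16; memcpy keeps the load strict-aliasing-safe, as in the other answer):

    #include <cstring>

    vNb v;
    memcpy(&v, noise_frame_flags, sizeof(v));   // 16 flag bytes viewed as 4 ints
    if (any(v))
        cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);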

TrentP
  • Good idea if SSE4.1 is available, but PTEST is 2 uops on mainstream Intel CPUs (1 on Zen, or E-cores of Alder Lake though). So `movdqa` / `ptest xmm0,xmm0` / `jcc` is 4 (or 3) uops total. vs. 3 uops on all modern CPUs for `mov rax, [mem]` / `or rax, [mem+8]` / `jcc`. (Or 4 on Sandybridge or IvyBridge with an indexed addressing mode so the OR unlaminates). Also larger machine-code size. In both cases the OR doesn't macro-fuse with the JCC. Hmm, actually `add` can macro-fuse, and can't overflow, so that's actually more efficient... I'll update my answer. – Peter Cordes Jun 08 '22 at 08:49
  • This idea is good for a 32-byte vector, though; then `vptest ymm,ymm` is efficient and only requires AVX1, vs. `vmovdqa` load / `vpslld ymm0, 7` / `vpmovmskb` / fused `test+jcc` (also 4 uops including the separate load; unfortunately only AVX-512 can use a memory source vector for shift-immediate). BTW, `vptest ymm0, [mem]` could work with an all-ones constant in YMM0, but that only helps if doing it in a loop. If you have to pay for the `vpcmpeqd ymm0,ymm0,ymm0` for every test, it's break-even for the front-end and more back-end uops. – Peter Cordes Jun 08 '22 at 08:51