Yes, you eventually want a 64-bit OR to look for any non-zero bits in either half, but it's not efficient to get those uint64_t values from a 128-bit load and then extract.
In asm you just want a mov load and a memory-source or (or add), which will set ZF just like you're doing now. Two loads from the same cache line are very cheap; current CPUs have at least 2/clock load throughput. The extra ALU work to extract from a single 128-bit load is just not worth it, even if you did shuffle / por to set up for a single movq.
In C++, use memcpy to do strict-aliasing-safe loads of uint64_t tmp vars, then if(a | b). This is still SIMD, just SWAR (SIMD Within A Register).
add is even better than or: it can macro-fuse with most jcc instructions on Intel Sandybridge-family (but not AMD). or can't fuse with branch instructions on any CPUs. Since your values are 0 or 1, we can't have a case of two non-zero values adding to produce a zero, which is why you'd normally use or for the general case.
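To make the add-versus-or point concrete, here is a small compile-time sketch. The helper names nonzero_via_or and nonzero_via_add are made up for illustration; the only assumption is the one stated above, that every byte is 0 or 1.

```cpp
#include <cstdint>

// Hypothetical helpers, named only for this illustration.
constexpr bool nonzero_via_or(std::uint64_t a, std::uint64_t b)  { return (a | b) != 0; }
constexpr bool nonzero_via_add(std::uint64_t a, std::uint64_t b) { return (a + b) != 0; }

// Packed 0/1 bytes: each byte of the sum is at most 2, so nothing carries
// out of a byte and the 64-bit sum is zero only when both inputs are zero.
static_assert(nonzero_via_add(0x0100000000000001ULL, 0x0100000000000001ULL), "0/1 bytes");
static_assert(!nonzero_via_add(0, 0), "all flags clear");

// General values: unsigned wrap-around can make the sum zero even though
// bits are set, which is why or is the safe choice in the general case.
static_assert(nonzero_via_or(1, UINT64_MAX), "or sees the set bits");
static_assert(!nonzero_via_add(1, UINT64_MAX), "add wraps to zero");
```

If the flag bytes could ever hold values other than 0 or 1, stick with or.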
(Some addressing modes may defeat micro or macro-fusion on Intel. Or maybe it always works since there's no immediate involved. It really is possible for add rax, [mem] / jnz to go through the front-end and ROB as a single uop, and execute in the back-end as only 2 (load + add/sub-and-branch). Assuming it's about the same as cmp on my Skylake, except it does write the destination so Haswell and later can maybe keep it micro-fused even for indexed addressing modes.)
uint64_t a, b;
memcpy(&a, noise_frame_flags+0, sizeof(a)); // strict-aliasing-safe loads
memcpy(&b, noise_frame_flags+8, sizeof(b)); // which optimize to MOV qword
bool isNoiseToCancel = a + b; // equivalent to a | b for bool inputs
This should compile to 3 asm instructions which will decode to 2 uops total, or 3 on AMD CPUs where JCC can only fuse with cmp or test.
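Wrapped up as a function, the same idea might look like the sketch below. The name any_flag_set is hypothetical, and it assumes a 16-byte array whose bytes are all 0 or 1, as in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if any of the 16 flag bytes is non-zero.
// Assumes every byte is 0 or 1, so the 64-bit sum cannot wrap to zero.
inline bool any_flag_set(const std::uint8_t* flags16) {
    assert(flags16 != nullptr);
    std::uint64_t lo, hi;
    std::memcpy(&lo, flags16 + 0, sizeof(lo));  // strict-aliasing-safe 8-byte load
    std::memcpy(&hi, flags16 + 8, sizeof(hi));  // second half of the same cache line
    return (lo + hi) != 0;  // same truth value as (lo | hi) != 0 for 0/1 bytes
}
```

Check the compiler's asm output for your target before relying on the uop counts; the memcpy calls should disappear at any normal optimization level.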
union { alignas(16) uint8_t flags[16]; uint64_t chunks[2];}; would be safe in C99, but not ISO C++. Most but not all C++ compilers that support Intel intrinsics define the behaviour of union type-punning. (I think @jww has said SunCC doesn't.)
In C++11, you don't need a custom macro for ALIGNTO(16), just use alignas(16). Also supported in C11 if you #include <stdalign.h>.
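A minimal sketch of the alignas version, with compile-time checks. The struct NoiseFrame is a hypothetical wrapper added for illustration; only the noise_frame_flags name comes from the question.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper struct; noise_frame_flags is the array from the question.
struct NoiseFrame {
    alignas(16) std::uint8_t noise_frame_flags[16];
};

// alignas on a member raises the alignment of the enclosing type as well.
static_assert(alignof(NoiseFrame) >= 16, "flags must be 16-byte aligned");
static_assert(offsetof(NoiseFrame, noise_frame_flags) == 0, "flags start the struct");
```

With that in place, aligned loads like movdqa on the array are safe for any NoiseFrame object the compiler allocates.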
Alternatives:
movdqa 16-byte load / SSE4.1 ptest xmm0, xmm0 / jnz - 4 uops on Intel CPUs, 3 on AMD.
Intel runs ptest as 2 uops, and it can't macro-fuse with jcc.
AMD CPUs run ptest as 1 uop, but it still can't fuse.
If you had an all-ones or all-zeros constant in a register, ptest xmm0, [mem] would work to save a uop on Intel (depending on addressing mode), but that's still 3 total.
PTEST is only good for checking a 32-byte array with AVX1 or AVX2. (Surprisingly, vptest ymm only requires AVX1.) Then it's about break-even with AVX2 vmovdqa / vpslld ymm0, ymm0, 7 / vpmovmskb eax, ymm0 / test+jnz. See TrentP's answer for portable GNU C native vector source code that should compile to vptest on x86 with AVX available, and maybe to something clunky on other ISAs like ARM depending on how good their horizontal OR support is.
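For comparison, the 16-byte ptest alternative can be written with intrinsics roughly as follows. This is only a sketch: the function name any_flag_set_ptest is made up, the ptest path needs SSE4.1 enabled at compile time (e.g. -msse4.1), and the other two branches are plain fallbacks so the code still builds elsewhere, not part of the ptest technique.

```cpp
#include <cstdint>
#if defined(__SSE4_1__)
#include <smmintrin.h>   // _mm_testz_si128 (ptest)
#elif defined(__SSE2__)
#include <emmintrin.h>   // pcmpeqb / pmovmskb fallback
#endif

// Hypothetical helper: true if any of the 16 bytes is non-zero.
// The SIMD paths assume flags is 16-byte aligned.
inline bool any_flag_set_ptest(const std::uint8_t* flags) {
#if defined(__SSE4_1__)
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(flags));
    return !_mm_testz_si128(v, v);  // testz returns 1 when (v & v) == 0
#elif defined(__SSE2__)
    // Fallback, not ptest: compare every byte against zero.
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(flags));
    __m128i is_zero = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(is_zero) != 0xFFFF;
#else
    // Portable scalar fallback for non-x86 targets.
    unsigned acc = 0;
    for (int i = 0; i < 16; ++i) acc |= flags[i];
    return acc != 0;
#endif
}
```

Given the uop counts above, this is mostly useful as a baseline to benchmark the two-load scalar version against.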
popcnt wouldn't be useful unless you want to break down the work depending on how many bits are set.
In that case, yes, sure, you can turn the bool array into a bitmap that you can scan easily, probably more efficient than _mm_sad_epu8 against a zeroed register to sum into two 8-byte halves.
__m128i vflags = _mm_load_si128((__m128i*)noise_frame_flags);
vflags = _mm_slli_epi32(vflags, 7);   // move bit 0 of each byte to bit 7, where movemask reads it
unsigned flagmask = _mm_movemask_epi8(vflags);
if (flagmask) {
    unsigned flagcount = __builtin_popcount(flagmask);  // popcnt with -march=nehalem or higher
    unsigned first_setflag = __builtin_ctz(flagmask);   // tzcnt if available, else BSF
    flagmask &= flagmask - 1;  // clear lowest set bit. blsr if compiled with -march=haswell or bdver2 or newer.
    ...
}
(Don't actually use -march=bdver2 or -march=nehalem, unless you want to set an ISA baseline but also use -mtune=haswell or something more modern. There are individual options like -mpopcnt and -mbmi, but it's generally good to enable all ISA extensions that some CPU supports, so you don't miss out on useful stuff the compiler can use.)
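Putting the bitmap idea together, a loop over the set flags could look like this sketch. The helper names flags_to_bitmap and set_flag_indices are hypothetical, __builtin_ctz assumes GCC or Clang, and the non-SSE2 branch is just a scalar fallback so the code builds on other targets.

```cpp
#include <cstdint>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: 16-bit bitmap with bit i set when flags[i] == 1.
// Assumes each byte is 0 or 1, and 16-byte alignment for the SSE2 path.
inline unsigned flags_to_bitmap(const std::uint8_t* flags) {
#if defined(__SSE2__)
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(flags));
    v = _mm_slli_epi32(v, 7);  // move bit 0 of each byte up to bit 7
    return static_cast<unsigned>(_mm_movemask_epi8(v));  // collect the 16 top bits
#else
    unsigned mask = 0;  // scalar fallback, not the SIMD technique
    for (int i = 0; i < 16; ++i) mask |= static_cast<unsigned>(flags[i] & 1u) << i;
    return mask;
#endif
}

// Hypothetical helper: indices of the set flags, lowest first.
inline std::vector<int> set_flag_indices(const std::uint8_t* flags) {
    std::vector<int> out;
    for (unsigned m = flags_to_bitmap(flags); m != 0; m &= m - 1) {  // m &= m - 1 clears the lowest set bit
        out.push_back(__builtin_ctz(m));  // index of lowest set bit; m is non-zero here
    }
    return out;
}
```

In real code you would probably do the per-flag work inside the loop instead of collecting indices into a vector; the vector just makes the result easy to inspect.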