For your overall positional-popcount problem, see https://github.com/mklarqvist/positional-popcount for heavily optimized implementations, which are also correct, unlike this code, which you obviously haven't had time to debug yet since you were missing a building block. Adding multiple `x & (1<<15)` results into an `int16_t` element is going to overflow right away (the masked value is already 0x8000), so you'd need something else: perhaps a variable-count shift to move the bit of interest down to position 0, or a compare like `(x & mask) == mask`. Or probably better, a total redesign; related SO Q&As cover that approach.
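The compare approach maps well onto SIMD because `vpcmpeqw` produces 0 or -1 per lane. A minimal sketch of that building block (the helper name is mine, not from the question's code):

```c
#include <immintrin.h>

// counts -= ((v & mask) == mask) ? -1 : 0, per 16-bit lane.
// Each lane's counter grows by at most 1 per call, so it can't
// wrap until 65536 input vectors; widen or flush before then.
static inline __m256i add_bit_matches(__m256i counts, __m256i v, __m256i mask)
{
    __m256i hit = _mm256_cmpeq_epi16(_mm256_and_si256(v, mask), mask);
    return _mm256_sub_epi16(counts, hit);
}
```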
The title question: broadcast a `uint16_t`

The instruction is `vpbroadcastw`. It works with a memory or xmm source. On Intel CPUs, it decodes to a load and a shuffle (port 5) uop, unlike 32, 64, or 128-bit broadcasts, which are handled purely in the load port.
The intrinsics for it are:

- `__m256i _mm256_set1_epi16(int16_t)` - if you only have a scalar.
- `__m256i _mm256_broadcastw_epi16(__m128i a)` - to broadcast the bottom element of a vector.
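A quick usage sketch of both (function and variable names are mine):

```c
#include <immintrin.h>
#include <stdint.h>

__m256i broadcast_two_ways(uint16_t x)
{
    __m256i a = _mm256_set1_epi16((int16_t)x);  // from a scalar
    __m128i v = _mm_cvtsi32_si128(x);           // scalar -> xmm (vmovd)
    __m256i b = _mm256_broadcastw_epi16(v);     // from the bottom of a vector
    return _mm256_or_si256(a, b);               // a and b hold the same value
}
```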
To avoid violating the strict-aliasing rule in C, you're correct that accessing `uint64_t p[]` elements and masking them is a safe approach, while pointing a `uint16_t *` at the array wouldn't be. (If you deref it normally; unfortunately there's no load intrinsic that hides the deref inside an aliasing-safe intrinsic, so you'd have to `memcpy` into a `uint16_t` tmp var or something...)
Modern GCC is smart enough to compile `__m256i v4 = _mm256_set1_epi16((p[i] >> 48) & 0xFFFF);` into `vpbroadcastw ymm0, WORD PTR [rdi+6+rdx*8]`, not doing anything stupid like an actual 64-bit scalar shift and then `vmovd` + xmm-source broadcast. It manages that even at just `-Og` (https://godbolt.org/z/W6o5hKTbz).
But that's when only one of the counts is used, with the others optimized away. (I just used a `volatile __m256i sink` to assign things to, as a way to stop the optimizer from removing the loop entirely.) https://godbolt.org/z/fzs9PEbMq shows that with heavier optimization, using count2 and count4 gets GCC to do a scalar load of the `uint64_t` and break it up with two separate scalar shifts, before `vmovd xmm0, edx` / ... / `vmovd xmm0, eax`. So that's quite bad. :/
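For reference, the source for that case looks something like this sketch (my reconstruction of the pattern, not the exact code from the Godbolt link):

```c
// shift-based chunk extraction; GCC compiles this to scalar
// shifts + vmovd + vpbroadcastw instead of two word loads
__m256i v2 = _mm256_set1_epi16((p[i] >> 16) & 0xFFFF);
__m256i v4 = _mm256_set1_epi16((p[i] >> 48) & 0xFFFF);
```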
```c
// compiles to a vpbroadcastw load with an offset
// but violates strict aliasing
__m256i v2 = _mm256_set1_epi16( *(1 + (uint16_t*)&p[i]) );
```
To make that safe, you could use `memcpy` into a temporary, or GNU C `__attribute__((may_alias))`. (The same attribute is used in the definition of `__m256i` itself.)
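A minimal sketch of the `memcpy` option (the helper name is mine); compilers optimize a fixed-size `memcpy` down to a plain load, so no actual function call happens:

```c
#include <stdint.h>
#include <string.h>

// aliasing-safe 16-bit load: chunk selects which 16-bit
// piece of the 8-byte element to read (0..3)
static inline uint16_t load_u16(const uint64_t *p, int chunk)
{
    uint16_t tmp;
    memcpy(&tmp, (const char *)p + 2 * chunk, sizeof(tmp));
    return tmp;
}
// usage: __m256i v2 = _mm256_set1_epi16(load_u16(&p[i], 1));
```

The `may_alias` version: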
```c
typedef uint16_t aliasing_u16 __attribute__((aligned(1), may_alias));

__m256i v1 = _mm256_set1_epi16(*(0 + (aliasing_u16*)&p[i]));
__m256i v2 = _mm256_set1_epi16(*(1 + (aliasing_u16*)&p[i]));
__m256i v3 = _mm256_set1_epi16(*(2 + (aliasing_u16*)&p[i]));
__m256i v4 = _mm256_set1_epi16(*(3 + (aliasing_u16*)&p[i]));
```
This compiles to 4x `vpbroadcastw` loads (https://godbolt.org/z/6v9esqK9P). (Instructions using those loads elided:)
```asm
vpbroadcastw ymm1, WORD PTR [rdi]
...
add     rdi, 8
vpbroadcastw ymm1, WORD PTR [rdi-6]
...
vpbroadcastw ymm1, WORD PTR [rdi-4]
...
vpbroadcastw ymm1, WORD PTR [rdi-2]
...
```
This is probably better for avoiding bottlenecks on port 5 on Intel CPUs. Both `vmovd xmm, eax` and `vpbroadcastw ymm, xmm` are 1 uop that can only run on port 5 on Skylake-family CPUs (https://agner.org/optimize/, https://uops.info/). `vpbroadcastw` with a memory source still needs a shuffle uop (p5), but getting the data into the SIMD domain uses a load port instead of another port 5 uop. And it can micro-fuse the load into a single front-end uop.
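Putting the broadcast together with the compare trick from the top of this answer, a redesigned kernel for one 16-bit chunk could look like the following. This is my own hedged sketch, not code from the question: lane j of the counter vector accumulates how many elements had bit j set in their low 16-bit chunk.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Positional popcount of the low 16-bit chunk of each uint64_t:
// out[j] = number of elements whose bit j was set.
// Requires n < 65536, or the 16-bit counters wrap; a real version
// would flush/widen periodically and handle all four chunks.
void pospopcnt_low16(const uint64_t *p, size_t n, uint16_t out[16])
{
    const __m256i masks = _mm256_setr_epi16(   // lane j holds 1<<j
        1<<0, 1<<1, 1<<2,  1<<3,  1<<4,  1<<5,  1<<6,  1<<7,
        1<<8, 1<<9, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, (short)0x8000);
    __m256i counts = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i++) {
        __m256i v   = _mm256_set1_epi16((int16_t)(p[i] & 0xFFFF));
        __m256i hit = _mm256_cmpeq_epi16(_mm256_and_si256(v, masks), masks);
        counts = _mm256_sub_epi16(counts, hit);  // hit is 0 or -1 per lane
    }
    _mm256_storeu_si256((__m256i *)out, counts);
}
```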