I want to convert a bool[64]
into a uint64_t
where each bit represents the value of an element in the input array.
On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd
with AVX512 or vpmovmskb
with AVX256. When I use clang's bool vector extension in combination with __builtin_convertvector
, I'm happy with the results:
uint64_t handvectorized(const bool* input_) noexcept {
const bool* __restrict input = std::assume_aligned<64>(input_);
using VecBool64 __attribute__((vector_size(64))) = char;
using VecBitmaskT __attribute__((ext_vector_type(64))) = bool;
auto input_vec = *reinterpret_cast<const VecBool64*>(input);
auto result_vec = __builtin_convertvector(input_vec, VecBitmaskT);
return reinterpret_cast<uint64_t&>(result_vec);
}
produces (godbolt):
vmovdqa64 zmm0, zmmword ptr [rdi]
vptestmb k0, zmm0, zmm0
kmovq rax, k0
vzeroupper
ret
However, I can not get clang (or GCC, or ICX) to produce anything that uses vector mask extraction like this with (portable) scalar code. For this implementation:
uint64_t loop(const bool* input_) noexcept {
const bool* __restrict input = std::assume_aligned<64>(input_);
uint64_t result = 0;
for(size_t i = 0; i < 64; ++i) {
if(input[i]) {
result |= 1ull << i;
}
}
return result;
}
clang produces a 64*8B = 512B lookup table and 39 instructions.
This implementation, and some other scalar implementations (branchless, inverse bit order, using std::bitset
) that I've tried, can all be found on godbolt. None of them results in code close to the handwritten vector instructions.
Is there anything I'm missing or any reason the optimization doesn't work well here? Is there a scalar version I can write that produces reasonably vectorized code?
I'm wondering especially since the "handvectorized" version doesn't use any platform-specific intrinsics, and doesn't really have much programming to it. All it does is "load as vector" and "convert to bitmask". Perhaps clang simply doesn't detect the loop pattern? It just feels strange to me, a simple bitwise OR reduction loop feels like a common pattern, and the documentation of the loop vectorizer explicitly lists reductions using OR as a supported feature.
Edit: Updated godbolt link with suggestions from the comments
Edit2: I just realized there is an open LLVM issue for exactly this problem: https://github.com/llvm/llvm-project/issues/41997