Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

Question

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array.

On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd with AVX512 or vpmovmskb with AVX256. When I use clang's bool vector extension in combination with __builtin_convertvector, I'm happy with the results:

uint64_t handvectorized(const bool* input_) noexcept {
    const bool* __restrict input = std::assume_aligned<64>(input_);

    using VecBool64 __attribute__((vector_size(64))) = char;
    using VecBitmaskT __attribute__((ext_vector_type(64))) = bool;

    auto input_vec = *reinterpret_cast<const VecBool64*>(input);
    auto result_vec = __builtin_convertvector(input_vec, VecBitmaskT);    

    return reinterpret_cast<uint64_t&>(result_vec);
}

produces (godbolt):

vmovdqa64  zmm0, zmmword ptr [rdi]
vptestmb   k0, zmm0, zmm0
kmovq      rax, k0
vzeroupper
ret

However, I can not get clang (or GCC, or ICX) to produce anything that uses vector mask extraction like this with (portable) scalar code. For this implementation:

uint64_t loop(const bool* input_) noexcept {
    const bool* __restrict input = std::assume_aligned<64>(input_);

    uint64_t result = 0;
    for(size_t i = 0; i < 64; ++i) {
        if(input[i]) {
            result |= 1ull << i;
        }
    }
    return result;
}

clang produces a 64*8B = 512B lookup table and 39 instructions.

This implementation, and some other scalar implementations (branchless, inverse bit order, using std::bitset) that I've tried, can all be found on godbolt. None of them results in code close to the handwritten vector instructions.

Is there anything I'm missing or any reason the optimization doesn't work well here? Is there a scalar version I can write that produces reasonably vectorized code?

I'm wondering especially since the "handvectorized" version doesn't use any platform-specific intrinsics, and doesn't really have much programming to it. All it does is "load as vector" and "convert to bitmask". Perhaps clang simply doesn't detect the loop pattern? It just feels strange to me, a simple bitwise OR reduction loop feels like a common pattern, and the documentation of the loop vectorizer explicitly lists reductions using OR as a supported feature.

Edit: Updated godbolt link with suggestions from the comments

Edit2: I just realized there is an open LLVM issue for exactly this problem: https://github.com/llvm/llvm-project/issues/41997

Related: [Is it possible to convince clang to auto-vectorize this code without using intrinsics?](https://stackoverflow.com/q/56231449) - failure to auto-vectorize with `vpmovmskb` for compares; the answer succeeded at getting clang to do something else with compare results, but that's not an option here where you really want the mask. Also [How to create a byte out of 8 bool values (and vice versa)?](https://stackoverflow.com/q/8461126) but none of those will auto-vectorize into pmovmskb either. — Peter Cordes, Jan 06 '23 at 12:33
Is `std::bitset<64>(input, 64)` even correct for `bool *input`? The only ISO standard constructor that takes a pointer wants `CharT*` https://en.cppreference.com/w/cpp/utility/bitset/bitset. That might be why it makes code that's even more insane than most, `cmp`/`jne` on each byte to check that it has value `1`, otherwise call `std::__throw_invalid_argument` with arg `.asciz "bitset::_M_copy_from_ptr"`. (I guess integer `1` is the only possible value for an implicit-length C string that's not a terminator and also a valid `bool`? IDK how there wasn't a type mismatch.) — Peter Cordes, Jan 06 '23 at 12:54
Anyway, your Godbolt already included all my ideas; the minor variations I tried like `result |= (uint64_t)input[i] << i;` instead of a multiply didn't help. I'm not optimistic it's possible; I'm curious if anyone knows what to look for in the source code to maybe figure out if clang / LLVM does know how to invent `vpmovmskb` itself at all. — Peter Cordes, Jan 06 '23 at 12:55
@PeterCordes thanks for the ideas. I think for the bitset, `CharT` can be `bool`, at least I don't see anything preventing that on cppreference. But you're right, I messed up the zero and one values, as `'0'` and `'1'` aren't `0` and `1`. If you use `std::bitset<64>(input, 64, false, true)`, the code is a bit better (error checks are skipped), but still not properly vectorized. — He3lixxx, Jan 06 '23 at 13:01
Ah, right, I see how that constructor works now, taking args for for what maps to false vs. true. That would definitely be a place that a good library implementation could specialize the template the intrinsics, similar to how `std::bitset .count()` was a way that C++ pre-20 could expose an efficient popcnt via a portable API. But I guess libstdc++ doesn't do that, probably not libc++ either (`-stdlib=libc++`) — Peter Cordes, Jan 06 '23 at 13:11
Possibly relevant: https://stackoverflow.com/q/15273964/580083. The problem might be with `basic_string`, namely with missing `char_traits`. For example, libc++ now defines a generic primary template for `char_traits`, but it is described as to be removed in the future: https://github.com/llvm/llvm-project/blob/main/libcxx/include/__string/char_traits.h#L75. — Daniel Langr, Jan 06 '23 at 14:05

Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

0 Answers0