Why can't Clang get __m128's data by index in constexpr function

Question

#include <cstddef>
#include <immintrin.h>

constexpr float get_data(__m128 a, std::size_t pos) {
  return a[pos];
}

This compiles with GCC but not with Clang. Is there a workaround that makes it work with Clang as well?

[Get member of __m128 by index?](https://stackoverflow.com/q/12624466/995714), [accessing __m128 fields across compilers](https://stackoverflow.com/q/19582893/995714), [Accessing the fields of a __m128i variable in a portable way](https://stackoverflow.com/q/71616855/995714) — phuclv, Jul 23 '22 at 02:18
From the docs: *You should not access the __m128 fields directly.* Anyway, what would be the purpose of a `constexpr` CPU register? — Adrian Mole, Jul 23 '22 at 04:27
I'm working on a math library. I think it would be cool if it could use SIMD instructions and still be constexpr. — ryblust, Jul 23 '22 at 06:49
@AdrianMole: `__m128` is no more a CPU register than `int` or `float`. Of course it's most common to only use `__m128` objects that are local temporaries that the optimizer can keep in registers *if it can't do constant-propagation through the computation*, while `int` gets used in arrays and stuff as well as for locals. But there's the thing, sometimes your vectorized function might get used with constant inputs, and you don't want to have to write a separate version of it or the caller that uses `if(__builtin_constant_p(a))` or `if(std::is_constant_evaluated())`. — Peter Cordes, Jul 23 '22 at 16:24
But yes, regardless of constexpr, `a[pos]` is only valid as a GNU C extension, not portable to MSVC. Storing to an array, or C++20 `std::bit_cast` to a `struct{ float f[4]; }` might work. `std::bit_cast` is constexpr-compatible, unlike other type-punning methods. Although I'd be worried about how efficiently that would compile across compilers for runtime-variable `pos`. — Peter Cordes, Jul 23 '22 at 16:27
@PeterCordes: Yes, that's the thing. I tried `std::bit_cast` and `__builtin_bit_cast`; unfortunately, Clang does not support them. — ryblust, Jul 23 '22 at 17:24
Clang supports `std::bit_cast`, maybe you forgot to tell it to compile as C++20 or later? https://godbolt.org/z/Khvxhnff9 shows a version of your function using it. — Peter Cordes, Jul 23 '22 at 17:53

Peter Cordes · Accepted Answer · 2022-07-24T14:51:48.057

Regardless of constexpr, a[pos] is only valid as a GNU C extension and is not portable to MSVC. Storing to an array, or using C++20 std::bit_cast to a struct, might work. bit_cast is constexpr-compatible, unlike other type-punning methods, although I'd be worried about how efficiently it compiles across compilers for a runtime-variable pos.

bit_cast does compile OK with clang and works in a constexpr function, but it compiles inefficiently with GCC.

Correction: clang compiles this, but rejects it when it is called in a context that requires constant evaluation, with the diagnostic: note: constexpr bit_cast involving type '__attribute__((__vector_size__(4 * sizeof(float)))) float const' (vector of 4 'float' values) is not yet supported.

Other failed attempts with current clang in a constexpr context:

_mm_store_ps: not supported. Neither is *(__m128*)f = a; because it's a reinterpret_cast.
f[0] = vec[0] etc. as initializers: no, even literal-constant indexing of a GNU C native vector isn't supported by clang in constexpr.
union type punning: reading an inactive member is not allowed in a constexpr context.
_mm_cvtss_f32(vec): a non-constexpr function, so unusable; that leaves no chance of using if constexpr for separate shuffles and returns.

The following answer does not work in a constant-evaluated context. It may work at some point in the future, but not with clang trunk (pre-15.0).

#include <cstddef>
#include <immintrin.h>
#include <bit>

// portable, but inefficient with GCC
constexpr float get_data(__m128 a, std::size_t pos) {
    struct foo { float f[4]; } s;
    s = std::bit_cast<foo>(a);
    return s.f[pos];
}

float test_idx2(__m128 a){
    return get_data(a, 2);
}

float test_idxvar(__m128 a, std::size_t pos){
    return get_data(a, pos);
}

With clang, these compile to decent asm on Godbolt, the same as you'd get from clang with a[pos]. I used -O3 -march=haswell -std=gnu++20.

# clang 14 -O3 -march=haswell -std=gnu++20
# get_data has no asm output; constexpr is like inline in that respect

test_idx2(float __vector(4)):
        vpermilpd       xmm0, xmm0, 1           # xmm0 = xmm0[1,0]
        ret
test_idxvar(float __vector(4), unsigned long):
        vmovups xmmword ptr [rsp - 16], xmm0
        vmovss  xmm0, dword ptr [rsp + 4*rdi - 16] # xmm0 = mem[0],zero,zero,zero
        ret

Store/reload is a sensible strategy for a runtime-variable index, although vmovd / vpermilps would be an option since AVX introduced a variable-control shuffle that uses dword indices. An out-of-range index is UB so the compiler doesn't have any requirement to return any specific data in that case.
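The store/reload strategy can also be written by hand, which is the "storing to an array" option mentioned at the top of this answer. A minimal runtime-only sketch, assuming x86 with SSE; the name get_data_store and the assert on pos are illustrative additions, not part of the original code:

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Hypothetical helper: runtime-only, because _mm_storeu_ps is not constexpr.
inline float get_data_store(__m128 a, std::size_t pos) {
    assert(pos < 4);       // assumption: catch out-of-range indices in debug builds
    float f[4];
    _mm_storeu_ps(f, a);   // spill all four lanes to memory
    return f[pos];         // reload the requested lane
}
```

With optimization enabled, a compiler is free to turn this back into a shuffle for a constant pos, so the generated asm is worth checking per compiler.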

Using vpermilpd for the constant index 2 is a waste of code-size vs. vmovhlps xmm0, xmm0, xmm0 or vunpckhpd. It costs a longer VEX prefix and an immediate, so 2 bytes of machine-code size, but otherwise same performance on most CPUs.
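For a compile-time-constant index, the shuffle can also be picked by hand with intrinsics. A sketch, assuming x86 with SSE and C++17 for if constexpr; extract_lane is a hypothetical name, and the function is runtime-only because these intrinsics are not constexpr. Lane 2 uses _mm_movehl_ps, the intrinsic that corresponds to movhlps:

```cpp
#include <immintrin.h>

// Hypothetical helper: the lane index is a template parameter, so each
// instantiation contains one fixed shuffle. Not usable in constant expressions.
template <int I>
inline float extract_lane(__m128 a) {
    static_assert(I >= 0 && I < 4, "lane index out of range");
    if constexpr (I == 0) {
        return _mm_cvtss_f32(a);                    // lane 0 is already the low element
    } else if constexpr (I == 2) {
        return _mm_cvtss_f32(_mm_movehl_ps(a, a));  // move the high half to the low half
    } else {
        // Broadcast lane I to every position, then read the low element.
        return _mm_cvtss_f32(_mm_shuffle_ps(a, a, _MM_SHUFFLE(I, I, I, I)));
    }
}
```

The optimizer may still substitute a different shuffle, so this only expresses a preference rather than guaranteeing the instruction.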

Unfortunately GCC doesn't do such a good job

We get a store/reload even for the fixed index of 2 and, even worse, the reload bounces through a general-purpose integer register. This is a missed optimization, but I don't know how quickly it would get fixed if reported. So if you're going to do this, perhaps use #ifdef __clang__ or #ifdef __llvm__ to select bit_cast, and #ifdef __GNUC__ to select a[pos]. (Clang also defines __GNUC__, so check for that only after special-casing clang.)

# gcc12 -O3 -march=haswell -std=gnu++20
test_idx2(float __vector(4)):
        vmovaps XMMWORD PTR [rsp-24], xmm0
        mov     rax, QWORD PTR [rsp-16]
        vmovd   xmm0, eax              # slow: should have loaded directly from mem
        ret

test_idxvar(float __vector(4), unsigned long):
        vmovdqa XMMWORD PTR [rsp-24], xmm0
        vmovss  xmm0, DWORD PTR [rsp-24+rdi*4]   # this is fine, same as clang
        ret

Interestingly the runtime-variable version didn't have the same anti-optimization for GCC.
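The per-compiler selection suggested above could look roughly like this. It is a sketch, not a tested recommendation: the #else branch for other compilers (such as MSVC) is an assumption based on the store-to-array idea, and the clang branch needs -std=c++20 or later for std::bit_cast.

```cpp
#include <cstddef>
#include <immintrin.h>

// Check __clang__ first: clang also defines __GNUC__.
#if defined(__clang__)
#include <bit>  // std::bit_cast, C++20
// Fine at runtime; clang 14 rejects it where constant evaluation is required.
constexpr float get_data(__m128 a, std::size_t pos) {
    struct lanes { float f[4]; };
    return std::bit_cast<lanes>(a).f[pos];
}
#elif defined(__GNUC__)
// GNU C vector indexing, which GCC also accepts in constant expressions.
constexpr float get_data(__m128 a, std::size_t pos) {
    return a[pos];
}
#else
// Assumed fallback for other compilers: store and reload, runtime only.
inline float get_data(__m128 a, std::size_t pos) {
    float f[4];
    _mm_storeu_ps(f, a);
    return f[pos];
}
#endif
```

Callers use get_data(a, pos) the same way in every branch; only the GCC branch is known from the question to work as a constant expression.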

[https://godbolt.org/z/T6bvzWjfW](https://godbolt.org/z/T6bvzWjfW) It's still not working. Clang does not support `std::bit_cast` `__m128` to struct of 4 floats. — ryblust, Jul 24 '22 at 10:43
@sorry: My answer shows clang does support that `bit_cast`. The not yet supported part is unfortunately doing it in a constexpr context; if you'd said that in the first place, I could have avoided wasting time writing it. I assumed if bit_cast worked at all, it would work as constexpr, but apparently that's not the case. — Peter Cordes, Jul 24 '22 at 14:15
I'm sorry I didn't make it clear before. But I learned from your answers. Thanks for your reply. — ryblust, Jul 24 '22 at 15:10
