#include <cstddef>
#include <immintrin.h>
constexpr float get_data(__m128 a, std::size_t pos) {
    return a[pos];
}
It works on GCC. I wonder whether there is any workaround to make it possible.
Regardless of constexpr, a[pos] is only valid as a GNU C extension, not portable to MSVC. Storing to an array, or C++20 std::bit_cast to a struct, might work. bit_cast is constexpr-compatible, unlike other type-punning methods, although I'd be worried about how efficiently that would compile across compilers for a runtime-variable pos.
bit_cast does compile OK with clang, and works in a constexpr function, but it compiles inefficiently for GCC.
Correction: clang compiles this, but rejects it if called in a context that requires it to be constant-evaluated: note: constexpr bit_cast involving type '__attribute__((__vector_size__(4 * sizeof(float)))) float const' (vector of 4 'float' values) is not yet supported.
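For runtime-only use, the "storing to an array" option mentioned above can be sketched roughly as follows. This is a minimal illustration, not a constexpr solution: _mm_storeu_ps is not a constexpr function, and get_data_store is a hypothetical name chosen for this example.

```cpp
#include <cstddef>
#include <immintrin.h>

// Hypothetical helper name; runtime-only because _mm_storeu_ps is not constexpr.
inline float get_data_store(__m128 a, std::size_t pos) {
    float tmp[4];
    _mm_storeu_ps(tmp, a);  // unaligned store, so tmp needs no special alignment
    return tmp[pos];        // pos must be in 0..3; anything else is UB
}
```

This avoids the GNU C vector subscript entirely, so it should also be acceptable to MSVC, at the cost of relying on the compiler to optimize away the store/reload for constant indices.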
Other failed attempts with current clang in a constexpr context:
- _mm_store_ps: not supported. Nor is *(__m128*)f = a; because it's a reinterpret_cast.
- f[0] = vec[0] etc. initializers: no, even literal constant indexing of a GNU C native vector isn't supported in clang in constexpr.
- _mm_cvtss_f32(vec): non-constexpr function, unusable, so no chance of using if constexpr for separate shuffles and returns.

Not-working answer, may work at some point in the future but not with clang trunk pre 15.0:
#include <cstddef>
#include <immintrin.h>
#include <bit>
// portable, but inefficient with GCC
constexpr float get_data(__m128 a, std::size_t pos) {
    struct foo { float f[4]; } s;
    s = std::bit_cast<foo>(a);
    return s.f[pos];
}

float test_idx2(__m128 a) {
    return get_data(a, 2);
}

float test_idxvar(__m128 a, std::size_t pos) {
    return get_data(a, pos);
}
With clang, these compile to decent asm on Godbolt, the same as you'd get from clang with a[pos]. I used -O3 -march=haswell -std=gnu++20.
# clang 14 -O3 -march=haswell -std=gnu++20
# get_data has no asm output; constexpr is like inline in that respect
test_idx2(float __vector(4)):
vpermilpd xmm0, xmm0, 1 # xmm0 = xmm0[1,0]
ret
test_idxvar(float __vector(4), unsigned long):
vmovups xmmword ptr [rsp - 16], xmm0
vmovss xmm0, dword ptr [rsp + 4*rdi - 16] # xmm0 = mem[0],zero,zero,zero
ret
Store/reload is a sensible strategy for a runtime-variable index, although vmovd / vpermilps would be an option, since AVX introduced a variable-control shuffle that uses dword indices. An out-of-range index is UB, so the compiler doesn't have any requirement to return any specific data in that case.
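The vmovd / vpermilps alternative can be written by hand with intrinsics, roughly as below. This is a sketch under assumptions: get_data_permute and the LANE_TARGET_AVX macro are names made up for this example, the target attribute is a GCC/clang feature used here so the function builds without -mavx, and the CPU must support AVX at run time. It is not constexpr.

```cpp
#include <cstddef>
#include <immintrin.h>

// Hypothetical macro: lets GCC/clang compile this one function with AVX enabled.
#if defined(__GNUC__) || defined(__clang__)
#define LANE_TARGET_AVX __attribute__((target("avx")))
#else
#define LANE_TARGET_AVX
#endif

// Hypothetical helper name; runtime-only, requires an AVX-capable CPU.
LANE_TARGET_AVX inline float get_data_permute(__m128 a, std::size_t pos) {
    __m128i idx = _mm_cvtsi32_si128(static_cast<int>(pos));  // vmovd: index into lane 0
    __m128 shuffled = _mm_permutevar_ps(a, idx);             // vpermilps with variable control
    return _mm_cvtss_f32(shuffled);                          // lane 0 now holds a[pos & 3]
}
```

The shuffle only looks at the low 2 bits of each control dword, so an out-of-range pos wraps instead of reading outside the vector; since that case is UB for a[pos] anyway, either behavior is allowed.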
Using vpermilpd for the constant index 2 is a waste of code size vs. vmovhlps xmm0, xmm0, xmm0 or vunpckhpd. It costs a longer VEX prefix and an immediate, so 2 bytes of machine-code size, but otherwise the performance is the same on most CPUs.
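If the index is known at compile time, the shuffle can be chosen by hand with if constexpr, as in the rough sketch below. get_lane_const is a hypothetical name; the function is deliberately not constexpr because _mm_cvtss_f32 and the shuffle intrinsics are not constexpr functions, and the compiler remains free to emit a different instruction than the intrinsic suggests.

```cpp
#include <immintrin.h>

// Hypothetical helper: lane index as a template parameter, runtime-only.
template <int I>
inline float get_lane_const(__m128 a) {
    static_assert(I >= 0 && I < 4, "lane index out of range");
    if constexpr (I == 0) {
        return _mm_cvtss_f32(a);                    // lane 0 needs no shuffle
    } else if constexpr (I == 2) {
        return _mm_cvtss_f32(_mm_movehl_ps(a, a));  // movhlps: high half down to lanes 0..1
    } else {
        // Generic case: broadcast lane I with an immediate shuffle.
        return _mm_cvtss_f32(_mm_shuffle_ps(a, a, _MM_SHUFFLE(I, I, I, I)));
    }
}
```

Usage would look like get_lane_const<2>(v). The static_assert turns an out-of-range constant index into a compile error instead of UB.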
With GCC, we get a store/reload even for the fixed index of 2, and even worse, the reload bounces through a GP-integer register. This is a missed optimization, but IDK how quickly it would get fixed if reported. So if you're going to do this, perhaps use #ifdef __clang__ or #ifdef __llvm__ for bit_cast, and #ifdef __GNUC__ for a[pos]. (Clang defines __GNUC__, so check for that after special-casing clang.)
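A rough sketch of that per-compiler dispatch is below. get_lane and GET_LANE_USE_BIT_CAST are hypothetical names, the C++20 check and the final store-to-array fallback for other compilers are additions for this example, and the function is left non-constexpr because only some branches could ever be constant-evaluated.

```cpp
#include <cstddef>
#include <immintrin.h>

// Special-case clang first, because clang also defines __GNUC__.
#if defined(__clang__) && __cplusplus >= 202002L
#include <bit>
#define GET_LANE_USE_BIT_CAST 1  // hypothetical macro for this sketch
#endif

// Hypothetical helper name.
inline float get_lane(__m128 a, std::size_t pos) {
#if defined(GET_LANE_USE_BIT_CAST)
    struct lanes { float f[4]; };
    return std::bit_cast<lanes>(a).f[pos];  // clang in C++20 mode
#elif defined(__GNUC__)
    return a[pos];                          // GNU C vector subscript (GCC, or clang pre-C++20)
#else
    float tmp[4];                           // assumed fallback for other compilers
    _mm_storeu_ps(tmp, a);
    return tmp[pos];
#endif
}
```

All three branches return the same lane for pos in 0..3; the difference is only in which code the respective compiler generates.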
# gcc12 -O3 -march=haswell -std=gnu++20
test_idx2(float __vector(4)):
vmovaps XMMWORD PTR [rsp-24], xmm0
mov rax, QWORD PTR [rsp-16]
vmovd xmm0, eax # slow: should have loaded directly from mem
ret
test_idxvar(float __vector(4), unsigned long):
vmovdqa XMMWORD PTR [rsp-24], xmm0
vmovss xmm0, DWORD PTR [rsp-24+rdi*4] # this is fine, same as clang
ret
Interestingly the runtime-variable version didn't have the same anti-optimization for GCC.