This doesn't appear to work/compile:

#include <stddef.h>

void vec(size_t n) {
    typedef char v4si __attribute__((vector_size(n)));  /* n is a runtime variable, not a compile-time constant */
    v4si t = {1};
}

Is there a proper way to declare this or is it unsupported?
No, that would make no sense. It's like trying to select uint32_t vs. uint64_t at runtime based on the value of some variable.
Manual vectorization does not work by treating the whole array as one giant SIMD vector, it works by telling the compiler exactly how to use fixed-size short vectors. If auto-vectorization doesn't work with normal arrays, this is not going to help.
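For example, here is a minimal sketch (the function name and the 16-byte width are illustrative, not from the question) of what manual vectorization with GNU C native vectors looks like: the vector width is a compile-time constant, and you loop over the array in fixed-size chunks:

#include <stddef.h>

typedef char v16qi __attribute__((vector_size(16)));   /* 16 bytes, fixed at compile time */

/* Add 1 to every byte of an array, 16 bytes per iteration. */
void add1_bytes(char *p, size_t n) {
    for (size_t i = 0; i + 16 <= n; i += 16) {
        v16qi v;
        __builtin_memcpy(&v, p + i, 16);   /* unaligned load into the vector */
        v += 1;                            /* element-wise add across all 16 lanes */
        __builtin_memcpy(p + i, &v, 16);   /* store back */
    }
    /* a real version would also need scalar cleanup for the n % 16 leftover bytes */
}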
To get GCC to "try harder" to auto-vectorize a loop if you don't want to do it manually, there's #pragma omp simd with gcc -fopenmp, which can auto-vectorize even at -O2. Or compiling with -O3 will consider every loop as a candidate for auto-vectorization. (That also covers some work on single structs; clang is generally better than gcc at finding SIMD use-cases in non-looping code, though. clang may sometimes be too aggressive and spend more time shuffling data together than it would cost to just do separate scalar work.)
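As a rough sketch of the pragma approach, assuming a simple loop whose trip count is known before it starts (the function name is made up):

#include <stddef.h>

/* Build with e.g. gcc -O2 -fopenmp (or -fopenmp-simd to get just the SIMD
   pragmas without the OpenMP threading runtime), or plain -O3 without the pragma. */
void scale(float *dst, const float *src, size_t n, float factor) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}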
But note that GCC and clang's auto-vectorization can only work if the loop trip-count can be calculated before the first iteration. It can be a runtime-variable count, but an if()break; exit condition that could trigger at any time depending on the data will defeat them. So e.g. they can't auto-vectorize a naive looping strlen or strchr implementation that uses while(*p++ != 0){...}. ICC can do that.
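To illustrate the difference, a sketch of two hypothetical loops: the first has a data-dependent exit (the same idea as while(*p++ != 0)) and defeats GCC/clang auto-vectorization; the second has a runtime-variable trip count that is still known before the first iteration, so it is a normal candidate:

#include <stddef.h>

/* Exit condition depends on the data, so the trip count is unknown up front:
   GCC and clang won't auto-vectorize this naive strlen. */
size_t my_strlen(const char *p) {
    size_t len = 0;
    while (p[len] != 0)
        len++;
    return len;
}

/* n is a runtime variable, but it's known before the loop starts:
   a normal auto-vectorization candidate at -O3. */
int sum(const int *a, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}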
Also if you need any kind of shuffling, you'll often need to do that yourself with GNU C native vectors, or target-specific intrinsics like SSE/AVX for x86, NEON/AdvSIMD for ARM, AltiVec for Power, etc.
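For example, a small shuffle written with GCC's generic __builtin_shuffle on a native vector type (clang spells this differently, as __builtin_shufflevector, and target intrinsics would be another option):

typedef int v4si __attribute__((vector_size(16)));

/* Reverse the 4 int elements; the mask holds the source index for each
   output lane.  On x86 this typically compiles to a single pshufd. */
v4si reverse4(v4si v) {
    v4si mask = {3, 2, 1, 0};
    return __builtin_shuffle(v, mask);
}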
Cray machines apparently had SIMD that worked by giving the hardware a pointer + length and letting it "loop" in whatever chunks it wanted (maybe like how modern x86 rep movsd can actually use larger chunks in its microcode). But modern CPUs have fixed-width short-vector SIMD instructions that can, for example, do exactly 16 or exactly 32 bytes.
(ARM SVE is sort of part-way between, allowing forward compatibility for code to take advantage of wider vectors on future HW instead of fully baking in a vector width. It's still a fixed size you can't control, though. You still have to loop using it, and increment your pointer by the hardware's vector-width. It has masking stuff to ignore elements past the end of what you want to process so you can use it for arbitrarily short arrays, I think, and for the leftover end of an array. But for arbitrarily long arrays you still need to loop. Also, very few CPUs support SVE yet. BTW, SVE is a similar concept to SIMD in Agner Fog's ForwardCom blue-sky paper architecture, which also aims to let code take advantage of future wider hardware without recompiling or redoing manual vectorization.)
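A rough sketch of how that looks with the SVE ACLE intrinsics (the function is made up; svcntw() gives however many 32-bit lanes the hardware actually has, and the predicate from svwhilelt masks off elements past the end, so the tail needs no scalar cleanup):

#include <arm_sve.h>   /* needs e.g. -march=armv8-a+sve */
#include <stdint.h>

/* Add 1.0f to n floats, letting the hardware's vector width set the stride. */
void add1(float *p, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);          /* lanes i..n-1 active, rest off */
        svfloat32_t v = svld1_f32(pg, p + i);       /* masked load */
        v = svadd_f32_x(pg, v, svdup_f32(1.0f));
        svst1_f32(pg, p + i, v);                    /* masked store */
    }
}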
What kind of asm code-gen are you hoping to get from a runtime-variable sized "vector" when targeting a machine that has fixed-width SIMD vectors, like a choice of 16 or 32 bytes, with the choice being made as part of the instruction encoding?