This doesn't appear to work/compile:

#include <stddef.h>

void vec(size_t n) {
    typedef char v4si __attribute__((vector_size(n)));  /* n is a runtime variable, not a compile-time constant */
    v4si t = {1};
}

Is there a proper way to declare this or is it unsupported?
No, that would make no sense. It's like trying to select uint32_t vs. uint64_t at runtime based on the value of some variable.
Manual vectorization does not work by treating the whole array as one giant SIMD vector, it works by telling the compiler exactly how to use fixed-size short vectors. If auto-vectorization doesn't work with normal arrays, this is not going to help.
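For example, here is a minimal sketch (the function name and the 16-byte width are illustrative, not from the question) of what manual vectorization with GNU C native vectors looks like: the vector width is a compile-time constant, and you loop over the array in fixed-size chunks:

#include <stddef.h>

typedef char v16qi __attribute__((vector_size(16)));   /* 16 bytes, fixed at compile time */

/* Add 1 to every byte of an array, 16 bytes per iteration. */
void add1_bytes(char *p, size_t n) {
    for (size_t i = 0; i + 16 <= n; i += 16) {
        v16qi v;
        __builtin_memcpy(&v, p + i, 16);   /* unaligned load into the vector */
        v += 1;                            /* element-wise add across all 16 lanes */
        __builtin_memcpy(p + i, &v, 16);   /* store back */
    }
    /* a real version would also need scalar cleanup for the n % 16 leftover bytes */
}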
To get GCC to "try harder" to auto-vectorize a loop if you don't want to do it manually, there's #pragma omp simd with gcc -fopenmp, which can auto-vectorize even at -O2. Or compiling with -O3 will consider every loop as a candidate for auto-vectorization. (That also covers some work on single structs; clang is generally better than gcc at finding SIMD use-cases in non-looping code, though. clang may sometimes be too aggressive and spend more time shuffling data together than it would cost to just do separate scalar work.)
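As a rough sketch of the pragma approach, assuming a simple loop whose trip count is known before it starts (the function name is made up):

#include <stddef.h>

/* Build with e.g. gcc -O2 -fopenmp (or -fopenmp-simd to get just the SIMD
   pragmas without the OpenMP threading runtime), or plain -O3 without the pragma. */
void scale(float *dst, const float *src, size_t n, float factor) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}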
But note that GCC and clang's auto-vectorization can only work if the loop trip-count can be calculated before the first iteration. It can be a runtime-variable count, but an if()break; exit condition that could trigger at any time depending on the data will defeat them. So e.g. they can't auto-vectorize a naive looping strlen or strchr implementation that uses while(*p++ != 0){...}. ICC can do that.
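To illustrate the difference, a sketch of two hypothetical loops: the first has a data-dependent exit (the same idea as while(*p++ != 0)) and defeats GCC/clang auto-vectorization; the second has a runtime-variable trip count that is still known before the first iteration, so it is a normal candidate:

#include <stddef.h>

/* Exit condition depends on the data, so the trip count is unknown up front:
   GCC and clang won't auto-vectorize this naive strlen. */
size_t my_strlen(const char *p) {
    size_t len = 0;
    while (p[len] != 0)
        len++;
    return len;
}

/* n is a runtime variable, but it's known before the loop starts:
   a normal auto-vectorization candidate at -O3. */
int sum(const int *a, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}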
Also if you need any kind of shuffling, you'll often need to do that yourself with GNU C native vectors, or target-specific intrinsics like SSE/AVX for x86, NEON/AdvSIMD for ARM, AltiVec for Power, etc.
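For example, a small shuffle written with GCC's generic __builtin_shuffle on a native vector type (clang spells this differently, as __builtin_shufflevector, and target intrinsics would be another option):

typedef int v4si __attribute__((vector_size(16)));

/* Reverse the 4 int elements; the mask holds the source index for each
   output lane.  On x86 this typically compiles to a single pshufd. */
v4si reverse4(v4si v) {
    v4si mask = {3, 2, 1, 0};
    return __builtin_shuffle(v, mask);
}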
Cray machines apparently had SIMD that worked by giving the hardware a pointer + length and letting it "loop" in whatever chunks it wanted (maybe like how modern x86 rep movsd can actually use larger chunks in its microcode). But modern CPUs have fixed-width short-vector SIMD instructions that can, for example, do exactly 16 or exactly 32 bytes.
(ARM SVE is sort of part-way between, allowing forward compatibility for code to take advantage of wider vectors on future HW instead of fully baking in a vector width. It's still a fixed size you can't control, though. You still have to loop using it, and increment your pointer by the hardware's vector-width. It has masking stuff to ignore elements past the end of what you want to process so you can use it for arbitrarily short arrays, I think, and for the leftover end of an array. But for arbitrarily long arrays you still need to loop. Also, very few CPUs support SVE yet. BTW, SVE is a similar concept to SIMD in Agner Fog's ForwardCom blue-sky paper architecture, which also aims to let code take advantage of future wider hardware without recompiling or redoing manual vectorization.)
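A rough sketch of how that looks with the SVE ACLE intrinsics (the function is made up; svcntw() gives however many 32-bit lanes the hardware actually has, and the predicate from svwhilelt masks off elements past the end, so the tail needs no scalar cleanup):

#include <arm_sve.h>   /* needs e.g. -march=armv8-a+sve */
#include <stdint.h>

/* Add 1.0f to n floats, letting the hardware's vector width set the stride. */
void add1(float *p, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);          /* lanes i..n-1 active, rest off */
        svfloat32_t v = svld1_f32(pg, p + i);       /* masked load */
        v = svadd_f32_x(pg, v, svdup_f32(1.0f));
        svst1_f32(pg, p + i, v);                    /* masked store */
    }
}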
What kind of asm code-gen are you hoping to get from a runtime-variable sized "vector" when targeting a machine that has fixed-width SIMD vectors, like a choice of 16 or 32 bytes, with the choice being made as part of the instruction encoding?