
I am discovering programming with vectorized data types for SIMD instructions (with this tutorial). From what I understand, a vector has a fixed size of 16 bytes. This schematic details it well and seems to answer my question:

(Figure: SIMD vectors)

A set of instructions including the basic operations (but also some more specific ones) is provided.

Nevertheless, just out of curiosity, I would like to know whether there is a way to vectorize "custom data", by which I mostly mean structures. I suppose that if the structure fits within 16 bytes it is possible, because in the end the types are only byte sizes. However, the instruction set does not seem to allow operating directly on structures, for example to get a field.

So my question is the following: are we limited to the simple standard C types when vectorizing with SIMD operations? If not, how do we proceed? If so, are there parallelization methods (other than multithreading) to operate simultaneously on vectors / arrays of structures?

Peter Cordes
Foxy

1 Answer


_mm_loadu_si128 / _mm_storeu_si128 are strict-aliasing safe, so you can use them to load from or store to objects of any type. The equivalents for ARM NEON are similar.
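
As a minimal sketch of that point (x86 with SSE2 assumed): the struct `quad` and the function `quad_add` below are hypothetical names made up for illustration, and the code assumes the struct is exactly 16 bytes with no padding, which the static assertion checks.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_storeu_si128, _mm_add_epi32 */
#include <stdint.h>

/* Hypothetical 16-byte struct: four 32-bit members, no padding. */
struct quad { int32_t a, b, c, d; };

_Static_assert(sizeof(struct quad) == 16, "struct quad must fill one 128-bit vector");

/* Add two structs member-wise with a single SIMD add.
   The unaligned load/store intrinsics may alias any type,
   so pointing them at a struct is fine. */
struct quad quad_add(const struct quad *p, const struct quad *q)
{
    __m128i vp = _mm_loadu_si128((const __m128i *)p);
    __m128i vq = _mm_loadu_si128((const __m128i *)q);
    struct quad out;
    _mm_storeu_si128((__m128i *)&out, _mm_add_epi32(vp, vq));
    return out;
}
```

Calling `quad_add(&p, &q)` with `p = {1, 2, 3, 4}` and `q = {10, 20, 30, 40}` gives `{11, 22, 33, 44}`. This only pays off because every member gets the same operation.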

If you know the struct layout (which is fixed for a given ABI), you most certainly can load/store data in large chunks from structs or arrays of structs. For example, "Fast interleave 2 double arrays into an array of structs with 2 float and 1 int (loop invariant) member, with SIMD double->float conversion?" does a packed conversion, then a shuffle and blend. Another example: "Sorting 64-bit structs using AVX?"

Most of what you can do with asm is possible in C with intrinsics.


If you want to do different things to each struct member though, then you usually have a problem. e.g. a struct xy { float x,y; }; geometry vector is a poor fit for SIMD. Adding is fine (it's pure vertical), but dot product or rotation requires combining the x and y components of a single geometry vector, horizontally within a SIMD vector. Shuffling costs extra instructions.
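
To make the vertical vs. horizontal difference concrete, here is a rough sketch (x86 SSE assumed). The function names `xy_add` and `xy_dot2` are hypothetical, and the code assumes `struct xy` has no padding, so two structs fill one 128-bit vector.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_shuffle_ps, ... */
#include <stddef.h>

struct xy { float x, y; };

_Static_assert(sizeof(struct xy) == 2 * sizeof(float), "struct xy must have no padding");

/* Pure vertical: dst[i] = a[i] + b[i]. Two structs per vector, no shuffles. */
void xy_add(struct xy *dst, const struct xy *a, const struct xy *b, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128 va = _mm_loadu_ps((const float *)&a[i]);  /* x0 y0 x1 y1 */
        __m128 vb = _mm_loadu_ps((const float *)&b[i]);
        _mm_storeu_ps((float *)&dst[i], _mm_add_ps(va, vb));
    }
    for (; i < n; i++) {                                 /* scalar tail for odd n */
        dst[i].x = a[i].x + b[i].x;
        dst[i].y = a[i].y + b[i].y;
    }
}

/* Horizontal: out[k] = a[k].x*b[k].x + a[k].y*b[k].y for k = 0, 1.
   The x and y products sit in neighbouring lanes, so extra shuffles are needed. */
void xy_dot2(float out[2], const struct xy *a, const struct xy *b)
{
    __m128 prod = _mm_mul_ps(_mm_loadu_ps((const float *)a),
                             _mm_loadu_ps((const float *)b));  /* xx0 yy0 xx1 yy1 */
    __m128 swap = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1)); /* yy0 xx0 yy1 xx1 */
    __m128 sum  = _mm_add_ps(prod, swap);                      /* d0 d0 d1 d1 */
    out[0] = _mm_cvtss_f32(sum);
    out[1] = _mm_cvtss_f32(_mm_shuffle_ps(sum, sum, _MM_SHUFFLE(2, 2, 2, 2)));
}
```

`xy_add` never needs to know which lane is x and which is y. `xy_dot2` spends two shuffles and still produces only two useful results from a four-lane vector.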

This is the Array of Structs problem, and is usually best solved by storing your data as one struct of arrays. So you'd have float x[] and float y[], so you can do a whole SIMD vector of four dot-products at once between x[i + 0..3], y[i + 0..3] and x[j + 0..3], y[j + 0..3].
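
A sketch of the struct-of-arrays version, again assuming x86 SSE. The function name `dot_soa` and its parameter names are hypothetical; the two sets of geometry vectors are passed as separate x and y arrays.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */
#include <stddef.h>

/* dot[k] = ax[k]*bx[k] + ay[k]*by[k], four dot products per iteration.
   Everything is vertical, so no shuffles are needed. */
void dot_soa(float *dot,
             const float *ax, const float *ay,
             const float *bx, const float *by, size_t n)
{
    size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        __m128 xx = _mm_mul_ps(_mm_loadu_ps(ax + k), _mm_loadu_ps(bx + k));
        __m128 yy = _mm_mul_ps(_mm_loadu_ps(ay + k), _mm_loadu_ps(by + k));
        _mm_storeu_ps(dot + k, _mm_add_ps(xx, yy));
    }
    for (; k < n; k++)                       /* scalar tail for leftover elements */
        dot[k] = ax[k] * bx[k] + ay[k] * by[k];
}
```

Every lane of every vector holds useful data here, and the loop body is two multiplies and one add per four results.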

See https://stackoverflow.com/tags/sse/info for some links, specifically Slides: SIMD at Insomniac Games (GDC 2015) which also transcribes the text of the talk along with each slide. It has a more gradual introduction to these concepts, with some diagrams.

Peter Cordes
  • You might add that the addresses passed to the SIMD instructions must be aligned on 16 byte boundaries (for intel at least). Memory allocated with `malloc()` is correctly aligned for systems with SIMD instructions, but structure members with the expected layout might not be correctly aligned. – chqrlie Apr 19 '20 at 20:30
  • @chqrlieforyellowblockquotes: I used `_mm_loadu_si128` and storeu - unaligned load/store. Modern x86 can do that fairly efficiently, although without AVX it defeats some optimizations, and cache-line splits still have a penalty. – Peter Cordes Apr 19 '20 at 20:34
  • OK, there seems to be no end in sight for the instruction bloat in modern CPUs. So SIMD can operate on any alignment. You might add a note explaining this for programmers not up to date with the latest tricks (such as I obviously). – chqrlie Apr 19 '20 at 20:38
  • @chqrlieforyellowblockquotes: I don't think there's any need to get into the nitty gritty of implementation details on any particular ISA; there's tons of other things I'm not mentioning either, like masked stores to only modify one struct member. The OP linked a PS3 Cell tutorial so even talking about x86 might be the wrong alley. Plus you can use aligned loads on an array of structs; every SIMD ISA has ways of dealing with misaligned data. – Peter Cordes Apr 19 '20 at 20:45