
I've been trying to get up to speed on where we are taking advantage of vectorisation.

Of course, the answer to optimisation is always to profile, make a change and profile again, but you don't necessarily know what CPU will be used when your application is deployed, let alone what capabilities the next CPU around the corner will have.

It seems the best option is AoSoA style programming.

So we kind of know collectively that the layout of a structure should be something like (simplified pseudo-code):

struct block
{
   ALIGN_AND_PAD int16_t field1[blockSize];
   ALIGN_AND_PAD int32_t field2[blockSize];
};
struct AoSoA
{
   block blocks[arraySize/blockSize];   // array of blocks, each a small SoA
};

rather than:

std::vector< someStruct >

We can observe that if blockSize=1 we have AoS and if blockSize=arraySize we have SoA.
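To make the layout concrete, here is a minimal C++ sketch of the pseudo-code above. The names `Block`, `AoSoA` and `kAssumedCacheLine` are hypothetical, the 64-byte alignment is an assumption standing in for `ALIGN_AND_PAD`, and the block size is fixed at compile time purely to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64 bytes stands in for ALIGN_AND_PAD (a common cache-line size).
constexpr std::size_t kAssumedCacheLine = 64;

// One block: a small structure-of-arrays holding BlockSize elements.
template <std::size_t BlockSize>
struct Block {
    alignas(kAssumedCacheLine) std::int16_t field1[BlockSize];
    alignas(kAssumedCacheLine) std::int32_t field2[BlockSize];
};

// The container: an array of blocks (array of structures of arrays).
template <std::size_t ArraySize, std::size_t BlockSize>
struct AoSoA {
    static_assert(BlockSize > 0 && ArraySize % BlockSize == 0,
                  "this sketch assumes the block size divides the array size");

    std::array<Block<BlockSize>, ArraySize / BlockSize> blocks;

    // Element i lives in block i / BlockSize, at lane i % BlockSize.
    std::int32_t& field2(std::size_t i) {
        assert(i < ArraySize);
        return blocks[i / BlockSize].field2[i % BlockSize];
    }
};
```

With this sketch, `AoSoA<16, 1>` has 16 one-element blocks (AoS, although the padding makes that case very wasteful) and `AoSoA<16, 16>` has a single block (SoA).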

It is unclear what block size is best given the various widths of buses and cache lines, beyond the requirement that each block's size be a suitable multiple of 64 bytes.
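As a rough worked example of that constraint, the sketch below computes how many bytes a block occupies when each field array is padded up to a multiple of 64, and how much of that is padding. The helper names are hypothetical and 64 is an assumed cache-line size, not something queried from the hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kLine = 64;

// Round n up to the next multiple of `multiple`.
constexpr std::size_t roundUp(std::size_t n, std::size_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

// Bytes occupied by one block when each field array is padded to kLine.
constexpr std::size_t paddedBlockBytes(std::size_t blockSize) {
    return roundUp(blockSize * sizeof(std::int16_t), kLine) +
           roundUp(blockSize * sizeof(std::int32_t), kLine);
}

// Bytes of padding per block (occupied bytes minus useful payload).
constexpr std::size_t wastedBytes(std::size_t blockSize) {
    return paddedBlockBytes(blockSize) -
           blockSize * (sizeof(std::int16_t) + sizeof(std::int32_t));
}

static_assert(wastedBytes(8) == 80, "8 elements: 48 payload bytes in 128");
static_assert(wastedBytes(16) == 32, "16 elements: 96 payload bytes in 128");
static_assert(wastedBytes(32) == 0, "32 elements: 192 payload bytes in 192");
```

For these two fields, 32 is the smallest block size that wastes no padding under the 64-byte assumption. Whether it is also the fastest is a separate question that only profiling can answer.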

Not so long ago AVX2 was introduced. It contains a gather instruction specifically aimed at "enabling vector elements to be loaded from non-contiguous memory locations". I dimly recall learning about gather/scatter back in the 90s when I was using a SPARC (though I may have been reading a book about a Cray or some such thing at the time).

Gather as a mainstream operation would appear to reduce the advantages of using AoSoA, or rather to reduce the disadvantages of using a conventional AoS layout. I think I am correct in assuming the gain is not (yet) sufficient to render AoSoA obsolete.
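The difference in access pattern can be sketched in plain C++ without intrinsics. In the AoS loop, consecutive `field2` values are `sizeof(Record)` bytes apart, so a vectorising compiler would need gathers or shuffles. In the AoSoA loop they are contiguous within each block, so ordinary vector loads suffice. The names `Record`, `FieldBlock` and the two functions are hypothetical, alignment is omitted for brevity, and whether either loop actually vectorises depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Conventional AoS element.
struct Record {
    std::int16_t field1;
    std::int32_t field2;
};

// AoSoA block of B elements (alignment and padding left out of this sketch).
template <std::size_t B>
struct FieldBlock {
    std::int16_t field1[B];
    std::int32_t field2[B];
};

// Strided access: each field2 is sizeof(Record) bytes from the previous one.
std::int64_t sumField2AoS(const std::vector<Record>& records) {
    std::int64_t total = 0;
    for (const Record& r : records) total += r.field2;
    return total;
}

// Contiguous access: the inner loop walks adjacent int32_t values.
template <std::size_t B>
std::int64_t sumField2AoSoA(const std::vector<FieldBlock<B>>& blocks,
                            std::size_t count) {
    assert(count <= blocks.size() * B);
    std::int64_t total = 0;
    for (std::size_t b = 0; b * B < count; ++b) {
        // The last block may be only partly filled.
        const std::size_t lanes = (count - b * B < B) ? count - b * B : B;
        for (std::size_t lane = 0; lane < lanes; ++lane)
            total += blocks[b].field2[lane];
    }
    return total;
}
```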

If I want to make my code clean, future-proof and performant on a wide variety of architectures, how should I approach this problem?

How should I choose the appropriate block size and alignment?

My thinking is to roll my own and make the block size either a run-time or a compile-time parameter, and to calculate strides and indices to access fields directly, i.e. write functions like:

Container::Container(std::size_t blockSize);               // constructor
int16_t    Container::getField1(std::size_t index) const;
int32_t    Container::getField2(std::size_t index) const;
void       Container::insert(const someStruct& s);         // disassemble
someStruct Container::getStruct(std::size_t index) const;  // reassemble

Is this sensible? I can't help being concerned that by putting the index calculation in my code rather than letting the compiler generate it, I risk making things worse.
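For concreteness, a minimal sketch of such a container with a run-time block size might look like the following. It is an illustration of the index arithmetic only: the byte-buffer storage, the `memcpy` accessors and the unpadded block layout are all assumptions of this sketch, and `someStruct` is given two fields to match the pseudo-code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct someStruct {
    std::int16_t field1;
    std::int32_t field2;
};

class Container {
public:
    explicit Container(std::size_t blockSize) : blockSize_(blockSize) {
        assert(blockSize > 0);
    }

    // Disassemble: append s, opening a new block when the last one is full.
    void insert(const someStruct& s) {
        if (size_ % blockSize_ == 0) bytes_.resize(bytes_.size() + blockBytes());
        std::memcpy(&bytes_[offset1(size_)], &s.field1, sizeof s.field1);
        std::memcpy(&bytes_[offset2(size_)], &s.field2, sizeof s.field2);
        ++size_;
    }

    std::int16_t getField1(std::size_t index) const {
        assert(index < size_);
        std::int16_t v;
        std::memcpy(&v, &bytes_[offset1(index)], sizeof v);
        return v;
    }

    std::int32_t getField2(std::size_t index) const {
        assert(index < size_);
        std::int32_t v;
        std::memcpy(&v, &bytes_[offset2(index)], sizeof v);
        return v;
    }

    // Reassemble a whole element from its two fields.
    someStruct getStruct(std::size_t index) const {
        return {getField1(index), getField2(index)};
    }

    std::size_t size() const { return size_; }

private:
    // Block layout: blockSize_ int16_t values, then blockSize_ int32_t values.
    std::size_t blockBytes() const {
        return blockSize_ * (sizeof(std::int16_t) + sizeof(std::int32_t));
    }
    std::size_t offset1(std::size_t i) const {
        return (i / blockSize_) * blockBytes() +
               (i % blockSize_) * sizeof(std::int16_t);
    }
    std::size_t offset2(std::size_t i) const {
        return (i / blockSize_) * blockBytes() +
               blockSize_ * sizeof(std::int16_t) +
               (i % blockSize_) * sizeof(std::int32_t);
    }

    std::size_t blockSize_;
    std::size_t size_ = 0;
    std::vector<unsigned char> bytes_;
};
```

The division and modulo by a run-time `blockSize_` in every accessor are exactly the kind of hand-written index calculation the concern above is about. A compile-time power-of-two block size would let the compiler reduce them to shifts and masks.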

Why can't mainstream compilers like gcc & clang create this representation automatically as an optimisation pass and also decide what blockSize is best?

I think I saw an SoA annotation for an Intel compiler somewhere, and there are definitely a few research papers that suggest it.

There are a few template libraries that help create AoSoA for C++, but some are quite old and some seem compiler-specific.

Is there any work towards making something more standard? For example, a compiler annotation that would work in gcc, clang or both, or a Boost library?

Bruce Adams
  • Gather needs a vector of indices, and there's no sign yet of it being anywhere close to as efficient as a contiguous SIMD load even if multiple elements happen to come from 1 cache line. On CPUs that support it fully efficiently (https://uops.info/), it accesses cache separately for each element (so at best a throughput of one per 4 cycles, for an 8-float gather), and decodes to about 4 uops, vs. `vaddps ymm0, ymm1, [rdi]` being a single uop with a load micro-fused to the ALU uop as it goes through the pipeline, with 2/clock throughput for both the 32-byte load and the FP add uops in the back-end. – Peter Cordes May 04 '22 at 18:25
  • FWIW you can get `std::vector` to align your structures just fine, but you may have to write a custom allocator to do so. But it should respect [`alignas`](https://en.cppreference.com/w/cpp/language/alignas) out of the box regardless. – Mgetz May 04 '22 at 18:25
  • @Mgetz: Yes, from C++17 onward, `std::vector` respects over-aligned `T`. Before that it potentially violates their `alignas` if you don't use a custom allocator. So in C++17 you can just use `alignas`. – Peter Cordes May 04 '22 at 18:27
  • *Why can't mainstream compilers like gcc & clang create this representation automatically as an optimisation pass and also decide what blockSize is best?* - Because data layout becomes part of the ABI for interoperability of separately-compiled source. Compiler devs *really* don't want optimization options like `-march=` to affect the ABI, thus they've [resisted defining `std::hardware_destructive_interference_size` (and constructive)](https://stackoverflow.com/q/39680206/224132), for example. – Peter Cordes May 04 '22 at 18:36
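
The point made in the comments about `alignas` and `std::vector` can be checked with a small sketch, assuming a C++17 (or later) compiler and standard library. `PaddedRecord`, `isAligned` and `vectorStorageIsAligned` are hypothetical names, and 64 is again an assumed cache-line size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Over-aligned element type: alignas also pads sizeof up to 64 bytes.
struct alignas(64) PaddedRecord {
    std::int16_t field1;
    std::int32_t field2;
};

static_assert(alignof(PaddedRecord) == 64, "alignment follows alignas");
static_assert(sizeof(PaddedRecord) == 64, "size is padded to the alignment");

// True if p is a multiple of `alignment` bytes.
inline bool isAligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// From C++17 on, the default allocator uses aligned allocation for
// over-aligned types, so the vector's storage should be 64-byte aligned.
bool vectorStorageIsAligned(std::size_t n) {
    std::vector<PaddedRecord> v(n);
    return n == 0 || isAligned(v.data(), alignof(PaddedRecord));
}
```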
