
I want to speed up a simple integrator that describes a set of massless particles by their position and speed. I'm not an SSE/AVX expert, but I'm curious what SIMD extensions could yield here.

Many papers suggest using a structure of arrays:

struct {
  float2 xy[OUGHTA_BE_ENOUGH];
  float2 vxvy[OUGHTA_BE_ENOUGH];
} Particles;

// in main loop:
Particles.xy[i] += time_delta * Particles.vxvy[i];

However, for many applications the opposite approach would be beneficial:

struct Particle {
  float2 xy;
  float2 vxvy;
} particles[OUGHTA_BE_ENOUGH];

// in main loop:
particles[i].xy += time_delta * particles[i].vxvy;

Although I vaguely understand what to search for to vectorise the structure-of-arrays version, I doubt there is any way to use SIMD with the array-of-structures version because of the field access, or “swizzling”.

Are there any techniques to do the computation above with SIMD, or maybe there are intrinsics I missed?

Schmo
  • C != C++. Tag only with the language that you're using, unless both are actually relevant. – tambre Dec 02 '17 at 07:21
  • @tambre, you are absolutely right — I just thought this question is indeed relevant for both of them. – Schmo Dec 02 '17 at 07:24
  • @tambre: The question is about x86 AVX instructions, which you can use from C or C++ (manually with intrinsics, or by the compiler with auto-vectorization). It's not *about* how either C or C++ work. If we needed room for any more tags, though, one of those would be the first to go. In fact, this should probably be tagged with `intrinsics` if that's how the OP is considering vectorizing. – Peter Cordes Dec 02 '17 at 07:44
  • You don't literally need a `struct` of arrays; it's not important that the distance between `xy[i]` and `vxvy[i]` is a compile-time constant. (It could potentially save a register, and a bit of loop overhead for another pointer increment, though.) – Peter Cordes Dec 02 '17 at 07:47
  • What the heck is `float2`, and what does `operator*` do when multiplying it by `time_delta` (which I assume is a scalar `float`)? Is it just element-wise multiplication? A fully SOA approach would use separate `float x[]` and `float y[]` arrays (or dynamically allocated buffers). Doesn't matter here, but when you want the distance between a point and some other points, it avoids any shuffling. – Peter Cordes Dec 02 '17 at 07:49
  • This is the sort of thing I would consider shoving off onto the GPU, examples of which are easy to find. – SoronelHaetir Dec 02 '17 at 07:51

2 Answers

6

See the tag wiki for some links, especially SIMD at Insomniac Games (GDC 2015). Looping over a lot of particles is exactly the same problem as looping over a lot of objects in a game world, so this kind of loop is mentioned there, as are the problems with trying to use AoS.


You don't literally need a struct of arrays; it's not important that the distance between xy[i] and vxvy[i] is a compile-time constant. (It could potentially save a register, and a bit of loop overhead for another pointer increment, though. But seriously, most people don't use giant structs; they use separately-allocated arrays if the size isn't known at compile time. They might keep all the pointers in a struct, though.)
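
For example, separately-allocated arrays with all the pointers kept in one struct might look like the following (a sketch with hypothetical names, not code from the question; this one is fully SoA, with x and y split as suggested in the comments above):

#include <stddef.h>

// Fully SoA: each component is its own contiguous array, so an update loop
// needs no shuffling at all and auto-vectorizes easily.
typedef struct {
    float *x, *y;     // positions
    float *vx, *vy;   // velocities
    size_t n;
} ParticlesSoA;

void update_soa(ParticlesSoA *p, float dt)
{
    for (size_t i = 0; i < p->n; i++) {
        p->x[i] += dt * p->vx[i];
        p->y[i] += dt * p->vy[i];
    }
}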


You (or a compiler) can shuffle and gain a speedup over scalar for your AoS approach, but if you're looping over every particle anyway, you're shooting yourself in the foot by doing that. Your float2 xy pairs only come in 64-bit chunks, so you can't usefully use 128-bit stores. Using twice as many 64-bit stores sucks: you're losing out on half the power of SSE, or 75% of the power of AVX. On top of that, you're spending extra instructions shuffling or loading as well as storing.

Moving data around costs as much or more than the actual multiply/add, especially for throughput not latency. Even with a perfect SoA layout, it wouldn't be the FMA units that are the bottleneck, it would be load/store throughput, or total instruction throughput (CPU front-end) without significant loop unrolling. Adding any overhead on top of that means you just bottleneck on the front end (or on shuffle throughput).

There's no way around dirtying the cache lines containing vxvy, whether you store to vxvy explicitly or not, when they're interleaved with xy. Thus you will always need twice the store bandwidth vs. SoA for a problem like this.

With AVX for your bad AoS layout, it would be worth it to manually shuffle and then do a 256b store that stores the new xy values and rewrites vxvy with the value you loaded earlier, but the compiler isn't allowed to do that when auto-vectorizing unless whole-program optimization proves that this code is single-threaded. The C11 and C++11 memory models agree that it's not a data race for one thread to write some array elements or struct members while other threads are writing other elements. So a non-atomic read-modify-write of the vxvy members isn't allowed when the source only reads those elements. (i.e. the compiler isn't allowed to invent writes to memory locations / objects not written by the original code, even if it's rewriting the data that was there originally.) Of course if you do it manually with intrinsics, the compiler can do it. Maybe even particles[i].vxvy = particles[i].vxvy; would give the compiler license to read / shuffle / rewrite if it wanted, as sketched below.
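
As a concrete sketch of that idea (not code from the original answer, and there is no guarantee a particular compiler will actually take the hint), the scalar source with the redundant rewrite of vxvy would look like this, reusing the float2 operators from the question:

void update_aos_hint(Particle *particles, size_t size, float time_delta)
{
    for (size_t i = 0; i < size; i++) {
        particles[i].xy  += time_delta * particles[i].vxvy;
        particles[i].vxvy = particles[i].vxvy;  // redundant, but makes the full-struct write explicit in the source
    }
}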

Actually, a compiler could vectorize this way by using vmaskmovps to do masked stores instead of regular vmovups stores. It's slower than regular stores (Haswell/Skylake: 4 fused-domain uops that need p0, p1, p4, and p23, vs. a regular store being a single micro-fused uop that needs p4 and p237). Normally you want to avoid it, but auto-vectorizing this with it might still be better than separate 64-bit stores, when the compiler needs to avoid rewriting the vxvy bytes. Especially for AVX512, masked stores would allow auto-vectorization with 512b (64-byte) vectors that store 4 xy pairs at once. (Instead of 8 for an SoA format).
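
A hand-written version of that masked-store idea might look like this (my sketch, not from the original answer; it assumes the same Particle layout as the code below, 32-byte alignment, and an even size):

#include <immintrin.h>

void update_aos_maskstore(Particle *particles, size_t size, float time_delta_scalar)
{
    __m256 time_delta = _mm256_set1_ps(time_delta_scalar);
    // write only elements 0,1 and 4,5 (the xy pairs); the sign bit of each mask element selects
    __m256i mask = _mm256_setr_epi32(-1, -1, 0, 0, -1, -1, 0, 0);
    for (size_t i = 0; i < size; i += 2) {
        float *ptr = (float*)(particles + i);
        __m256 p    = _mm256_load_ps(ptr);                           // x0 y0 vx0 vy0 | x1 y1 vx1 vy1
        __m256 vxvy = _mm256_permute_ps(p, _MM_SHUFFLE(3, 2, 3, 2)); // vx0 vy0 vx0 vy0 | vx1 vy1 vx1 vy1
        __m256 upd  = _mm256_fmadd_ps(time_delta, vxvy, p);          // xy lanes now hold xy + dt*vxvy
        _mm256_maskstore_ps(ptr, mask, upd);                         // the vxvy bytes are never written
    }
}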


I checked on how gcc and ICC would auto-vectorize your first version, where x and y are still interleaved inside each float2 (so not fully SoA), but xy[] is a separate array whose layout matches vxvy[], so it can auto-vectorize with pure vertical SIMD ops. (source + asm output on the Godbolt compiler explorer). gcc does ok, making a loop with a single vfmadd213ps instruction. ICC seems to get confused by the float2 struct, and (I think) actually shuffles to de-interleave before multiply/add and then re-interleave after! (I didn't let ICC use AVX or AVX512, because longer vectors mean more shuffling, so it's even harder to see what it's doing.) This is one of the rare times ICC auto-vectorizes worse than gcc.
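
For reference, that first version presumably looks something like this (my reconstruction, not the exact Godbolt source; it relies on the float2 operators mentioned below):

// xy[] and vxvy[] are separate arrays of float2, so element i of one array
// lines up with element i of the other and pure vertical SIMD works.
void update_soa_float2(float2 *xy, const float2 *vxvy, size_t size, float time_delta)
{
    for (size_t i = 0; i < size; i++) {
        xy[i] += time_delta * vxvy[i];
    }
}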

gcc and ICC both fail to auto-vectorize update_aos at all. Here's how I manually vectorized it (for AVX + FMA):

// struct definitions and float2 operator*(float scalar, const float2 &f2)
// included in the Godbolt link, see above.

#include <immintrin.h>
void update_aos_manual(Particle *particles, size_t size, float time_delta_scalar)
{
    __m256 time_delta = _mm256_set1_ps(time_delta_scalar);
    // note: compiler can't prove this loop isn't infinite.  (because i+=2 could wrap without making the condition false)
    for(size_t i=0 ; i<size ; i+=2) {
        float *ptr = (float*)(particles + i);
        __m256 p = _mm256_load_ps(ptr); // xy0 vxvy0 | xy1 vxvy1  (requires 32-byte alignment)
        // move each 64-bit vxvy pair down next to its xy pair, with zeros above it
        __m256 vxvy = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(p), _mm256_setzero_pd())); // vxvy0 0 | vxvy1 0
        p = _mm256_fmadd_ps(time_delta, vxvy, p);   // xy += td*vxvy;  vxvy += td*0
        _mm256_store_ps(ptr, p);

        //particles[i].xy += time_delta * particles[i].vxvy;
        //particles[i].vxvy += 0.0f * particles[i].vxvy;
    }
}

With gcc and ICC, this compiles to an inner loop like

 ## gcc7.2 -O3 -march=haswell
 # various scalar setup for the loop:
     vbroadcastss    ymm0, xmm0        # ymm0 set1(time_delta_scalar)
     vxorps  xmm3, xmm3, xmm3          # ymm3 = setzero_ps
.L27:
    vmovaps ymm2, YMMWORD PTR [rdi]        # load 2 struct Particle
    add     rdi, 32                        # ptr+=2 (16 bytes per element)
    vunpckhpd       ymm1, ymm2, ymm3       # unpack the high 64-bit half of each lane with zero
    vfmadd132ps     ymm1, ymm2, ymm0       # xy += delta*vxvy; vxvy += delta*0
    vmovaps YMMWORD PTR [rdi-32], ymm1    # store back into the array
    cmp     rdi, rax
    jne     .L27

This wastes half its store bandwidth (unavoidable), and half its FMA throughput, but I don't think you can do any better. Well, unrolling would help, but shuffling / deshuffling and using fewer FMA would probably still bottleneck on the front-end. With unrolling, you can get this to run at almost one 32B store per clock (4 fused domain uops per clock) on Haswell/Skylake.
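
An unrolled variant might look like the following (a sketch, assuming size is a multiple of 4 and the same 32-byte alignment as above):

// Process two 256-bit vectors (4 particles) per iteration to amortize loop overhead.
void update_aos_manual_unrolled(Particle *particles, size_t size, float time_delta_scalar)
{
    __m256 time_delta = _mm256_set1_ps(time_delta_scalar);
    __m256d zero = _mm256_setzero_pd();
    for (size_t i = 0; i < size; i += 4) {
        float *ptr = (float*)(particles + i);
        __m256 p0 = _mm256_load_ps(ptr);
        __m256 p1 = _mm256_load_ps(ptr + 8);
        __m256 v0 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(p0), zero));
        __m256 v1 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(p1), zero));
        p0 = _mm256_fmadd_ps(time_delta, v0, p0);
        p1 = _mm256_fmadd_ps(time_delta, v1, p1);
        _mm256_store_ps(ptr, p0);
        _mm256_store_ps(ptr + 8, p1);
    }
}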

Peter Cordes
  • I'd give you SO gold if it were a thing. – Schmo Dec 03 '17 at 06:13
  • @Schmo: It is a thing, but I had to earn it myself: [I was the first SO user with a gold badge in the `[x86]` tag](https://stackoverflow.com/help/badges/6031/x86). (Paul R got his recently, and has SSE and AVX gold badges). Anyway, an "accept" vote would be fine if this answer solved your problem. (checkbox under the up/down vote arrows.) – Peter Cordes Dec 03 '17 at 06:17
  • Was just about to do that. That was a wordplay on “Reddit Gold”, sorry I was unclear. I just wanted to note that I really did not expect that amount of input from a top contributor to my silly question. – Schmo Dec 03 '17 at 06:19
  • @Schmo: I'm aware, sometimes I just like to brag about my nerd points and be an Internet badass :) And BTW, this answer ended up a bit longer than I set out to write. That always happens to me, xD. – Peter Cordes Dec 03 '17 at 06:19
  • @Schmo: BTW, compilers actually *could* auto-vectorize when modifying data interleaved with untouched data using `vmaskmovps` masked stores. Updated my answer. (gcc and ICC *don't* in this case, but they'd be allowed to and still be totally thread-safe.) Anyway, manual vectorization that rewrites `vxvy` still seems like the best bet *if* you were forced to use this interleaved data format (e.g. if this was just a small part of a program and cache locality for scattered `xy[i]` and `vxvy[i]` was necessary for performance in the main hotspot.) – Peter Cordes Dec 03 '17 at 07:42
2

The structure-of-arrays approach suggested in those papers might be better because, if you compile with SSE for example, the compiler can process four floats with one instruction, so the loop can be expanded to process two particles at the same time. With gcc and -O3 this actually happens; just look at the disassembly.

With your suggested layout this is still possible, but it incurs swizzling overhead.

If you are concerned about keeping a single linear memory access pattern, it should not make a difference here: modern processors can handle several linear access streams at once as far as caching and prefetching are concerned.

Now, since you are simulating a set of particles, I would recommend vectorizing across multiple particles.

struct Pack
{
    floatN x; // N = 2, 4, 8 depending on your vectorization target
    floatN y;
};

This way you can process multiple particles at the same time and it's easier to write the vectorisation.

void f(Pack& pos, const Pack& speed) {
    pos.x = add(pos.x, speed.x);  // add() = element-wise SIMD add (wrapper or intrinsic)
    pos.y = add(pos.y, speed.y);
}
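
For instance, a concrete AVX instantiation of this idea could look like the following (my sketch; here floatN becomes __m256, i.e. 8 particles per pack, and FMA is used so the time step can be folded in):

#include <immintrin.h>

struct Pack8 {
    __m256 x;   // x components of 8 particles
    __m256 y;   // y components of 8 particles
};

// pos += dt * speed for 8 particles at once, with no shuffling at all
void step(Pack8& pos, const Pack8& speed, float dt)
{
    __m256 vdt = _mm256_set1_ps(dt);
    pos.x = _mm256_fmadd_ps(vdt, speed.x, pos.x);
    pos.y = _mm256_fmadd_ps(vdt, speed.y, pos.y);
}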

The downside is that it might not apply if a computation involves two particles at the same time: particleASpeed += force(particleBPos, particleAPos).

Also, consider using OpenCL or CUDA for this kind of computation, and consider running it on a GPU.

Finally, although this advice should perhaps have come sooner, remember to measure what you are trying to optimize before and after. What I suggest is at best an educated guess and might not change anything, depending on your actual implementation and the size of your problem.

Guillaume Gris
  • Interesting guess at what the OP was thinking was better with the AoS approach. Yes, modern CPUs can track multiple prefetch streams. (On Intel CPUs, the L2 streamer can track one forward and one backward sequential access pattern per 4k page.) – Peter Cordes Dec 02 '17 at 09:15
  • And yes, grouping into chunks of x and chunks of y can be good, but can also defeat compiler auto-vectorization. And it means you have to parameterize your source for different vector widths. – Peter Cordes Dec 02 '17 at 09:17
  • What I generally do is use the maximum width I potentially target, say 16 floats, and I use a wrapper that decomposes it into multiple smaller-width operations when that width is not available. Vectorization is hard, de-vectorization is trivial. – Guillaume Gris Dec 02 '17 at 09:19
  • Yeah that's true, but less future-proof (AVX1024 someday?). Also, more interleaving than needed means worse spatial locality if you are accessing the x and y of a single (or some scattered) points. If you ever do that, might be worth interleaving as finely as possible. – Peter Cordes Dec 02 '17 at 09:21
  • With my understanding of particle physics, I would first say that not all these attributes will vary and secondly some of these attributes might even be defining constants. You might get a speedup by grouping computations for similar particles. As for deciding on the best memory pattern in the situation you describe, I won't speculate any further on this. Try the two and measure in a realistic simulation setup (size of dataset, multithreading, …) – Guillaume Gris Dec 03 '17 at 08:19