See the sse tag wiki for some links, especially SIMD at Insomniac Games (GDC 2015). Looping over a lot of particles is exactly the same problem as looping over a lot of objects in a game world, so that talk covers this kind of loop, along with the problems you run into trying to use AoS.
You don't literally need a struct of arrays; it's not important that the distance between xy[i] and vxvy[i] is a compile-time constant. (It could potentially save a register, and a bit of loop overhead for another pointer increment, though. But seriously, most people don't use giant structs; they use separately-allocated arrays if the size isn't known at compile time. They might keep all the pointers in a struct, though.)
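As a rough sketch of that last point (the names here are mine, not from the question): separately-allocated arrays with the pointers kept together in one struct give you the SoA layout without needing the array sizes at compile time.

#include <cstddef>

// Hypothetical SoA layout: separately-allocated arrays, pointers grouped in a struct.
struct ParticlesSoA {
    float *x, *y;      // positions
    float *vx, *vy;    // velocities
    size_t count;
};

void update_soa(ParticlesSoA &ps, float time_delta)
{
    // Contiguous same-type arrays: a compiler can auto-vectorize this with
    // full-width loads/stores and no shuffling.
    for (size_t i = 0; i < ps.count; i++) {
        ps.x[i] += time_delta * ps.vx[i];
        ps.y[i] += time_delta * ps.vy[i];
    }
}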
You (or a compiler) can shuffle and gain a speedup over scalar for your AoS approach, but if you're looping over every particle anyway, you're shooting yourself in the foot by doing that. Your float2 xy pairs only come in 64-bit chunks, so you can't usefully use 128-bit stores. Using twice as many 64-bit stores sucks: you're losing out on half the power of SSE, or 75% of the power of AVX. On top of that, you're spending extra instructions shuffling or loading as well as storing.
Moving data around costs as much as or more than the actual multiply/add, especially for throughput rather than latency. Even with a perfect SoA layout, it wouldn't be the FMA units that are the bottleneck; it would be load/store throughput, or total instruction throughput (the CPU front-end) without significant loop unrolling. Adding any overhead on top of that means you just bottleneck on the front-end (or on shuffle throughput).
There's no way around dirtying the cache lines containing vxvy, whether you store to vxvy explicitly or not, when they're interleaved with xy. Thus you will always need twice the store bandwidth vs. SoA for a problem like this.
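For reference, the scalar AoS version under discussion presumably looks something like this (my reconstruction, not the OP's exact code; the real struct definitions are in the Godbolt link further down):

struct float2 { float x, y; };
struct Particle { float2 xy; float2 vxvy; };   // 16 bytes: interleaved position + velocity

// Scalar AoS update: only xy logically changes, but every cache line it touches
// also holds vxvy data, so those lines get dirtied and written back regardless.
void update_aos(Particle *particles, size_t size, float time_delta)
{
    for (size_t i = 0; i < size; i++) {
        particles[i].xy.x += time_delta * particles[i].vxvy.x;
        particles[i].xy.y += time_delta * particles[i].vxvy.y;
    }
}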
With AVX for your bad AoS layout, it would be worth it to manually shuffle and then do a 256b store that stores the new xy values and rewrites vxvy with the value you loaded earlier, but the compiler isn't allowed to do that when auto-vectorizing unless whole-program optimization proves that this code is single-threaded. The C11 and C++11 memory models agree that it's not a data race for one thread to write some array elements or struct members while other threads are writing other elements. A non-atomic read-modify-write of the vxvy members isn't allowed when the source only reads those elements. (i.e. the compiler isn't allowed to invent writes to memory locations / objects not written by the original code, even if it's rewriting the data that was there originally.) Of course if you do it manually with intrinsics, the compiler can do it. Maybe even particles[i].vxvy = particles[i].vxvy; would give the compiler license to read / shuffle / rewrite if it wanted.
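That would be something like the following (an untested idea, using the same Particle struct as above): the dead store to vxvy is now explicit in the source, so rewriting those bytes is no longer an invented write.

// Speculative: an explicit (self-assigning) store to vxvy might give the compiler
// license to rewrite those bytes as part of a wider vector store.
void update_aos_selfstore(Particle *particles, size_t size, float time_delta)
{
    for (size_t i = 0; i < size; i++) {
        particles[i].xy.x += time_delta * particles[i].vxvy.x;
        particles[i].xy.y += time_delta * particles[i].vxvy.y;
        particles[i].vxvy  = particles[i].vxvy;   // dead store, but now "written by the source"
    }
}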
Actually, a compiler could vectorize this way by using vmaskmovps to do masked stores instead of regular vmovups stores. It's slower than regular stores (on Haswell/Skylake: 4 fused-domain uops that need p0, p1, p4, and p23, vs. a regular store being a single micro-fused uop that needs p4 and p237). Normally you want to avoid it, but auto-vectorizing with it might still be better than separate 64-bit stores when the compiler needs to avoid rewriting the vxvy bytes. Especially with AVX512, masked stores would allow auto-vectorization with 512b (64-byte) vectors that store 4 xy pairs at once (instead of 8 for an SoA format).
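A sketch of what that might look like with intrinsics (my example, AVX + FMA, using the Particle struct from above; not from the original answer): use _mm256_maskstore_ps so only the xy elements are ever written.

#include <immintrin.h>

// Speculative sketch: update xy in place with a masked store so the vxvy bytes
// are never written by this code. Assumes size is even and 32-byte alignment.
void update_aos_maskstore(Particle *particles, size_t size, float time_delta_scalar)
{
    __m256 time_delta = _mm256_set1_ps(time_delta_scalar);
    // store only elements 0,1 and 4,5 of each 256b vector (the xy pairs)
    const __m256i xy_mask = _mm256_setr_epi32(-1, -1, 0, 0, -1, -1, 0, 0);
    for (size_t i = 0; i < size; i += 2) {
        float *ptr = (float*)(particles + i);
        __m256 p = _mm256_load_ps(ptr);                        // xy0 vxvy0 | xy1 vxvy1
        __m256 v = _mm256_castpd_ps(_mm256_unpackhi_pd(        // vx0 vy0 0 0 | vx1 vy1 0 0
                       _mm256_castps_pd(p), _mm256_setzero_pd()));
        __m256 newp = _mm256_fmadd_ps(time_delta, v, p);       // xy += td*vxvy; other lanes unused
        _mm256_maskstore_ps(ptr, xy_mask, newp);               // only the xy elements get stored
    }
}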
I checked on how gcc and ICC would auto-vectorize your first version, where xy is still AoS but in a layout that matches vxvy, so it can auto-vectorize with pure vertical SIMD ops. (Source + asm output on the Godbolt compiler explorer.) gcc does OK, making a loop with a single vfmadd213ps instruction. ICC seems to get confused by the float2 struct, and (I think) actually shuffles to de-interleave before the multiply/add and then re-interleaves after! (I didn't let ICC use AVX or AVX512, because longer vectors mean more shuffling, so it's even harder to see what it's doing.) This is one of the rare times ICC auto-vectorizes worse than gcc.
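That layout-matched version presumably looks roughly like this (my reconstruction, using the float2 struct above; the exact source is in the Godbolt link): xy and vxvy are parallel float2 arrays, so every element lines up vertically with its velocity and no shuffling is needed.

// Reconstruction of the "first version": two parallel float2 arrays with matching
// layout, so xy[i] and vxvy[i] sit in the same lane positions of their vectors.
void update_matched(float2 *xy, const float2 *vxvy, size_t size, float time_delta)
{
    for (size_t i = 0; i < size; i++) {
        xy[i].x += time_delta * vxvy[i].x;   // pure vertical SIMD after vectorization:
        xy[i].y += time_delta * vxvy[i].y;   // gcc makes a loop around one vfmadd213ps
    }
}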
gcc and ICC both fail to auto-vectorize update_aos at all. Here's how I manually vectorized it (for AVX + FMA):
// struct definitions and float2 operator*(float scalar, const float2 &f2)
// included in the Godbolt link, see above.
#include <immintrin.h>
void update_aos_manual(Particle *particles, size_t size, float time_delta_scalar)
{
    __m256 time_delta = _mm256_set1_ps(time_delta_scalar);
    // note: compiler can't prove this loop isn't infinite. (because i+=2 could wrap without making the condition false)
    for (size_t i = 0; i < size; i += 2) {
        float *ptr = (float*)(particles + i);
        __m256 p = _mm256_load_ps(ptr);   // xy0 vxvy0 | xy1 vxvy1
        // move each particle's vxvy down into its xy slots, zero the rest:
        // vx0 vy0 0 0 | vx1 vy1 0 0
        __m256 vx0 = _mm256_castpd_ps(
            _mm256_unpackhi_pd(_mm256_castps_pd(p), _mm256_setzero_pd()));
        p = _mm256_fmadd_ps(time_delta, vx0, p);   // p = td*vx0 + p
        _mm256_store_ps(ptr, p);
        //particles[i].xy += time_delta * particles[i].vxvy;
        //particles[i].vxvy += 0.0f * particles[i].vxvy;
    }
}
With gcc and ICC, this compiles to an inner loop like this:
## gcc7.2 -O3 -march=haswell
# various scalar setup for the loop:
vbroadcastss ymm0, xmm0 # ymm0 = set1(time_delta_scalar)
vxorps xmm3, xmm3, xmm3 # ymm3 = setzero_ps
.L27:
vmovaps ymm2, YMMWORD PTR [rdi] # load 2 struct Particle
add rdi, 32 # ptr+=2 (16 bytes per element)
vunpckhpd ymm1, ymm2, ymm3 # unpack high half of each lane with zero
vfmadd132ps ymm1, ymm2, ymm0 # xy += delta*vxvy; vxvy += delta*0
vmovaps YMMWORD PTR [rdi-32], ymm1 # store back into the array
cmp rdi, rax
jne .L27
This wastes half its store bandwidth (unavoidable) and half its FMA throughput, but I don't think you can do any better. Well, unrolling would help, and shuffling / de-shuffling so you could use fewer FMAs would probably still bottleneck on the front-end. With unrolling, you can get this to run at almost one 32B store per clock (4 fused-domain uops per clock) on Haswell/Skylake.
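For example, unrolling by 2 manually would look something like this (a sketch on top of the version above; assumes size is a multiple of 4 and 32-byte alignment): the add/cmp/jne loop overhead gets amortized over twice as many stores and FMAs, leaving more of the 4-uops-per-clock front-end budget for the real work.

#include <immintrin.h>

// Sketch: the same manual loop unrolled by 2 (4 particles per iteration).
void update_aos_manual_unroll2(Particle *particles, size_t size, float time_delta_scalar)
{
    __m256 time_delta = _mm256_set1_ps(time_delta_scalar);
    __m256d zero = _mm256_setzero_pd();
    for (size_t i = 0; i < size; i += 4) {
        float *ptr = (float*)(particles + i);
        __m256 p0 = _mm256_load_ps(ptr);        // particles i, i+1
        __m256 p1 = _mm256_load_ps(ptr + 8);    // particles i+2, i+3
        __m256 v0 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(p0), zero));
        __m256 v1 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(p1), zero));
        _mm256_store_ps(ptr,     _mm256_fmadd_ps(time_delta, v0, p0));
        _mm256_store_ps(ptr + 8, _mm256_fmadd_ps(time_delta, v1, p1));
    }
}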