
I've been trying to debug the performance of some vector-scalar integer addition (adding a constant to every element of a slice) and noticed that enabling/disabling SIMD instructions makes no difference to the runtime. Am I doing something wrong, or is that expected?

Here is the Rust function I'm trying:

#[inline(never)]
fn add_slice(
    input: &[i64],
    output: &mut [i64],
) {
    for (i, &x) in input.iter().enumerate() {
        output[i] = x.wrapping_add(9999);
    }
}

Here is how I'm compiling it:

features="+avx,+avx2,+sse,+sse2,+sse3"
#features="-avx,-avx2,-sse,-sse2,-sse3"
rustc -C opt-level=3 -C target-feature="$features" temp.rs --emit=asm
rustc -C opt-level=3 -C target-feature="$features" temp.rs
./temp
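
And here is roughly how I'm timing it (a simplified sketch, not my exact harness: 100 passes over a 1M-element vector, as described in the comments below, using std::time::Instant here for illustration):

use std::time::Instant;

// add_slice is the function shown above.

fn main() {
    const N: usize = 1_000_000;
    const ITERS: usize = 100;

    let input: Vec<i64> = (0..N as i64).collect();
    let mut output = vec![0i64; N];

    let start = Instant::now();
    for _ in 0..ITERS {
        add_slice(&input, &mut output);
    }
    let elapsed = start.elapsed();

    // Use the result so the compiler can't optimize the work away.
    println!("checksum: {}", output.iter().sum::<i64>());
    println!(
        "{:.2} ns per integer",
        elapsed.as_nanos() as f64 / (N * ITERS) as f64
    );
}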

When I inspect the assembly, the version using SIMD does this:

vpaddq  (%rdi,%rax,8), %ymm0, %ymm1
vpaddq  32(%rdi,%rax,8), %ymm0, %ymm2
vpaddq  64(%rdi,%rax,8), %ymm0, %ymm3
vpaddq  96(%rdi,%rax,8), %ymm0, %ymm4

The version without SIMD just uses a plain addq instead. Both are as I expected.

But when I run the two, they each take 1.3 ns per integer on average. My processor is an old 2014 i5, but I'm still slightly surprised at how slow this is, especially for SIMD. Shouldn't it be able to process multiple integers per cycle and take <1 ns per integer? And why might both versions take the same time?

  • Are you maybe bottlenecked on memory bandwidth on that older CPU? Or worse, on page faults? (See [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)). Even scalar should be able to load+sum+store about 1 element per clock cycle if it doesn't stall for cache misses, but 1.3 ns is multiple clock cycles at 3 to 4 GHz. Have you tried using `u8` or something, which will reduce cache footprint and memory bandwidth per element by a factor of 8? And AVX2 will be a 32x theoretical speedup over scalar, not 4x. – Peter Cordes May 16 '23 at 02:45
  • Maybe try selecting *specific* SIMD levels (AVX but not AVX2, also don't forget to add ssse3 with 3 `s`s). Sometimes a CPU supports new *instructions* but only implements them serially. Also make sure your benchmarking is actually measuring something meaningful of course. – o11c May 16 '23 at 02:46
  • Your asm snippet left out the store part, I guess? The loop is loading + adding + storing, doing `dst[i] = src[i] + 9999` with wrapping_add, right? – Peter Cordes May 16 '23 at 02:47
  • @o11c: There are no Intel CPUs that implement `vpaddq` as 4x scalar `add` operations. Until Alder Lake E-core, there were no mainstream Intel CPUs that supported AVX2 `vpaddq ymm` but only handled it as two 128-bit halves. (Like AMD Bulldozer-family and Zen 1). Intel big-core CPUs have full-width SIMD ALUs. Also, all Intel CPUs with AVX2 have 32-byte load/store data-paths (except Alder Lake E-cores, which a 2014 mac definitely doesn't have.) And data misalignment penalties for 32-byte vectors are low on Haswell and later. (Earlier Intel only supported AVX1 so couldn't run this AVX2 code.) – Peter Cordes May 16 '23 at 02:50
  • I'm benchmarking the time taken to run `add_slice` on a vector of 1M numbers 100 times, and it comes out to ~130ms. I've also tried i32 and u32, which run ~2x as fast (makes sense). @PeterCordes the vpaddq's are preceded by a single vpbroadcastq and followed by 4 vmovdqu's. – mwlon May 16 '23 at 03:22
  • Ah, I think memory bandwidth is the problem. I didn't expect it to be, since my processor's stated memory bandwidth is ~25GB/s, but when I benchmark on smaller vectors that fit in a smaller cache, the difference is apparent: running slices of length 1000 (100k times), the scalar version takes 0.54ns / integer, whereas the SIMD version takes 0.09ns / integer. – mwlon May 16 '23 at 04:01
  • Keep in mind that storing to a cache line requires reading it first, before the store can commit from the store buffer to L1d cache. (Unless your compiler uses `vmovntdq` cache-bypassing stores, which are *only* appropriate for problems that won't fit in cache.) So `A[] = B[]+const` touches 16 bytes per element, but costs 24 bytes of bandwidth per element. See https://www.cs.virginia.edu/stream/ref.html#counting - your benchmark is equivalent to STREAM "scale", except it's doing integer add instead of FP multiply. So the actual memory bandwidth required is 1.5x the "useful" bandwidth. – Peter Cordes May 16 '23 at 04:13
  • Also related re: memory performance: Intel "client" (desktop / laptop) CPUs typically can use most of the available DRAM bandwidth with a single core. But on many-core Xeons, higher latencies mean the same amount of memory-level parallelism (e.g. LFBs or superqueue entries to track incoming / outgoing cache lines from L1 or L2) gives lower per-core bandwidth, much lower than the higher DRAM bandwidth from those big CPUs with more memory controllers. So your CPU can max out DRAM from one core, but [it's not a safe assumption in general](https://stackoverflow.com/q/39260020). – Peter Cordes May 16 '23 at 04:17
  • At 1.3ns per element, 24 bytes/1.3ns = 18.4 GB/s. So that's a respectable fraction of 25GB/s. If your benchmark was somehow transferring 32B per element of DRAM bandwidth, that would be 24.6 GB/s. But probably it's just not able to max out DRAM bandwidth from a single core, at least when doing mixed read + write. – Peter Cordes May 16 '23 at 04:22
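
For reference, a quick sketch of the bandwidth arithmetic from the last two comments (the 24 bytes/element figure counts the 8-byte load, the 8-byte store, and the 8-byte read-for-ownership of the destination cache line; the constants are the numbers quoted above, not new measurements):

fn main() {
    // STREAM-style accounting from the comments: each i64 element costs
    // an 8-byte load + 8-byte store + 8-byte read-for-ownership of the
    // destination line = 24 bytes of DRAM traffic.
    let bytes_per_element = 8.0 + 8.0 + 8.0;
    let ns_per_element = 1.3; // measured time from the question

    // Bytes per nanosecond is numerically equal to GB/s (1e9 bytes / 1e9 ns).
    let gb_per_s = bytes_per_element / ns_per_element;
    println!("effective DRAM bandwidth: ~{gb_per_s:.1} GB/s of the ~25 GB/s peak");
    // Prints ~18.5 GB/s, matching the ~18.4 GB/s estimate above (rounding aside).
}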

0 Answers