I've been trying to debug some vector-scalar integer addition performance issues and noticed that enabling/disabling SIMD instructions doesn't make a difference in performance. Am I doing something wrong, or is that expected?
Here is the Rust function I'm trying:
#[inline(never)]
fn add_slice(input: &[i64], output: &mut [i64]) {
    // Add a constant to every element; wrapping_add avoids overflow panics.
    for (i, &x) in input.iter().enumerate() {
        output[i] = x.wrapping_add(9999);
    }
}
Here is how I'm compiling it:
features="+avx,+avx2,+sse,+sse2,+sse3"
#features="-avx,-avx2,-sse,-sse2,-sse3"
rustc -C opt-level=3 -C target-feature="$features" temp.rs --emit=asm
rustc -C opt-level=3 -C target-feature="$features" temp.rs
./temp
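For completeness, the timing loop is roughly along these lines (a minimal sketch; the buffer size, repetition count, and the final assert to keep the work observable are illustrative, not my exact harness):

```rust
use std::time::Instant;

#[inline(never)]
fn add_slice(input: &[i64], output: &mut [i64]) {
    // Add a constant to every element; wrapping_add avoids overflow panics.
    for (i, &x) in input.iter().enumerate() {
        output[i] = x.wrapping_add(9999);
    }
}

fn main() {
    // Illustrative sizes: 128K i64s (1 MiB), repeated to smooth out noise.
    const N: usize = 128 * 1024;
    const REPS: usize = 10_000;
    let input: Vec<i64> = (0..N as i64).collect();
    let mut output = vec![0i64; N];

    let start = Instant::now();
    for _ in 0..REPS {
        add_slice(&input, &mut output);
    }
    let elapsed = start.elapsed();
    println!(
        "{:.3} ns per integer",
        elapsed.as_nanos() as f64 / (N * REPS) as f64
    );
    // Use the result so the compiler cannot discard the whole loop.
    assert_eq!(output[0], 9999);
}
```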
When I inspect the assembly, the version using SIMD does this:
vpaddq (%rdi,%rax,8), %ymm0, %ymm1
vpaddq 32(%rdi,%rax,8), %ymm0, %ymm2
vpaddq 64(%rdi,%rax,8), %ymm0, %ymm3
vpaddq 96(%rdi,%rax,8), %ymm0, %ymm4
Whereas the version without SIMD just uses a single addq. Both are what I expected.
But when I run the two, they each take about 1.3 ns per integer on average. My processor is an old 2014 i5, but I'm still surprised at how slow this is, especially for the SIMD version. Shouldn't it be able to process multiple integers per cycle, i.e. well under 1 ns per integer? And why might both versions take the same time?