I am attempting to optimize some basic "statistics-lite" primitives (averages, medians and the like) in Rust. I am very happy with the performance of the compiled code: it beats pretty much all the usual Python implementations, even without getting too far into the proverbial "unsafe" weeds. Specifically, on x86, following a helpful suggestion here, the summation code compiles into something that looks glorious, and performs accordingly:
vaddsd xmm0, xmm0, qword ptr [rdi + 8*rcx]
vaddsd xmm7, xmm7, qword ptr [rdi + 8*rcx + 8]
vaddsd xmm5, xmm5, qword ptr [rdi + 8*rcx + 16]
vaddsd xmm4, xmm4, qword ptr [rdi + 8*rcx + 24]
vaddsd xmm3, xmm3, qword ptr [rdi + 8*rcx + 32]
vaddsd xmm6, xmm6, qword ptr [rdi + 8*rcx + 40]
vaddsd xmm2, xmm2, qword ptr [rdi + 8*rcx + 48]
vaddsd xmm18, xmm18, qword ptr [rdi + 8*rcx + 56]
vaddsd xmm17, xmm17, qword ptr [rdi + 8*rcx + 64]
vaddsd xmm15, xmm15, qword ptr [rdi + 8*rcx + 72]
vaddsd xmm14, xmm14, qword ptr [rdi + 8*rcx + 80]
vaddsd xmm13, xmm13, qword ptr [rdi + 8*rcx + 88]
vaddsd xmm12, xmm12, qword ptr [rdi + 8*rcx + 96]
vaddsd xmm10, xmm10, qword ptr [rdi + 8*rcx + 104]
vaddsd xmm9, xmm9, qword ptr [rdi + 8*rcx + 112]
vaddsd xmm8, xmm8, qword ptr [rdi + 8*rcx + 120]
add rcx, 16
cmp rax, rcx
jne .LBB0_2
vxorpd xmm1, xmm1, xmm1
vaddsd xmm16, xmm0, xmm1
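For context, the summation code in question is essentially the following (a reconstruction based on the comparison version below; the function name is mine):

const LANES: usize = 16;

pub fn simd_sum(values: &[f64]) -> f64 {
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();
    // One accumulator per lane, so the additions can be reordered/unrolled.
    let sum = chunks.fold([0.0; LANES], |mut acc, chunk| {
        let chunk: [f64; LANES] = chunk.try_into().unwrap();
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
        acc
    });
    sum.iter().sum::<f64>() + remainder.iter().sum::<f64>()
}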
In the hope that the compiler would do the same for a "vector compare" primitive (used for calculating ranks, medians, etc.), I modified the snippet as follows:
const LANES: usize = 16;

pub fn simd_countsmaller(values: &[f64], val: f64) -> usize {
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();
    let sum = chunks.fold([0; LANES], |mut acc, chunk| {
        let chunk: [f64; LANES] = chunk.try_into().unwrap();
        for i in 0..LANES {
            acc[i] += (chunk[i] < val) as usize;
        }
        acc
    });
    let remainder = remainder.iter().copied().filter(|&n| n < val).count();
    let mut reduced = 0;
    for i in 0..LANES {
        reduced += sum[i];
    }
    reduced + remainder
}
Unfortunately, this turns out to be about 4x slower (~400 ms vs. ~100 ms for 10,000 samples) than a simple values.iter().filter(|x| **x < val).count().
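For completeness, the baseline I am benchmarking against is just that one-liner wrapped in a function (the name is mine):

pub fn countsmaller_naive(values: &[f64], val: f64) -> usize {
    values.iter().filter(|x| **x < val).count()
}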
I can see why when I look at the assembly compiled from simd_countsmaller:
vucomisd xmm0, qword ptr [rdi + 8*r10]
seta al
add r12, rax
xor eax, eax
vucomisd xmm0, qword ptr [rdi + 8*r10 + 8]
seta al
add rbx, rax
xor eax, eax
vucomisd xmm0, qword ptr [rdi + 8*r10 + 16]
seta al
add rdx, rax
xor eax, eax
vucomisd xmm0, qword ptr [rdi + 8*r10 + 24]
seta al
add rcx, rax
xor eax, eax
vucomisd xmm0, qword ptr [rdi + 8*r10 + 32]
seta al
add rsi, rax
xor eax, eax
vucomisd xmm0, qword ptr [rdi + 8*r10 + 40]
seta al
add r9, rax
xor eax, eax
vucomisd xmm0, qword ptr [rdi + 8*r10 + 48]
seta al
add r8, rax
xor eax, eax
................
My questions are therefore as follows:

1. If the above is broadly the right way to tell the compiler to generate SIMD code, is there something I need to do differently to ensure that it actually happens?
2. Otherwise, is there a different way to do what I am trying to do? Specifically, is there a way to do a fast "slice elementwise compare" that would outperform the iterator? (One variant I have in mind is sketched below.)
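For instance, would keeping the per-lane accumulators in floating point, so that the whole loop body stays in the f64 domain, give the compiler a better shot at emitting packed compares? A minimal, unverified sketch of what I mean:

const LANES: usize = 16;

pub fn simd_countsmaller_f64(values: &[f64], val: f64) -> usize {
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();
    // Accumulate 0.0/1.0 per lane instead of usize, hoping the comparison
    // itself vectorizes rather than bouncing through integer registers.
    let sum = chunks.fold([0.0f64; LANES], |mut acc, chunk| {
        let chunk: [f64; LANES] = chunk.try_into().unwrap();
        for i in 0..LANES {
            acc[i] += (chunk[i] < val) as u64 as f64;
        }
        acc
    });
    let remainder = remainder.iter().filter(|&&n| n < val).count();
    // Exact as long as the count fits within f64's 2^53 integer range.
    sum.iter().sum::<f64>() as usize + remainder
}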