
I am attempting to optimize some basic "statistics-lite" primitives (averages, medians and the like) in Rust. I am very happy with the performance of the compiled code, as it beats pretty much all the usual Python implementations, even without getting too far into the proverbial "unsafe" weeds. Specifically, on x86, following a helpful suggestion here, the summation code compiles into something that looks glorious, and performs accordingly:

        vaddsd  xmm0, xmm0, qword ptr [rdi + 8*rcx]
        vaddsd  xmm7, xmm7, qword ptr [rdi + 8*rcx + 8]
        vaddsd  xmm5, xmm5, qword ptr [rdi + 8*rcx + 16]
        vaddsd  xmm4, xmm4, qword ptr [rdi + 8*rcx + 24]
        vaddsd  xmm3, xmm3, qword ptr [rdi + 8*rcx + 32]
        vaddsd  xmm6, xmm6, qword ptr [rdi + 8*rcx + 40]
        vaddsd  xmm2, xmm2, qword ptr [rdi + 8*rcx + 48]
        vaddsd  xmm18, xmm18, qword ptr [rdi + 8*rcx + 56]
        vaddsd  xmm17, xmm17, qword ptr [rdi + 8*rcx + 64]
        vaddsd  xmm15, xmm15, qword ptr [rdi + 8*rcx + 72]
        vaddsd  xmm14, xmm14, qword ptr [rdi + 8*rcx + 80]
        vaddsd  xmm13, xmm13, qword ptr [rdi + 8*rcx + 88]
        vaddsd  xmm12, xmm12, qword ptr [rdi + 8*rcx + 96]
        vaddsd  xmm10, xmm10, qword ptr [rdi + 8*rcx + 104]
        vaddsd  xmm9, xmm9, qword ptr [rdi + 8*rcx + 112]
        vaddsd  xmm8, xmm8, qword ptr [rdi + 8*rcx + 120]
        add     rcx, 16
        cmp     rax, rcx
        jne     .LBB0_2
        vxorpd  xmm1, xmm1, xmm1
        vaddsd  xmm16, xmm0, xmm1
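
(The summation source isn't reproduced above; the sketch below assumes it follows the same `chunks_exact`/`fold` pattern with LANES independent accumulators as the comparison snippet further down, which is consistent with the 16 separate scalar accumulators in the assembly. It is an assumed shape, not the exact original.)

    const LANES: usize = 16;

    // Rough sketch of the kind of summation that produces the assembly above:
    // LANES independent accumulators let the compiler keep one partial sum per
    // register without needing fast-math-style reassociation.
    pub fn simd_sum(values: &[f64]) -> f64 {
        let chunks = values.chunks_exact(LANES);
        let remainder = chunks.remainder();

        let sums = chunks.fold([0.0f64; LANES], |mut acc, chunk| {
            let chunk: [f64; LANES] = chunk.try_into().unwrap();
            for i in 0..LANES {
                acc[i] += chunk[i];
            }
            acc
        });

        // Fold the per-lane partial sums, then add the leftover tail.
        sums.iter().sum::<f64>() + remainder.iter().sum::<f64>()
    }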

In the hope that the compiler would do the same for a "vector compare" primitive (used for calculating ranks, medians, etc.), I modified the snippet as follows:

    const LANES: usize = 16;

    pub fn simd_countsmaller(values: &[f64], val: f64) -> usize {
        let chunks = values.chunks_exact(LANES);
        let remainder = chunks.remainder();

        // One counter per lane, hoping the compiler keeps each in its own register.
        let sum = chunks.fold([0; LANES], |mut acc, chunk| {
            let chunk: [f64; LANES] = chunk.try_into().unwrap();

            for i in 0..LANES {
                acc[i] += (chunk[i] < val) as usize;
            }

            acc
        });

        // Handle the tail that didn't fill a full chunk.
        let remainder = remainder.iter().copied().filter(|&n| n < val).count();

        let mut reduced = 0;
        for i in 0..LANES {
            reduced += sum[i];
        }
        reduced + remainder
    }

Unfortunately, this seems to be about 4x slower (~400 ms vs ~100 ms for 10,000 samples) than a simple `values.iter().filter(|x| **x < val).count()`.
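
For reference, that scalar baseline, wrapped in a function with the same signature for comparison, is just:

    // The one-liner the chunked version is benchmarked against.
    pub fn count_smaller(values: &[f64], val: f64) -> usize {
        values.iter().filter(|&&x| x < val).count()
    }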

I can see why when I look at the compiled assembly of the chunked version:

        vucomisd        xmm0, qword ptr [rdi + 8*r10]
        seta    al
        add     r12, rax
        xor     eax, eax
        vucomisd        xmm0, qword ptr [rdi + 8*r10 + 8]
        seta    al
        add     rbx, rax
        xor     eax, eax
        vucomisd        xmm0, qword ptr [rdi + 8*r10 + 16]
        seta    al
        add     rdx, rax
        xor     eax, eax
        vucomisd        xmm0, qword ptr [rdi + 8*r10 + 24]
        seta    al
        add     rcx, rax
        xor     eax, eax
        vucomisd        xmm0, qword ptr [rdi + 8*r10 + 32]
        seta    al
        add     rsi, rax
        xor     eax, eax
        vucomisd        xmm0, qword ptr [rdi + 8*r10 + 40]
        seta    al
        add     r9, rax
        xor     eax, eax
        vucomisd        xmm0, qword ptr [rdi + 8*r10 + 48]
        seta    al
        add     r8, rax
        xor     eax, eax
        ................

So my questions are as follows:

  1. If the above is broadly the right way to tell the compiler to generate SIMD code, is there something I need to do differently to ensure that it happens?

  2. Otherwise, is there a different way to do what I am trying to do? Specifically, is there a way to do a fast "slice elementwise compare" that would outperform the iterator?

  • Did you enable full optimization? That source should be able to use `vcmpltpd` / `vpsubq` to do 4 elements at once. (Which might be the factor of 4 you're seeing.) Rust isn't auto-vectorizing your add either, but note that vectorizing an array sum would require `clang -ffast-math` or the Rust equivalent. With just `-C opt-level=3`, I'd still expect scalar code for summing an array or other reductions. Being way faster than CPython is usually trivial for ahead-of-time compiled languages, and doesn't indicate anything special. – Peter Cordes Jun 23 '23 at 13:57
  • I have very poor knowledge of vector ops, but have you updated the lanes to 8? You're working with f64, so you have half the lanes in 512 bits. If I keep 16 lanes I get [the same assembly as you](https://godbolt.org/z/cK8rMPTe7), but at 8 I get [a second, much denser block](https://godbolt.org/z/hbqGvqYxP). Also maybe configure target-cpu (or target-features): much like C compilers, Rust defaults to a pretty conservative configuration with just SSE2, and [setting target-cpu=native](https://godbolt.org/z/EasWKP9MT) also leads to large changes; it's unclear whether / how it combines with the lanes (see the sketch after this comment thread). – Masklinn Jun 23 '23 at 14:06
  • @PeterCordes I am curious as to why you conclude that the Rust compiler isn't "auto-vectorizing" the summation? Comparing the generated assembly to what the compiler does with `iter().sum()` seems to suggest there's some vectorization going on. I have not attempted to use any specific flags, other than "cpu-native". As to the comparison with Python, most of the relevant packages are ahead-of-time compiled themselves, so beating them, IMHO, is not trivial by any means. – Martinghoul Jun 23 '23 at 14:23
  • @Masklinn I have not tried to set LANES to 8 for the compare case, but I do know that it slows down the summation code. This makes sense to me, given that 64-bit SSE has 16 XMM registers. I will try to play with the compiler flags, but I was hoping that there's some low-hanging fruit out there. – Martinghoul Jun 23 '23 at 14:32
  • @PeterCordes I did take a look at the assembly generated for the simple `values.iter().filter(|x| **x < val).count()` implementation, and it does look like it IS doing what you suggested and uses vcmpltpd/vpsubq to do 4 at a time, so your conjecture regarding the factor of 4 appears to be spot on. – Martinghoul Jun 23 '23 at 14:38
  • Oh, you mean NumPy and similar things? Yeah that's way different from pure Python. Anyway, I conclude it's not auto-vectorizing because it's using `addsd` - Add Scalar Double, on one `f64` at a time. I didn't notice earlier, but your source is using multiple accumulators (`acc[i]`), so instead of 16 separate scalar adds, it could use 4 `vaddpd ymm, ymm, [mem]`. (Or if you did the rustc equivalent of `-mprefer-vector-width=512`, 2 `vaddpd zmm`). – Peter Cordes Jun 23 '23 at 16:31
  • On recent Intel, the latency x bandwidth product of `vaddpd` is 4 x 2 = 8 ops in flight, but 4 vector accumulators (holding 16 scalars) is still pretty good, especially if your data isn't hot in L1d cache. (See also [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) re: hiding latency of FP math operations so out-of-order exec can overlap across loop iterations, keeping the FP execution units fed with new work every cycle.) – Peter Cordes Jun 23 '23 at 16:32
  • With that asm using 16 scalars, you should bottleneck on the throughput of FP add, no latency bottlenecks, but there's a factor of 4 throughput unused since each uop is only doing 1x not 4x f64 adds – Peter Cordes Jun 23 '23 at 16:34
  • Thanks, @PeterCordes, that's super helpful! I am going to explore to see if there's a way to get the Rust compiler to produce this "vectorized" code. – Martinghoul Jun 26 '23 at 12:53
  • And I just can't, no matter how hard I try. As soon as I start manually using x86 intrinsics (e.g. `_mm256_add_pd`), I can't get the Rust compiler to unroll the loop the same way it does when it's all just native floats. Even though the idea is very much the same (except that I am operating with an array of 16 `__m256d`s that I prepackage), the for loop just uses a single ymm register, rather than vectorizing :(. – Martinghoul Jun 27 '23 at 18:09
  • The plot thickens further: if I simply set LANES to 64 in the snippet referenced above, the Rust compiler replaces vaddsd's with vaddpd's. That's the good news. The bad news is that the vaddpd version of the code is about 4x SLOWER, instead of FASTER! – Martinghoul Jun 28 '23 at 13:14
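
For completeness, here is a sketch of the two knobs suggested in the comments above (Masklinn's lane-count and target-cpu suggestions; nothing here is confirmed to fix the slowdown, it is just the configuration being discussed):

    // Hypothetical change: 8 f64 lanes fill one 512-bit (or two 256-bit)
    // registers, which the linked Godbolt output vectorizes more densely
    // than 16 lanes.
    const LANES: usize = 8;

    // Build with the machine's full instruction set rather than the
    // conservative SSE2 baseline, e.g.:
    //     RUSTFLAGS="-C target-cpu=native" cargo build --release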

0 Answers