Auto-vectorization of a loop shuffling 4 int elements and taking absolute differences vs. the previous arrangement

Question

I'm trying, without succeeding, to make the Rust compiler (rustc 1.66.0) auto-vectorize the following function (without using any unsafe intrinsics):

pub fn get_transforms(s: &mut [i32]) -> u32 {
    assert_eq!(s.len(), 4);

    let mut transforms = 1;

    while s[0] > 0 || s[1] > 0 || s[2] > 0 || s[3] > 0 {
        // Calculate deltas
        let d0 = s[0] - s[1];
        let d1 = s[1] - s[2];
        let d2 = s[2] - s[3];
        let d3 = s[3] - s[0];

        // Assign absolute values
        s[0] = d0.abs();
        s[1] = d1.abs();
        s[2] = d2.abs();
        s[3] = d3.abs();

        transforms += 1;
    }

    transforms
}

My idea is to make it perform the subtractions and the abs() once (using 16 byte registers) instead of four times.

I've read that iterators might help, but I haven't found an easy way to use them here. Compiler flags don't seem to able to help either.

Here's the link to the Compiler Explorer output I'm using as reference: https://godbolt.org/z/7sqq6bszT

It looks to me it optimizes them well in `-Copt-level=3` (`-O` is `-Copt-level=2`, `cargo --release` IIRC is 3). — Chayim Friedman, Jan 11 '23 at 11:39
Neither GCC nor LLVM can auto-vectorize search loops (with a trip count that isn't calculable ahead of the first iteration). [Auto vectorization with Rust](https://stackoverflow.com/a/73119184) — Peter Cordes, Jan 11 '23 at 11:40
@ChayimFriedman: https://godbolt.org/z/8vcj9jfxx doesn't look very good to me; 3x `vpinsrd` from scalar registers every time it wants to set up for `vpsubd` and `vpabsd`, instead of just rotating the vector with `vpshufd`. And no `vmovmaskps` to get the sign bits of those elements. (Or I guess since it wants `>0` not `>=0`, you'd need `vpcmpgt` signed-compare in there.) Anyway, it's maybe vectorizing just the body of one iteration without the loop branch; it's not vectorizing the whole *loop*, and hand-written with intrinsics could go *way* faster. Like maybe 3 to 5x, maybe more. — Peter Cordes, Jan 11 '23 at 11:49
@PeterCordes Isn't that what the OP asked for? "My idea is to make it perform the subtractions and the abs() once (using 32 byte registers) instead of four times." — Chayim Friedman, Jan 11 '23 at 11:51
@ChayimFriedman: The implication is that the whole loop will be efficient. Just satisfying the literal request while storing the result to memory and doing scalar reloads or inserts every iteration doesn't do that. https://godbolt.org/z/qcs6P4Th4 shows that the loop gets more efficient (and easier to follow) if you assert that `s.len() >= 4` ahead of the loop, so it doesn't have to keep checking on every reference to `s[]`. But it's still the same basic strategy of rebuilding two vectors from scratch instead of doing literally one shuffle to generate the subtrahend, big bottleneck on port 5 — Peter Cordes, Jan 11 '23 at 11:58
@ChayimFriedman: On Skylake for example, `vmovd xmm, r32` is one uop for port 5, `vpinsrd xmm, r32` is 2p5, and shuffles like vpunpcklqdq and vpshufd are also p5. (https://uops.info/) So that's 9 uops for port 5 in the inner loop, at best 1 iteration per 9 clocks. vs. maybe one per 3 cycles with a latency bottleneck on `vpshufd` (1c) + `vpsubd` (1c) + `vpabsd` (1c), with the loop exit condition being independent work forked off from that critical path loop-carried dependency chain. — Peter Cordes, Jan 11 '23 at 12:03
@ChayimFriedman: Also no, it's not using 32-byte registers. XMM registers are 16 bytes, which is what you want for 4x i32. The OP's mixed up on that part :P — Peter Cordes, Jan 11 '23 at 12:04
@scasci: are your inputs guaranteed to be non-negative or something? Otherwise your signed subtractions could overflow. e.g. `100 - (-100)` subtraction of two `i8` values can overflow an `i8`; normally with absolute differences you want an unsigned result so you can hold the value no matter how large it is; a signed abs diff can only hold less than half the full range, [0, 2^(n-1) - 1] — Peter Cordes, Jan 11 '23 at 12:08
@PeterCordes you're totally right about the size of the registers XD, and yes the inputs are non-negative (sorry for not pointing it out) — scasci, Jan 11 '23 at 12:13

chtz · Answer 1 · 2023-01-12T22:38:10.337

As @ChayimFriedman noted in the comments, you need to add -Copt-level=3 to the command line. But additionally, you need to make sure that the boolean-expression is not evaluated lazily, i.e., use the | operator instead of || (requires parentheses around the comparisons):

while (s[0] > 0) | (s[1] > 0) | (s[2] > 0) | (s[3] > 0) { 
    // rest of code unmodified

Compiling this produces the following loop:

.LBB0_13:
        vpshufd xmm2, xmm0, 57
        vpsubd  xmm0, xmm0, xmm2
        vpabsd  xmm0, xmm0
        inc     eax
        vpcmpgtd        xmm2, xmm0, xmm1
        vmovmskps       ecx, xmm2
        test    ecx, ecx
        jne     .LBB0_13

The reason why | instead of || helps the compiler is that for expressions like a || b the b expression shall not be evaluated if a is true already -- most of the time this requires a branch depending on the value of a. On the other hand, for a | b both a and b are evaluated every time. Of course, the compiler should be allowed to do any optimization under the "as-if" rule, i.e., if evaluating b does not have any side effects (and the compiler can prove that), a || b and a | b could be compiled to the same code.

N.B.: If you are sure that s[i] are never negative (which is obviously true, if the inputs are non-negative), you can replace s[i] > 0 by s[i] != 0, which will reduce the vpcmpgtd+vmovmskps+test to a single vptest.

Thanks for the explanation! Just wondering: why does the `|` operator make such a big difference? — scasci, Jan 12 '23 at 20:37
Wow, I didn't expect LLVM would be able to vectorize the whole loop, especially not the compare and using `pshufd` to derive one vector from the previous iteration's vector. I guess the usual limit of [not vectorizing when the trip-count isn't calculable ahead of the loop](https://stackoverflow.com/a/73119184) might only apply to loops over arrays. This scalar loop does map naturally to a 4-element vector. — Peter Cordes, Jan 13 '23 at 01:33
@PeterCordes Yes, the inside of OP's loop could itself be written as a loop with length 4 (or probably Rust has some nice map- or fold-expressions for that). The outer loop can't (easily) be vectorized further. — chtz, Jan 13 '23 at 10:06

Auto-vectorization of a loop shuffling 4 int elements and taking absolute differences vs. the previous arrangement

1 Answers1