4

I am new to Rust/SIMD and have the following code snippet as the bottleneck of my program. I am wondering if I can leverage autovectorization to speed it up:

fn is_subset(a: Vec<i64>, b: Vec<i64>) -> bool {
    for i in 0..a.len() {
        if (a[i] & !b[i]) != 0 {
            return false;
        }
    }
    true
}

I also have an alternative way of writing this (with iterators, so the trip count would be known up front). Would this get autovectorized instead?

fn is_subset(a: Vec<i64>, b: Vec<i64>) -> bool {
    return a.iter().zip(b.iter()).all(|(x, y)| x & y == *x)
}

xxx222
  • Is it possible the bottleneck is caused by how you call this function? It accepts an owned `Vec` rather than a slice type like `&[i64]`. It can be quite expensive to clone or rebuild `Vec`s due to the cost of allocation, copying all the elements, then dropping the `Vec` after the function ends. I have very little experience with simd, but I assume it should be attempting auto vectorization for you when you build in release mode. – Locke Jul 26 '22 at 06:36
  • @Locke shouldn't be relevant, there is no "clone or rebuild" inside the function, it just drops the vecs at the end of the scope. Converting the function to using iterators doesn't vectorise either, even using `-C opt-level=3 -C target-cpu=native`. – Masklinn Jul 26 '22 at 06:39
  • In fact even trying to `chunks` the vectors, converting the chunks to fixed size arrays, and working with that, doesn't seem to trigger autovec. So seems non trivial for this case (though my knowledge of autovec is relatively limited so I might be missing something obvious). – Masklinn Jul 26 '22 at 06:47
  • @Masklinn my point is more that the `Vec` has to come from somewhere. Are we absolutely sure that the problem lies inside this function? As for auto vectorization, I don't know enough to suggest a solution. – Locke Jul 26 '22 at 06:55
  • I don't know where else it could lie, since you can compile the function without any caller as a `pub fn` through something like godbolt. The optimisation could also be made to trigger through the function being inlined and the compiler having more insight, but I'd usually assume it'd be optimised on its own. – Masklinn Jul 26 '22 at 07:09
  • If the compiler wanted to SIMD-optimize this, it would need to access `a[0..3]` and `b[0..3]` at once, which it must not do if the function would have returned before the 3rd iteration (what if `a.len() > 4 && b.len() < 4`, but `(a[0] & !b[0]) != 0`?). – chtz Jul 26 '22 at 07:24

1 Answer

7

LLVM (and GCC) don't know how to auto-vectorize loops whose trip-count can't be calculated up front. This rules out search loops like this.

ICC classic can auto-vectorize such loops, but it's a C/C++ compiler, without a front-end for Rust.

Probably your only hope would be to manually loop over 2, 4, or 8-element chunks of the arrays, branchlessly calculating your condition based on all those elements. If you're lucky, LLVM might turn that into operations on one SIMD vector. So using that inner chunk loop inside an outer loop could get the compiler to emit vectorized asm, for example using AVX vptest (which sets CF according to whether bitwise a AND (not b) has any non-zero bits).

i.e. manually express the "unrolling" of SIMD elements in your source, for a specific vector width.
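For example, here is a sketch of that manual chunking (the function name, the slice parameters, and the chunk width of 4 are my own choices for illustration, not something from the question): compute the condition branchlessly over each 4-element chunk, branch once per chunk, and finish the tail with scalar code. Whether LLVM actually turns the inner OR of the four `x & !y` terms into SIMD depends on the target features you compile with (e.g. `-C target-cpu=native`), so check the generated asm rather than assuming it.

// Sketch only: chunked, mostly-branchless variant of the original function.
pub fn is_subset_chunked(a: &[i64], b: &[i64]) -> bool {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);
    let mut chunks_a = a.chunks_exact(4);
    let mut chunks_b = b.chunks_exact(4);
    for (ca, cb) in (&mut chunks_a).zip(&mut chunks_b) {
        // Branchless combination of all 4 lanes of `x & !y`.
        let bad = (ca[0] & !cb[0]) | (ca[1] & !cb[1]) | (ca[2] & !cb[2]) | (ca[3] & !cb[3]);
        if bad != 0 {
            return false;
        }
    }
    // Scalar tail for the remaining 0..=3 elements.
    chunks_a
        .remainder()
        .iter()
        .zip(chunks_b.remainder())
        .all(|(x, y)| (x & !y) == 0)
}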


If the whole array is small enough, a branchless reduction over the whole array (ORing the per-element results together) could be something a compiler is willing to vectorize. If compiling for x86, you'd be hoping for asm like pandn / por in the loop, and a horizontal reduction at the end, to see if any bits were set in the intersection of a and not b.
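As a rough sketch of that whole-array reduction (again my own example, with a made-up name): OR every `x & !y` together and test the accumulated bits once at the end, so there is no data-dependent branch inside the loop and the trip count is known up front.

// Hypothetical branchless whole-array version: no early exit, one check at the end.
// The compiler is at least allowed to vectorize this (pandn/por style on x86),
// though it may still decide it isn't profitable for i64.
pub fn is_subset_branchless(a: &[i64], b: &[i64]) -> bool {
    a.iter()
        .zip(b.iter())
        .fold(0i64, |acc, (x, y)| acc | (x & !y)) == 0
}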

i64 is only 2 elements per 16-byte vector, so the compiler would have to see a good strategy for it to be profitable to auto-vectorize, especially on a 64-bit machine. Having 32-byte vectors available makes it more attractive.

Peter Cordes
  • Regarding your answer: "LLVM (and GCC) don't know how to auto-vectorize loops whose trip-count can't be calculated up front." Is there anything I can do to let them know the trip count beforehand? Does it help if I precalculate the length with `min(a.len(), b.len())`? – xxx222 Jul 26 '22 at 07:56
  • @xxx222: Yeah, you can remove the `return false;` from inside the loop, so the loop will always run `a.len()` iterations. But then it wouldn't be the search function you wanted it to be. If removing data-dependent branching would require min(lengths), then yeah do that if you want to attempt it. I had assumed your inputs would be the same length, but I guess `b` can be shorter if there's a match anywhere before its end with the early-out version. – Peter Cordes Jul 26 '22 at 08:18
  • I see what you meant. If I wrote it this way (using iterators, without early stopping): `a.iter().zip(b.iter()).all(|(x, y)| x & y == *x)`, would this trigger autovectorization instead? – xxx222 Jul 26 '22 at 17:05