
I'm trying to create an efficient SIMD version of a dot product for the i16 type, to implement 2D convolution for a FIR filter.

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn dot_product(a: &[i16], b: &[i16]) {
    let a = a.as_ptr() as *const [i16; 16];
    let b = b.as_ptr() as *const [i16; 16];
    let a = std::mem::transmute(*a);
    let b = std::mem::transmute(*b);
    let ms_256 = _mm256_mullo_epi16(a, b);
    dbg!(std::mem::transmute::<_, [i16; 16]>(ms_256));
    let hi_128 = _mm256_castsi256_si128(ms_256);
    let lo_128 = _mm256_extracti128_si256(ms_256, 1);
    dbg!(std::mem::transmute::<_, [i16; 8]>(hi_128));
    dbg!(std::mem::transmute::<_, [i16; 8]>(lo_128));
    let temp = _mm_add_epi16(hi_128, lo_128);
}

fn main() {
    let a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    let b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    unsafe {
        dot_product(&a, &b);
    }
}
I] ~/c/simd (master|…) $ env RUSTFLAGS="-C target-cpu=native" cargo run --release | wl-copy
warning: unused variable: `temp`
  --> src/main.rs:16:9
   |
16 |     let temp = _mm_add_epi16(hi_128, lo_128);
   |         ^^^^ help: if this is intentional, prefix it with an underscore: `_temp`
   |
   = note: `#[warn(unused_variables)]` on by default

warning: 1 warning emitted

    Finished release [optimized] target(s) in 0.00s
     Running `target/release/simd`
[src/main.rs:11] std::mem::transmute::<_, [i16; 16]>(ms_256) = [
    0,
    1,
    4,
    9,
    16,
    25,
    36,
    49,
    64,
    81,
    100,
    121,
    144,
    169,
    196,
    225,
]
[src/main.rs:14] std::mem::transmute::<_, [i16; 8]>(hi_128) = [
    0,
    1,
    4,
    9,
    16,
    25,
    36,
    49,
]
[src/main.rs:15] std::mem::transmute::<_, [i16; 8]>(lo_128) = [
    64,
    81,
    100,
    121,
    144,
    169,
    196,
    225,
]

While I understand SIMD conceptually, I'm not familiar with the exact instructions and intrinsics.

I know that I need to multiply the two vectors and then horizontally sum the result by halving it and using instructions to vertically add the two halves of the smaller size.

I've found the madd instruction, which supposedly does one such summation right after the multiplications, but I'm not sure what to do with the result.

If I use mul instead of madd, I'm not sure which instructions to use to reduce the result further.
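To make it concrete, here is my best guess at a full 16-element version combining madd with that halving reduction, written against the same std::arch import as above (the function name and the exact unpack/shuffle steps are guesses on my part, so treat it as a sketch rather than something I know to be correct or optimal):

#[target_feature(enable = "avx2")]
unsafe fn dot_product_madd(a: &[i16], b: &[i16]) -> i32 {
    assert!(a.len() >= 16 && b.len() >= 16);
    let a = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let b = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    // vpmaddwd: multiply i16 pairs and add adjacent products -> 8 x i32
    let pair_sums = _mm256_madd_epi16(a, b);
    // Halving reduction of the 8 x i32 partial sums: 256 -> 128 -> 64 -> 32 bits.
    let lo = _mm256_castsi256_si128(pair_sums);
    let hi = _mm256_extracti128_si256(pair_sums, 1);
    let sum128 = _mm_add_epi32(lo, hi);                                    // 4 x i32
    let sum64 = _mm_add_epi32(sum128, _mm_unpackhi_epi64(sum128, sum128)); // 2 x i32
    let sum32 = _mm_add_epi32(sum64, _mm_shuffle_epi32(sum64, 1));         // 1 x i32
    _mm_cvtsi128_si32(sum32)
}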

Any help is welcome!

PS: I've tried packed_simd, but it seems it doesn't work on stable Rust.

  • For large arrays, you "vertically" sum the [`vpmaddwd`](https://www.felixcloutier.com/x86/pmaddwd) results into a vector accumulator with `_mm256_add_epi32`, and only do the final horizontal-add (of one or a couple vectors) at the end. (See [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026), assuming you can adapt C/C++ intrinsics to Rust). If you need to avoid overflowing 32-bit sums, you need to widen again after `pmaddwd` inside the loop. – Peter Cordes Jan 31 '21 at 23:57
  • @PeterCordes As I've said, I'm doing the dot product for 2D convolution in a FIR filter. So one of the sequences is the filter kernel, which is going to be something like 16 or 32 taps. – user1685095 Feb 01 '21 at 07:07
  • My experience with SIMD and AVX hasn't included FIR filters or that kind of signal processing, so that didn't tell me what length of dot-product we were dealing with. I think you're implying that there are 16 or 32 elements to sum, so that's more than 1 SIMD vector, so what I said still applies (except you don't need a loop, just fully unroll): vertical add `vpmaddwd` results until you have one vector, then horizontally reduce. Or if you need to avoid overflow in the 32-bit sums of 16-bit products, widen to 64-bit at some point. – Peter Cordes Feb 01 '21 at 07:11
  • @PeterCordes `as *const [i16; 16];` says that it's a vector of 16 16-bit integers. I'm not sure why, to horizontally sum a 16-element vector, I would need more than 1 SIMD vector. Could you elaborate? Maybe write some code in the Rust playground? – user1685095 Feb 01 '21 at 07:33
  • I don't know Rust very well, I'm here for the [simd] tag. Also, I thought this might be a one-vector function you were going to call repeatedly in a loop. Anyway, yes, with only 16x 2 = 32 bytes of input data, you just have one vpmaddwd result if you can use AVX2. hsum it by repeatedly extracting the high half and adding, as in the linked Q&A in my first comment. – Peter Cordes Feb 01 '21 at 07:43
  • @PeterCordes Can you show some code in C with intrinsics? I'm not sure how to hsum by repeatedly extracting the high half. Thanks for your help! – user1685095 Feb 01 '21 at 08:39
  • I don't know how well the Rust auto-vectorizer works, but in C/C++ you can very often compile (on a modern compiler) a simple implementation with `-O3` and some target specification (`-march=native` when building for your local machine), and you get pretty decent assembly: https://godbolt.org/z/zdEoz7. You can look up the corresponding intrinsic for each generated instruction, if you want to translate it to intrinsics. – chtz Feb 01 '21 at 09:53
  • @chtz It does. https://rust.godbolt.org/z/xEY3v1 – Angelicos Phosphoros Feb 01 '21 at 10:43
  • @AngelicosPhosphoros Nice, I guess if you cast `a` and `b` to `i32` before doing the multiplication, you might even get a `vpmaddwd` (I don't know enough rust to try that out ...) – chtz Feb 01 '21 at 12:10
  • [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026) has C with intrinsics and links to other sizes, such as [Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2](https://stackoverflow.com/q/60108658). That's why I linked it for you in the first place. – Peter Cordes Feb 01 '21 at 18:32
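
For reference, here is a minimal sketch of the multi-vector variant the comments above describe (vertically accumulate the vpmaddwd results with _mm256_add_epi32, then do a single horizontal reduction at the end), adapted to Rust's std::arch intrinsics. The function name, the chunks_exact handling, and the assumptions that the length is a multiple of 16 and that the 32-bit sums don't overflow are illustrative, not from the thread:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn dot_product_i16(a: &[i16], b: &[i16]) -> i32 {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 16, 0); // assumption: whole 16-lane vectors, no scalar tail
    let mut acc = _mm256_setzero_si256();
    for (ca, cb) in a.chunks_exact(16).zip(b.chunks_exact(16)) {
        let va = _mm256_loadu_si256(ca.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(cb.as_ptr() as *const __m256i);
        // vpmaddwd produces 8 x i32 pair sums; accumulate them vertically.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    // One horizontal reduction at the end: 256 -> 128 -> 64 -> 32 bits.
    let sum128 = _mm_add_epi32(
        _mm256_castsi256_si128(acc),
        _mm256_extracti128_si256(acc, 1),
    );
    let sum64 = _mm_add_epi32(sum128, _mm_unpackhi_epi64(sum128, sum128));
    let sum32 = _mm_add_epi32(sum64, _mm_shuffle_epi32(sum64, 1));
    _mm_cvtsi128_si32(sum32)
}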

0 Answers