I'm trying to write an efficient SIMD version of a dot product, in order to implement a 2D convolution over i16 values for an FIR filter.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn dot_product(a: &[i16], b: &[i16]) {
    let a = a.as_ptr() as *const [i16; 16];
    let b = b.as_ptr() as *const [i16; 16];
    let a = std::mem::transmute(*a);
    let b = std::mem::transmute(*b);
    let ms_256 = _mm256_mullo_epi16(a, b);
    dbg!(std::mem::transmute::<_, [i16; 16]>(ms_256));
    let hi_128 = _mm256_castsi256_si128(ms_256); // note: the cast yields the LOW 128 bits (lanes 0..8)
    let lo_128 = _mm256_extracti128_si256(ms_256, 1); // note: index 1 yields the HIGH 128 bits (lanes 8..16)
    dbg!(std::mem::transmute::<_, [i16; 8]>(hi_128));
    dbg!(std::mem::transmute::<_, [i16; 8]>(lo_128));
    let temp = _mm_add_epi16(hi_128, lo_128);
}

fn main() {
    let a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    let b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    unsafe {
        dot_product(&a, &b);
    }
}
I] ~/c/simd (master|…) $ env RUSTFLAGS="-C target-cpu=native" cargo run --release | wl-copy
warning: unused variable: `temp`
  --> src/main.rs:16:9
   |
16 |     let temp = _mm_add_epi16(hi_128, lo_128);
   |         ^^^^ help: if this is intentional, prefix it with an underscore: `_temp`
   |
   = note: `#[warn(unused_variables)]` on by default

warning: 1 warning emitted

    Finished release [optimized] target(s) in 0.00s
     Running `target/release/simd`
[src/main.rs:11] std::mem::transmute::<_, [i16; 16]>(ms_256) = [
0,
1,
4,
9,
16,
25,
36,
49,
64,
81,
100,
121,
144,
169,
196,
225,
]
[src/main.rs:14] std::mem::transmute::<_, [i16; 8]>(hi_128) = [
0,
1,
4,
9,
16,
25,
36,
49,
]
[src/main.rs:15] std::mem::transmute::<_, [i16; 8]>(lo_128) = [
64,
81,
100,
121,
144,
169,
196,
225,
]
While I understand SIMD conceptually, I'm not familiar with the exact instructions and intrinsics.
I know that I need to multiply the two vectors and then horizontally sum the products by repeatedly splitting the result in half and vertically adding the two halves, each half the size of the previous vector.
I've found the madd instruction (_mm256_madd_epi16), which supposedly performs the first such summation right after the multiplications, but I'm not sure what to do with its result.
If I use mul (_mm256_mullo_epi16) instead of madd, I'm not sure which instructions to use to reduce the result further.
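To make the madd idea above concrete, here is a minimal sketch of how the reduction might look, not a definitive implementation. It assumes the dot product fits in an i32 accumulator. The function names `dot_product_scalar`, `dot_product_avx2` and `dot_product` are hypothetical and chosen only for illustration. `_mm256_madd_epi16` multiplies the sixteen i16 pairs and adds adjacent products, which leaves eight i32 partial sums; those are then halved to 4, 2 and finally 1 lane.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: widen each i16 to i32 before multiplying.
fn dot_product_scalar(a: &[i16], b: &[i16]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// AVX2 sketch (assumption: the total fits in i32; SIMD adds wrap on overflow).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_product_avx2(a: &[i16], b: &[i16]) -> i32 {
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 16;
    let mut acc = _mm256_setzero_si256();
    for i in 0..chunks {
        // Unaligned loads: an i16 slice is only guaranteed 2-byte alignment.
        let va = _mm256_loadu_si256(a.as_ptr().add(i * 16) as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr().add(i * 16) as *const __m256i);
        // madd: 16 i16 products, adjacent pairs summed into 8 i32 lanes.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    // Horizontal sum, 8 lanes -> 4: add the low and high 128-bit halves.
    let lo = _mm256_castsi256_si128(acc);
    let hi = _mm256_extracti128_si256(acc, 1);
    let sum4 = _mm_add_epi32(lo, hi);
    // 4 lanes -> 2: swap the two 64-bit halves and add.
    let sum2 = _mm_add_epi32(sum4, _mm_shuffle_epi32(sum4, 0b01_00_11_10));
    // 2 lanes -> 1: swap adjacent 32-bit lanes and add.
    let sum1 = _mm_add_epi32(sum2, _mm_shuffle_epi32(sum2, 0b10_11_00_01));
    let simd_part = _mm_cvtsi128_si32(sum1);
    // Remaining elements (fewer than 16) are handled by the scalar code.
    let tail = chunks * 16;
    simd_part + dot_product_scalar(&a[tail..], &b[tail..])
}

/// Dispatches to AVX2 when the CPU supports it, otherwise to the scalar version.
fn dot_product(a: &[i16], b: &[i16]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { dot_product_avx2(a, b) };
        }
    }
    dot_product_scalar(a, b)
}

fn main() {
    let a: [i16; 16] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    let b = a;
    println!("{}", dot_product(&a, &b)); // prints 1240
}
```

Accumulating in i32 also sidesteps the overflow that the mullo route would hit for larger inputs, since `_mm256_mullo_epi16` keeps only the low 16 bits of each product.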
Any help is welcome!
PS: I've tried packed_simd, but it doesn't seem to work on stable Rust.