5

I want to compare two vectors of 16 bytes and get every matching index. A small example to illustrate what I want:

fn get_matching_idx(arr1: &[u8], arr2: &[u8]) {
    let vec1 = u8x16::load_aligned(arr1);    
    let vec2 = u8x16::load_aligned(arr2);
    let matches = vec1.eq(vec2);
    for i in 0..16 {
        if matches.extract_unchecked(i) {
            // Do something with the index
        }
    }
}

Ideally, I'd just want to "Do something" for the set indices, rather than checking every single one (there will be a low number of matches).

Is there a way to get the matching indices using intrinsics, rather than iterating through the whole vector? With gcc for example, I could use _mm_movemask_epi8 to bit pack the vector and then repeated applications of __builtin_clz to get the index of the first set bit (which is more performant for sparse numbers which I would have). Alternatively, I could have a lookup table which did the right thing for each nibble in my bit-packed integer (e.g. the first answer here).

Is there an equivalent of these instructions in rust?

I'm compiling for an Intel x86-64 processor and cross platform support is not a requirement.

NOTE: I'd prefer a solution in native (safe) rust, but this is not a hard requirement. I am fine writing unsafe rust, or even using some sort of FFI to link to the aforementioned methods.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
user1413793
  • 9,057
  • 7
  • 30
  • 42
  • 5
    You can use the same intrinsic in Rust via `std::arch`: https://doc.rust-lang.org/nightly/core/arch/x86_64/fn._mm_movemask_epi8.html --- Note that this is a nightly only API, but is planned to be stabilized soon. If you need to do this on stable Rust, then the easiest path is to probably right your SIMD routine in C. – BurntSushi5 Apr 10 '18 at 18:02

1 Answers1

1

std::arch contains an exhaustive set of intrinsic operations. This can be done using core::arch and std::simd as follows:

use std::arch::x86_64::{self, __m128i};
use std::simd::{u8x16, FromBits};

unsafe fn get_matching_idx(arr1: &[u8], arr2: &[u8]) -> u32 {
    let vec1 = __m128i::from_bits(u8x16::load_aligned_unchecked(arr1));
    let vec2 = __m128i::from_bits(u8x16::load_aligned_unchecked(arr2));
    return x86_64::_mm_movemask_epi8(x86_64::_mm_cmpeq_epi8(vec1, vec2)) as u32;
}

fn main() {
    // let arr1 = ...
    // let arr2 = ...

    unsafe {
        let mut mask = get_matching_idx(arr1, arr2);
    }
    let mut delta_i = 0;
    // This assumes a little endian machine (note it counts trailing 0s)
    while group_mask > 0 {
        let tz = x86_64::_mm_tzcnt_32(mask);
        let i = tz + delta_i;
        // Do something...
        group_mask >>= tz + 1;
        delta_i += tz + 1;
    }
}
user1413793
  • 9,057
  • 7
  • 30
  • 42
  • Why `core::arch` and not `std::arch`? Also, why `_mm_tzcount_32` and not [`u32::trailing_zeroes`](https://doc.rust-lang.org/std/primitive.u32.html#method.trailing_zeros)? – Matthieu M. Apr 11 '18 at 08:12
  • It looks like _mmtzcount_32 just calls the assembly instruction tzcntl while u32::trailing_zeros does more (but feel free to correct me if I'm wrong): https://play.rust-lang.org/?gist=8008a2a62494c2b8d55fc1c89d3ca401&version=nightly – user1413793 Apr 11 '18 at 20:54
  • Updated to use std::arch – user1413793 Apr 11 '18 at 20:55
  • `trailing_zeros` has a check for a 0 argument; I am not sure about the intrinsic behavior with a 0 argument, my experience is with gcc intrinsics in this regard which specify that a 0 argument leads to undefined behavior. – Matthieu M. Apr 12 '18 at 06:40
  • 1
    @MatthieuM. `_mm_tzcnt_32` is (by equivalence with the `tzcnt` instruction) defined for a zero input. The GCC built-in is probably unspecified for 0 so it can use an instruction(sequence) that breaks for zero, such as the simplest use of `bsf` (which does not write to its destination when the input is zero) – harold Apr 14 '18 at 22:08