9

I have many 4KiB buffers, which have a 50% chance of containing only zero values. The non-zero buffers will typically have a non-zero byte early in the buffer.

fn is_zero(buf: &Vec<u8>) -> bool {
    for byte in buf.into_iter() {
        if *byte != 0 {
            return false;
        }
    }
    return true;
}

Is this a performant way of checking in Rust, with --release? (I am processing many GBs of data.)

(In the C version, I cast the buffer to unsigned long long before checking. That probably wasn't the best I could do, given SSE etc.)

fadedbee
  • Not directly related to performance, but one thing you should do is change the argument to `&[u8]` to avoid double indirection (if not optimized out in the first place). See [Why is it discouraged to accept a reference to a String (&String), Vec (&Vec), or Box (&Box) as a function argument?](https://stackoverflow.com/q/40006219/9716597). And remove the redundant `.into_iter` as well. – L. F. Dec 19 '20 at 08:35
  • You can write it as `buf.into_iter().all(|&b| b == 0)` which is (IMHO) clear enough that it doesn't require a separate function in the first place. In release mode performance should be equivalent to that of a manual loop. – user4815162342 Dec 19 '20 at 11:14
  • It's a shame that there is no negative-match variant of `memchr`. – Peter Hall Dec 19 '20 at 14:31
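
Putting the first two comments together, the question's function boils down to a one-liner over a byte slice (a sketch of the suggested signature change, not separately benchmarked):

fn is_zero(buf: &[u8]) -> bool {
    // Take a slice instead of &Vec<u8> and drop the redundant into_iter().
    buf.iter().all(|&b| b == 0)
}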

4 Answers

9

You can use `align_to` to reinterpret the slice of u8 as a slice of u128, making the comparison more efficient:

fn is_zero(buf: &[u8]) -> bool {
    // Safety: a u128 has no invalid bit patterns, so reinterpreting the bytes is sound.
    let (prefix, aligned, suffix) = unsafe { buf.align_to::<u128>() };

    prefix.iter().all(|&x| x == 0)
        && suffix.iter().all(|&x| x == 0)
        && aligned.iter().all(|&x| x == 0)
}

Running a simple benchmark on my machine shows a roughly 16x performance gain!

#![feature(test)]
extern crate test;

fn v() -> Vec<u8> {
    std::iter::repeat(0).take(1000000).collect()
}

fn is_zero(buf: &[u8]) -> bool {
    buf.into_iter().all(|&b| b == 0)
}

fn is_zero_aligned(buf: &[u8]) -> bool {
    let (prefix, aligned, suffix) = unsafe { buf.align_to::<u128>() };

    prefix.iter().all(|&x| x == 0)
        && suffix.iter().all(|&x| x == 0)
        && aligned.iter().all(|&x| x == 0)
}

#[bench]
fn bench_is_zero(b: &mut test::Bencher) {
    let v = test::black_box(v());
    b.iter(|| is_zero(&v[..]))
}

#[bench]
fn bench_is_zero_aligned(b: &mut test::Bencher) {
    let v = test::black_box(v());
    b.iter(|| is_zero_aligned(&v[..]))
}

running 2 tests
test tests::bench_is_zero         ... bench:     455,975 ns/iter (+/- 414)
test tests::bench_is_zero_aligned ... bench:      28,615 ns/iter (+/- 116)

Depending on your machine, a different integer type (e.g. u64) may yield better performance.
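
For reference, the u64 variant is the same function with a smaller lane type (an untested sketch; benchmark it on your own hardware):

fn is_zero_aligned_u64(buf: &[u8]) -> bool {
    let (prefix, aligned, suffix) = unsafe { buf.align_to::<u64>() };

    // Identical checks, just with 8-byte lanes instead of 16-byte lanes.
    prefix.iter().all(|&x| x == 0)
        && suffix.iter().all(|&x| x == 0)
        && aligned.iter().all(|&x| x == 0)
}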

Thanks to @Globi on the Rust Discord server for the idea.

Ibraheem Ahmed
4

Found a 4x speedup on my laptop with the byteorder crate, by reading a u64 at a time in native endianness.

lib.rs

extern crate byteorder;

use byteorder::{NativeEndian, ReadBytesExt};
use std::io::Cursor;

pub fn one(buf: &[u8]) -> bool {
    buf.into_iter().all(|&byte| byte == 0)
}

pub fn two(buf: &[u8]) -> bool {
    let mut cur = Cursor::new(buf);
    // Consume 8 bytes at a time for as long as a full u64 is available.
    while let Ok(val) = cur.read_u64::<NativeEndian>() {
        if val != 0 {
            return false;
        }
    }
    // Check the remaining tail bytes (fewer than 8) one by one.
    while let Ok(val) = cur.read_u8() {
        if val != 0 {
            return false;
        }
    }
    true
}

benches/benches.rs

#![feature(test)]

extern crate test;
extern crate zero_slice_8;

use zero_slice_8::{one, two};

fn v() -> Vec<u8> {
    let mut result = vec![];
    for _ in 0..100000 {
        result.push(0);
    }
    result
}

#[bench]
fn bench_one(b: &mut test::Bencher) {
    let v = v();
    b.iter(|| one(&v[..]))
}

#[bench]
fn bench_two(b: &mut test::Bencher) {
    let v = v();
    b.iter(|| two(&v[..]))
}
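
As an aside that is not part of the benchmark above, a dependency-free sketch of the same idea can use chunks_exact and u64::from_ne_bytes from the standard library (the name three is made up here to follow the naming above):

use std::convert::TryInto;

pub fn three(buf: &[u8]) -> bool {
    let mut chunks = buf.chunks_exact(8);
    // Compare 8 bytes at a time; byte order is irrelevant when testing for zero.
    if !chunks.by_ref().all(|c| u64::from_ne_bytes(c.try_into().unwrap()) == 0) {
        return false;
    }
    // The remainder is the tail of fewer than 8 bytes.
    chunks.remainder().iter().all(|&b| b == 0)
}
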
Peter Blackson
1

The following function is pure safe Rust:

use std::convert::TryInto;

fn is_zero(slice: &[u8]) -> bool {
    for i in (0..slice.len()).step_by(16) {
        if slice.len() - i >= 16 {
            // Reinterpret the next 16 bytes as a u128 and test them in one comparison.
            let arr: [u8; 16] = slice[i..i + 16].try_into().expect("this should always succeed");
            if u128::from_be_bytes(arr) != 0 {
                return false;
            }
        } else {
            // Fewer than 16 bytes remain: check the tail byte by byte.
            for i in i..slice.len() {
                if slice[i] != 0 {
                    return false;
                }
            }
        }
    }
    return true;
}

Specifically, it uses u128::from_be_bytes to turn a [u8; 16] array into a u128 (the byte order is irrelevant when comparing against zero, so from_ne_bytes would do just as well), and uses the TryInto trait to turn a [u8] of the right length into a [u8; 16]; the rest is fairly trivial. The inner loop that handles the tail (the final bytes that don't fill a whole 16-byte chunk) could be unrolled manually, but I doubt it will be a significant performance bottleneck.

Depending on the processor, using u64 or even u32 might be faster; one would have to profile that for oneself.

Zorf
  • Can the `try_into` fail if the slice is not 16-byte aligned? – Peter Hall Dec 19 '20 at 14:40
  • @PeterHall no it won't fail because of alignment; it can only fail because of length. I checked the implementation. You may be thinking of SSE, where some instructions may fail when given an unaligned address. – Peter Blackson Dec 19 '20 at 15:14
  • Hm so what happens on architectures that can't handle misaligned integers? – Peter Hall Dec 19 '20 at 15:59
  • @PeterHall The small array is stored on the stack before it becomes an integer. So on architectures that forbid misaligned integer loads, the array load may become a series of byte loads. However, popular architectures allow misaligned word loads, even if this lack of alignment comes with an additional performance penalty, typically only when loads cross the cache-line or even page boundary. So, good, you found a (small) flaw in the response. – Peter Blackson Dec 19 '20 at 16:42
  • @PeterHall, it wouldn't fail, but that's a very good point for optimization, as a misaligned one would obviously be slower to convert. A proper implementation would also do the head of the list separately until reaching an aligned pointer and go from there. – Zorf Dec 19 '20 at 19:57
1

You could use rayon, a data-parallelism library that seems like a good fit for your use case. It is very simple to use: just change buf.iter() into buf.par_iter(), and rayon does the rest:

use rayon::prelude::*;

fn is_zero_par(buf: &[u8]) -> bool {
    buf.par_iter().all(|&b| b == 0)
}

For a vector of 20 million elements, rayon shows a 7x increase in performance:

#![feature(test)]
use rayon::prelude::*;
extern crate test;

fn v() -> Vec<u8> {
    std::iter::repeat(0).take(20000000).collect()
}

fn is_zero(buf: &[u8]) -> bool {
    buf.into_iter().all(|&b| b == 0)
}

fn is_zero_par(buf: &[u8]) -> bool {
    buf.par_iter().all(|&b| b == 0)
}

#[bench]
fn bench_is_zero(b: &mut test::Bencher) {
    let v = test::black_box(v());
    b.iter(|| is_zero(&v[..]))
}

#[bench]
fn bench_is_zero_par(b: &mut test::Bencher) {
    let v = test::black_box(v());
    b.iter(|| is_zero_par(&v[..]))
}

running 2 tests
test tests::bench_is_zero     ... bench:   7,217,686 ns/iter (+/- 478,845)
test tests::bench_is_zero_par ... bench:   1,080,959 ns/iter (+/- 111,692)

Note that the performance impact of multi-threading depends on the workload (number of elements); smaller workloads may actually be slowed down.

Ibraheem Ahmed
  • May not get as much benefit for OP's much smaller buffers though. – Peter Hall Dec 20 '20 at 02:47
  • @PeterHall It would be just as easy to iterate over buffers in parallel, and for each buffer simply iterate over bytes sequentially. However, OP did not provide any code to shed some light on his collection of buffers. – Peter Blackson Dec 20 '20 at 12:40
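
For completeness, a minimal sketch of that per-buffer approach, assuming the buffers are kept in a Vec<Vec<u8>> (the question does not say how they are actually stored):

use rayon::prelude::*;

// Each 4 KiB buffer is scanned sequentially; rayon only splits the work across buffers.
fn all_buffers_zero(buffers: &[Vec<u8>]) -> bool {
    buffers.par_iter().all(|buf| buf.iter().all(|&b| b == 0))
}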