
Because of the length of the array, I can't use i32::from_ne_bytes() directly. Of course, the following works, especially since the code will run only on CPU architectures that support unaligned access (or where, because of the small length, the whole array might get stored across several CPU registers).

fn main() {
    let buf: [u8; 10] = [0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6];
    // Read 4 bytes starting at index 1 without assuming alignment.
    println!("1 == {}", unsafe { std::ptr::read_unaligned(buf.as_ptr().add(1) as *const i32) });
}

But is there a cleaner way to do it while still not copying the array?

user2284570

2 Answers


Extract a 4-byte &[u8] slice and use try_into() to convert it into a [u8; 4] array. Then you can call i32::from_ne_bytes().

use std::convert::TryInto;

fn main() {
    let buf: [u8; 10] = [0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6];
    println!("{}", i32::from_ne_bytes((&buf[1..5]).try_into().unwrap()));
}

Output (on a little-endian machine):

302055424
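Note that 302055424 is 0x12010000, which is what from_ne_bytes produces on a little-endian host. If the byte order matters to you, from_le_bytes and from_be_bytes make it explicit; a small sketch:

```rust
use std::convert::TryInto;

fn main() {
    let buf: [u8; 10] = [0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6];
    let bytes: [u8; 4] = buf[1..5].try_into().unwrap();
    // Explicit byte orders, independent of the host CPU:
    assert_eq!(i32::from_le_bytes(bytes), 0x12010000); // 302055424
    assert_eq!(i32::from_be_bytes(bytes), 0x00000112); // 274
    println!("{}", i32::from_le_bytes(bytes));
}
```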

Playground

vallentin
John Kugelman
  • If by *copy* you mean *load from memory into a register*. [Compiler explorer](https://rust.godbolt.org/z/74j7Yv) shows that after checking that the buffer is big enough, it's literally just an `lea` instruction. – kmdreko Dec 28 '20 at 18:39
  • @user2284570 Did you include the `use std::convert::TryInto` statement? It works fine on the [playground using nightly](https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=375e84259b055379469fe4342c638405). – John Kugelman Dec 28 '20 at 23:01

TL;DR: Realistically, just use John Kugelman's solution, copying 4 bytes is not measurable.

The biggest "measured" difference is 0.09 ps (239.79 − 239.70). That's 90 femtoseconds, or 0.00009 nanoseconds. Running the benchmark again will yield wildly different results (in the picoseconds range).

Measuring something as small as copying 4 bytes in isolation is not realistic. We're so far below nanoseconds that this is pure noise.

test           #[bench]   criterion
try_into       0 ns       239.79 ps
reinterpret    0 ns       239.70 ps
bit unpack     0 ns       239.74 ps
b.iter(|| 1)              240.18 ps
b.iter(|| 1)              239.73 ps
b.iter(|| 1)              239.68 ps

For fun, change all the tests to b.iter(|| 1), and you'll receive similar results fluctuating in picoseconds.

The biggest difference of the b.iter(|| 1) tests, results in 0.5 ps (240.18 - 239.68). That's a "measured" difference of 0.5 ps. That's 500 femtoseconds, or 0.0005 nanoseconds.

That's literally a bigger difference than when we did "actual" work. This is pure noise.


You're talking about copying 4 bytes. This isn't going to be measurable, even if "every µs matters". This alone isn't going to be measurable in microseconds, and neither in nanoseconds.

(I'll avoid reiterating what's already been said in the comments.)

If you don't want to use TryInto, then you can use some good old bit unpacking and bit shifting. This assembles the value in little-endian order, which matches from_ne_bytes on little-endian hosts. (Out-of-bounds indexing will cause a panic.)

let i = (buf[1] as i32) |
        (buf[2] as i32) <<  8 |
        (buf[3] as i32) << 16 |
        (buf[4] as i32) << 24;
println!("{}", i);
// Prints `302055424`

Alternatively, you can reinterpret the bytes starting at buf[1] as an i32 with a raw pointer read. Note that a plain &*(ptr as *const i32) dereference would be undefined behavior here, since the pointer is not 4-byte aligned; std::ptr::read_unaligned handles that. (Also, out-of-bounds access through raw pointers is undefined behavior, not a panic.)

// Read 4 bytes starting at index 1, without requiring alignment.
let i = unsafe { std::ptr::read_unaligned(buf.as_ptr().add(1) as *const i32) };
println!("{}", i);
// Prints `302055424`

So you want the best performing solution for copying 4 bytes. Alright, let's take John Kugelman's solution and the previous 2 and benchmark them.

// benches/bench.rs
#![feature(test)]

extern crate test;
use test::Bencher;

use std::convert::TryInto;

#[bench]
fn bench_try_into(b: &mut Bencher) {
    b.iter(|| {
        let buf: [u8; 10] = [0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6];
        i32::from_ne_bytes((&buf[1..5]).try_into().unwrap())
    });
}

#[bench]
fn bench_reinterpret(b: &mut Bencher) {
    b.iter(|| {
        let buf: [u8; 10] = [0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6];
        unsafe { std::ptr::read_unaligned(buf.as_ptr().add(1) as *const i32) }
    });
}

#[bench]
fn bench_bit_unpack(b: &mut Bencher) {
    b.iter(|| {
        let buf: [u8; 10] = [0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6];
        (buf[1] as i32) | (buf[2] as i32) << 8 | (buf[3] as i32) << 16 | (buf[4] as i32) << 24
    });
}

Now let's benchmark by executing cargo +nightly bench.

running 3 tests
test bench_bit_unpack  ... bench:           0 ns/iter (+/- 0)
test bench_reinterpret ... bench:           0 ns/iter (+/- 0)
test bench_try_into    ... bench:           0 ns/iter (+/- 0)

Like I presumed, copying 4 bytes isn't going to be measurable.
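A likely contributor to the flat 0 ns readings is that buf is a compile-time constant, so the optimizer can fold each closure body down to a constant. Routing the input through a black-box identity (test::black_box on nightly at the time, or std::hint::black_box on recent stable Rust) forces the load and conversion to actually happen. A minimal sketch of the idea, runnable on stable:

```rust
use std::convert::TryInto;
use std::hint::black_box;

fn main() {
    // black_box is an identity function the optimizer must treat as opaque,
    // so the conversion below cannot be constant-folded away.
    let buf: [u8; 10] = black_box([0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6]);
    let i = i32::from_ne_bytes(buf[1..5].try_into().unwrap());
    println!("{}", i);
}
```

Even with black_box, the work here is a handful of instructions, so the broader point stands: the differences remain noise.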


Now, let's try and benchmark with criterion. Maybe the test crate is (being realistic and) limited to nanoseconds, who knows.

// benches/bench.rs
use criterion::{criterion_group, criterion_main, Criterion};
use std::convert::TryInto;

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("try_into", |b| {
        b.iter(|| {
            let buf: [u8; 10] = [0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6];
            i32::from_ne_bytes((&buf[1..5]).try_into().unwrap())
        })
    });

    c.bench_function("reinterpret", |b| {
        b.iter(|| {
            let buf: [u8; 10] = [0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6];
            unsafe { std::ptr::read_unaligned(buf.as_ptr().add(1) as *const i32) }
        })
    });

    c.bench_function("bit_unpack", |b| {
        b.iter(|| {
            let buf: [u8; 10] = [0, 0, 0, 1, 0x12, 14, 50, 120, 250, 6];
            (buf[1] as i32) | (buf[2] as i32) << 8 | (buf[3] as i32) << 16 | (buf[4] as i32) << 24
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

# Cargo.toml
[dev-dependencies]
criterion = "0.3.3"

[[bench]]
name = "bench"
harness = false

Now, let's benchmark by executing cargo bench.

try_into                time:   [239.69 ps 239.79 ps 239.91 ps]
                        change: [+0.0101% +0.0700% +0.1316%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

reinterpret             time:   [239.63 ps 239.70 ps 239.78 ps]
                        change: [-0.7006% -0.2163% +0.0525%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

bit_unpack              time:   [239.65 ps 239.74 ps 239.84 ps]
                        change: [-0.0768% +0.0775% +0.2867%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe

test           #[bench]   criterion
try_into       0 ns       239.79 ps
reinterpret    0 ns       239.70 ps
bit unpack     0 ns       239.74 ps

So the mean measurements are 239.79 ps, 239.70 ps, and 239.74 ps, making the biggest "measured" difference 0.09 ps. That's 90 femtoseconds, or 0.00009 nanoseconds. Running the benchmark again will yield different results. Measuring something as small as copying 4 bytes on its own is not realistic.

Sure, in that instant "reinterpret" was the "fastest" but we're so far below nanoseconds that this is pure noise.

Use the solution you prefer, there isn't any measurable or significant performance difference between them.


For fun, change all the tests to b.iter(|| 1), and you'll receive similar results fluctuating in picoseconds.

c.bench_function("1", |b| b.iter(|| 1_i32));
c.bench_function("2", |b| b.iter(|| 1_i32));
c.bench_function("3", |b| b.iter(|| 1_i32));

Running the benchmark will result in similar results. I ran it once and got 240.18 ps, 239.73 ps, and 239.68 ps. That's a "measured" difference of 0.5 ps. That's 500 femtoseconds, or 0.0005 nanoseconds.

That's literally a bigger difference than when we did "actual" work. Again, this is pure noise. This isn't enough work to be measurable in any significant way.

Again, use the solution you prefer, there isn't any measurable or significant performance difference between them.

vallentin
  • Yes, I know, but usually the problem is when new memory is accessed after being allocated (which is how many clone-like functions work); in that case, the delay for assigning physical memory to the virtual memory is in the hundreds-of-ns range, hence the zero-allocation programming model found in high-frequency trading, which is the real thing I'm interested in. **But the close vote on the question confirms I should stay with the current simplification and not confuse readers.** – user2284570 Dec 28 '20 at 20:59