
I'm trying to measure the performance of Box::new():

use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
    }
    println!("Simple sum: {:?}", start.elapsed());
    let start2 = Instant::now();
    for _ in 0..100000 {
        let b = Box::new(42);
        Box::leak(b);
    }
    println!("Many heap calls: {:?}", start2.elapsed());
}

I'm getting:

Simple sum: 1.413291ms
Many heap calls: 6.9935ms

Obviously, the data doesn't look right: Box::new() must be a much heavier operation than just 5× the cost of +=. Where does the optimization kick in, and how do I disable it?

yegor256

  • Does this answer your question? [How can I prevent the Rust benchmark library from optimizing away my code?](https://stackoverflow.com/questions/32385805/how-can-i-prevent-the-rust-benchmark-library-from-optimizing-away-my-code) – cafce25 Mar 31 '23 at 11:08
  • Most allocators have very complex behaviour, and you can only benchmark them for a specific use case. In this particular use case, the allocator is very light-weight. You can hardly ever make conclusions for other use cases of the same allocator, unless you understand the characteristics of the allocator deeply. – Sven Marnach Mar 31 '23 at 11:10
  • Also note that you should usually perform benchmarks with optimizations enabled (your numbers look like unoptimized performance). You should also run the code multiple times and average over the runs with outliers excluded. – Sven Marnach Mar 31 '23 at 11:12
  • I've recently done something similar, `black_box(Box::leak(b) as *mut _)` helped for me, but I did end up just calling `alloc` directly later. If in doubt, also check whether the program's memory usage isn't too low, e.g. on linux via `procfs::process::Process::myself().unwrap().status().unwrap().vmrss.unwrap() * 1024` (But beware that a 4 byte box might actually take 32 bytes because the allocator has a minimum size). – Caesar Mar 31 '23 at 13:18

1 Answer


Make sure to run benchmark code with --release, otherwise the results will be pretty much meaningless.
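
With a plain Cargo project, that just means adding the flag when running the benchmark (a minimal example):

cargo run --release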

In your case, if I run it with --release, I get:

Simple sum: 100ns
Many heap calls: 100ns

This means that the compiler optimized everything away completely, because your loops have zero side effects. If an operation has no observable effect (apart from the time it would take), the compiler is allowed to simply remove it.

Note that the compiler even warns:

warning: variable `sum` is assigned to, but never used
 --> src\main.rs:5:13
  |
5 |     let mut sum = 0;
  |             ^^^
  |
  = note: consider using `_sum` instead
  = note: `#[warn(unused_variables)]` on by default

That said, there are situations where you want to keep an operation even though it has no side effects, like for benchmarking. For that, Rust provides std::hint::black_box, a function that returns exactly what you pass in, but that looks to the compiler like some opaque computation, so the compiler can no longer prove that the input is equal to the output. That prevents the compiler from optimizing the call away, and with it everything that feeds into it.

In your case, this is one example of how you could prevent Rust from optimizing away your loop:

use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
        std::hint::black_box(sum);
    }
    println!("Simple sum: {:?}", start.elapsed());
    let start2 = Instant::now();
    for _ in 0..100000 {
        let b = std::hint::black_box(Box::new(42));
        std::hint::black_box(Box::leak(b));
    }
    println!("Many heap calls: {:?}", start2.elapsed());
}

With black_box in place, I now get:

Simple sum: 27.2µs
Many heap calls: 2.8956ms

Now those numbers make more sense.

To be 100% sure that it didn't optimize away anything important, you can always check the disassembly. Because the assembly surrounding println!() calls is hard to read, it makes sense to extract the benchmarked loops into their own functions.

Be sure to make those functions pub so that they show up in the final assembly; otherwise they might disappear due to inlining.
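
If you don't already have a preferred tool for viewing the assembly, one option is to paste the code into the Compiler Explorer at godbolt.org; alternatively, you can have rustc emit it directly (a minimal sketch, assuming a standard Cargo binary crate):

# the optimized assembly typically lands in target/release/deps/<crate_name>-<hash>.s
cargo rustc --release -- --emit asm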

Here is how this would look:

use std::time::Instant;

pub fn simple_sum() {
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
        std::hint::black_box(sum);
    }
}

pub fn many_heap_calls() {
    for _ in 0..100000 {
        let b = std::hint::black_box(Box::new(42));
        std::hint::black_box(Box::leak(b));
    }
}

fn main() {
    let start = Instant::now();
    simple_sum();
    println!("Simple sum: {:?}", start.elapsed());
    let start2 = Instant::now();
    many_heap_calls();
    println!("Many heap calls: {:?}", start2.elapsed());
}

The release-mode disassembly of these two functions looks like this:

example::simple_sum:
        sub     rsp, 4
        mov     eax, 42
        mov     rcx, rsp
.LBB0_1:
        mov     dword ptr [rsp], eax
        add     eax, 42
        cmp     eax, 4200042
        jne     .LBB0_1
        add     rsp, 4
        ret

example::many_heap_calls:
        push    r15
        push    r14
        push    rbx
        sub     rsp, 16
        mov     ebx, 100000
        mov     r14, qword ptr [rip + __rust_alloc@GOTPCREL]
        lea     r15, [rsp + 8]
.LBB1_1:
        mov     edi, 4
        mov     esi, 4
        call    r14
        test    rax, rax
        je      .LBB1_4
        mov     dword ptr [rax], 42
        mov     qword ptr [rsp + 8], rax
        mov     rax, qword ptr [rsp + 8]
        mov     qword ptr [rsp + 8], rax
        dec     ebx
        jne     .LBB1_1
        add     rsp, 16
        pop     rbx
        pop     r14
        pop     r15
        ret
.LBB1_4:
        mov     edi, 4
        mov     esi, 4
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2

The important parts to notice here are the labels .LBB0_1: and .LBB1_1: together with the jne .LBB0_1 and jne .LBB1_1 instructions, which form the two for loops. This shows that the loops did not get optimized away.

Also note the mov r14, qword ptr [rip + __rust_alloc@GOTPCREL] and call r14, which is the actual call that does the heap allocation. So this one also didn't get optimized away.

Also, notice the interesting-looking cmp eax, 4200042. It shows that the compiler reworked the first loop; instead of doing:

    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
    }

it optimized it into something like

    let mut sum = 0;
    while sum != 4200042 {
        sum += 42;
    }

which does in fact produce the same observable result and reuses the sum variable as the loop counter (100000 additions of 42 give 4200000; the compare happens after the add, so the loop exits once eax reaches 4200000 + 42 = 4200042) :)

Now compare that to how it looked before, without black_box:

use std::time::Instant;

pub fn simple_sum() {
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
    }
}

pub fn many_heap_calls() {
    for _ in 0..100000 {
        let b = Box::new(42);
        Box::leak(b);
    }
}

fn main() {
    let start = Instant::now();
    simple_sum();
    println!("Simple sum: {:?}", start.elapsed());
    let start2 = Instant::now();
    many_heap_calls();
    println!("Many heap calls: {:?}", start2.elapsed());
}

And the disassembly without black_box:

example::simple_sum:
        ret

example::many_heap_calls:
        ret

I don't think this one requires further explanation ;)

Finomnis