
I'm writing a linear algebra library in Rust.

I have a function to get a reference to a matrix cell at a given row and column. This function starts with a pair of assertions that the row and column are within bounds:

#[inline(always)]
pub fn get(&self, row: usize, col: usize) -> &T {
    assert!(col < self.num_cols.as_nat());
    assert!(row < self.num_rows.as_nat());
    unsafe {
        self.get_unchecked(row, col)
    }
}

In tight loops, I thought it might be faster to skip the bounds check, so I provide a get_unchecked method:

#[inline(always)]
pub unsafe fn get_unchecked(&self, row: usize, col: usize) -> &T {
    self.data.get_unchecked(self.row_col_index(row, col))
}

The bizarre thing is, when I use these methods to implement matrix multiplication (via row and column iterators), my benchmarks show that it actually goes about 33% faster when I check the bounds. Why is this happening?
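For context, here is a minimal, self-contained sketch of the kind of code involved. This is not the actual lib.rs (the struct, `row_col_index`, and `mul_naive` are illustrative names, and the real benchmark uses iterators rather than index loops), but it shows where the bounds checks sit relative to the hot loop:

```rust
pub struct Matrix {
    data: Vec<f64>,
    num_rows: usize,
    num_cols: usize,
}

impl Matrix {
    fn row_col_index(&self, row: usize, col: usize) -> usize {
        // Row-major layout.
        row * self.num_cols + col
    }

    #[inline(always)]
    pub fn get(&self, row: usize, col: usize) -> &f64 {
        assert!(col < self.num_cols);
        assert!(row < self.num_rows);
        unsafe { self.get_unchecked(row, col) }
    }

    #[inline(always)]
    pub unsafe fn get_unchecked(&self, row: usize, col: usize) -> &f64 {
        self.data.get_unchecked(self.row_col_index(row, col))
    }

    // Naive index-based multiplication; the loop bounds match the
    // assertion conditions in `get` exactly.
    pub fn mul_naive(&self, other: &Matrix) -> Matrix {
        assert_eq!(self.num_cols, other.num_rows);
        let mut data = vec![0.0; self.num_rows * other.num_cols];
        for i in 0..self.num_rows {
            for j in 0..other.num_cols {
                let mut acc = 0.0;
                for k in 0..self.num_cols {
                    acc += *self.get(i, k) * *other.get(k, j);
                }
                data[i * other.num_cols + j] = acc;
            }
        }
        Matrix { data, num_rows: self.num_rows, num_cols: other.num_cols }
    }
}
```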

I've tried this on two different computers, one running Linux and the other OS X, and both show the effect.

The full code is on GitHub. The relevant file is lib.rs. Functions of interest are:

  • get at line 68
  • get_unchecked at line 81
  • next at line 551
  • mul at line 796
  • matrix_mul (benchmark) at line 1038

Note that I'm using type-level numbers to parameterise my matrices (with the option for dynamic sizes too via dummy tagged types), so the benchmark is multiplying two 100x100 matrices.

UPDATE:

I've significantly simplified the code, removing stuff not directly used in the benchmark and removing generic parameters. I also wrote an implementation of multiplication without using iterators, and that version doesn't cause the same effect. See here for this version of the code. Cloning the minimal-performance branch and running cargo bench will benchmark the two different implementations of multiplication (note that the assertions are commented out to start with in that branch).

Also of note is that if I change the get* functions to return copies of the data instead of references (f64 instead of &f64), the effect disappears (but the code is slightly slower all round).
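For illustration, a sketch of what the copy-returning variant might look like (the names `get_val` and `get_val_unchecked` are hypothetical, not from the actual code):

```rust
pub struct Matrix {
    data: Vec<f64>,
    num_rows: usize,
    num_cols: usize,
}

impl Matrix {
    #[inline(always)]
    pub fn get_val(&self, row: usize, col: usize) -> f64 {
        assert!(col < self.num_cols);
        assert!(row < self.num_rows);
        unsafe { self.get_val_unchecked(row, col) }
    }

    #[inline(always)]
    pub unsafe fn get_val_unchecked(&self, row: usize, col: usize) -> f64 {
        // Returning f64 by value instead of &f64 means no borrow of the
        // matrix escapes the call; this is the variant for which the
        // effect reportedly disappears.
        *self.data.get_unchecked(row * self.num_cols + col)
    }
}
```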

  • The old question again: have you compiled with compiler optimizations (with the `--release` flag)? ;) – Lukas Kalbertodt May 29 '16 at 22:46
  • Without having any idea at all regarding rust: is your benchmarking sane? Cache-effects, variance, test-data synchronization... – sascha May 29 '16 at 23:19
  • @LukasKalbertodt Yes, I run my benchmarks with `cargo bench`, which automatically compiles with release. – Bradley Hardy May 30 '16 at 07:30
  • @sascha `cargo bench` gives me variance as `+/- X`, and the `X` is nowhere near enough to compensate. My data is first created, then the benchmark starts and runs however many times `cargo bench` chooses over references to the same data. Both benchmarks are run under exactly the same conditions, with the only difference being that I comment out the `assert` lines to remove the bounds checking. – Bradley Hardy May 30 '16 at 07:40
  • It would be awesome if you could find a minimal example that still shows this strange performance property. That would increase the probability for someone answering. I'm super interested in the problem, but don't have the time to dig into it now. – Lukas Kalbertodt May 30 '16 at 08:04
  • @LukasKalbertodt See my update. It's maybe not as minimal as you'd like, but the code is down to ~200 lines from ~1000, and I've scrapped the generics which make things harder to follow. – Bradley Hardy May 30 '16 at 09:14
  • Perhaps the compiler can more aggressively optimize when you check bounds? – HiDefender Oct 04 '16 at 11:34
  • For stuff like that it's best to compile both standalone binaries in release mode and use `objdump -D` to identify the machine instructions used in the relevant tight loops. – dpc.pw Oct 17 '16 at 23:42
  • You have `unsafe fn get_unchecked()` vs. `fn get()` so my guess is that due to the inlining you get some weird behaviour there. You might want to try and make both functions unsafe as a whole and/or expose a safe interface in both places. – benaryorg Nov 05 '16 at 02:47
  • As @dpc.pw mentioned, `objdump` is useful, but you might also want to try using https://play.rust-lang.org and generate the MIR, LLVM IR and Assembly (in that order) and see if you can spot something. Look at both the functions and give the inlining a closer look, because, well, you're inlining `unsafe`. – benaryorg Nov 05 '16 at 02:50
  • It could be as stupid as the extra bounds checks make the loop align to the loop-buffer and that way run faster. – Surt Dec 01 '16 at 21:13
  • @Surt, How? More exact: bounds checks like this should be eliminated by LLVM. After all, the loop conditions in the matrix multiplications match the bound check conditions perfectly, so the elimination is trivial. The net result is that the code is *exactly* the same - except for the way `unsafe` is used (and compiled...). In other words, I would guess the speedup comes from the plumbing like pinning etc that is eliminated from the inner loop somehow. – atlaste Jan 23 '17 at 20:08
  • @Surt Shouldn't LLVM just correct the alignment? If you are right, it's probably a bug. – Shambhav Jan 31 '22 at 09:52

1 Answer


It's not a complete answer, because I haven't tested my claims, but this might explain it. Either way, the only way to know for sure is to generate the LLVM IR and the assembler output and inspect them. If you need a manual for LLVM IR, you can find it here: http://llvm.org/docs/LangRef.html.

Anyway, enough about that. Let's say you have this code:

#[inline(always)]
pub unsafe fn get_unchecked(&self, row: usize, col: usize) -> &T {
    self.data.get_unchecked(self.row_col_index(row, col))
}

The compiler turns this into an indirect load, which will probably be optimized well in a tight loop. Note, though, that each such load can go wrong: if the computed index is out of bounds, it reads invalid memory, which is undefined behaviour.

In the case with the bounds check combined with the tight loop, LLVM does a little trick. Because the load is in a tight loop (a matrix multiplication) and because the result of the bounds check depends only on the bounds of the loop, it can hoist the check out of the loop: the loop body itself remains exactly the same, but a single bounds check is performed up front instead of one per iteration.
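The hoisting idea can be illustrated with a toy example (a sketch of the concept, not the actual codegen; `sum_checked` and `sum_hoisted` are made-up names). Both functions compute the same result, but the second performs the check once, before the loop:

```rust
// Per-iteration check: conceptually, every pass through the loop
// verifies the index before loading.
fn sum_checked(data: &[f64], n: usize) -> f64 {
    let mut s = 0.0;
    for i in 0..n {
        assert!(i < data.len()); // checked every iteration
        s += data[i];
    }
    s
}

// Hoisted check: one comparison up front is enough to prove that
// every index the loop produces is in bounds, so the body can use
// an unchecked load.
fn sum_hoisted(data: &[f64], n: usize) -> f64 {
    assert!(n <= data.len()); // single check, outside the loop
    let mut s = 0.0;
    for i in 0..n {
        // Safe because i < n <= data.len() is guaranteed above.
        s += unsafe { *data.get_unchecked(i) };
    }
    s
}
```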

In other words, the two versions end up with essentially the same inner loop; the differences lie in what surrounds it.

So what changed? Two things:

  1. With the additional bounds check in place, there's no possibility of an out-of-bounds load anymore. This might enable an optimization that wasn't possible before. Still, considering how these checks are usually implemented, this wouldn't be my guess.

  2. Another thing to consider is that the keyword `unsafe` might trigger some additional behaviour, such as an extra condition or pinning data. I'm not sure how exactly Rust compiles this; the only way to find out these details is to look at the LLVM IR.

atlaste