
I'm coming from C, and I wonder whether Rust's bounds checking affects performance. It presumably needs some additional assembly instructions for every access, which could hurt when processing lots of data.

On the other hand, memory access is usually the costly part of processor performance, so a few extra arithmetic instructions might not hurt. Then again, once a cache line is loaded, sequential access is very fast, so the checks might matter after all.

Has somebody benchmarked this?

• Why not benchmark it yourself on your machine, where the results will be most accurate for you? In general, it's pointless worrying about really low-level performance impacts until you absolutely have to. Trying to pre-optimise usually leads to complicated code that has no real performance benefit. *Benchmarks can't simulate and don't reflect real usage*. – hnefatl Nov 28 '17 at 23:08
• I'll maybe benchmark it later and answer this question myself, but first I have to learn the basics (how does time measurement in Rust work? :-D). I just thought this might be theoretically interesting, as I came across this topic in my studies with other languages like C or C++ in HPC. – Jounathaen Nov 28 '17 at 23:13
• Rust doesn't always perform a bounds check. I think iterating over an array doesn't result in a bounds check for every item. And if you need to access individual items by index, there are unsafe functions for unchecked access. – Pavel Strakhov Nov 28 '17 at 23:22
• @hnefatl: if you don't know a lot of details about how something might work, you can't design a benchmark that actually measures what you want to measure. It's *very* easy for microbenchmarks to end up measuring something totally different from what the author intended. "Just benchmark" replies aren't as helpful as most people seem to think, especially when we're talking about a microbenchmark. E.g. you might pick a case where there happened to be no overhead for bounds checking, and miss the fact that there is overhead in other cases that look similar, or that are very different. – Peter Cordes Nov 29 '17 at 03:41
• @PeterCordes Yeah, the last part of my comment in italics was meant to be a bit of a "disclaimer" about that, implying that "invented" benchmarks in general don't catch the cases that occur in production code. – hnefatl Nov 29 '17 at 08:48

1 Answer


Unfortunately, the cost of a bounds check is not a straightforward thing to estimate. It is certainly not "one cycle per check", or any similarly easy-to-guess cost. It will have a nonzero impact, but that impact might well be insignificant.

In theory, it would be possible to measure the cost of bounds checking on basic types like `Vec` by modifying Rust to disable them and running a large-scale ecosystem test. This would give some kind of rule of thumb, but without actually doing it, it's quite hard to know whether the overhead would be closer to ten percent or to a tenth of a percent.

There are some ways you can do better than timing and guessing, though. These rules of thumb apply mostly to desktop-class hardware; lower-end hardware, or hardware targeting a different niche, will have different characteristics.

If your indices are derived from the container size, there is a good chance that the compiler can eliminate the bounds checks entirely. At that point, the only cost of the bounds checks in a release build is that they occasionally get in the way of other optimizations; they could impede them significantly, but normally they don't.
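As a rough sketch (the function names are mine, and check elimination depends on the optimiser, so none of this is guaranteed), both of the following typically compile to check-free loops in release builds:

```rust
// The index is derived from the slice's own length, so the optimiser can
// usually prove `i < data.len()` holds and drop the per-access check.
fn sum_indexed(data: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..data.len() {
        total += data[i]; // bounds check typically eliminated
    }
    total
}

// The idiomatic iterator form sidesteps indexing altogether.
fn sum_iter(data: &[u64]) -> u64 {
    data.iter().sum()
}
```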

If your code is branchy, memory-access-heavy, or otherwise hard to optimise, and the bounds to check against are easy to access, there is a good chance that the bounds checking will happen mostly in the CPU's spare bandwidth, with branch prediction helping out specifically. In that case the overall cost will be particularly small, especially compared to the cost of the rest of the code.

If the bounds you check against are behind several layers of pointers, it is plausible that you will hit memory-latency issues and suffer correspondingly. However, it is also plausible that the CPU's speculation and prediction machinery will manage to hide this; it is very context-dependent. If you are taking references to the data inside, rather than dereferencing it at the same time as the bounds check, this risk is magnified.
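A hypothetical illustration of that shape (the nesting here is contrived for brevity): each indexing step below must load a length through another pointer before it can compare, putting the checks on the memory-latency path unless speculation hides them.

```rust
// Two bounds checks, each behind a pointer dereference: the lengths of
// `grid` and of `grid[row]` must both be loaded before the data access.
fn cell(grid: &[Vec<u32>], row: usize, col: usize) -> u32 {
    grid[row][col]
}
```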

If your bounds checks are in a tight arithmetic loop that doesn't saturate the core, you aren't likely to hurt throughput directly except by impeding other compiler optimisations. However, impeding other compiler optimisations can be arbitrarily bad, anywhere from no difference to preventing SIMD and causing a factor-10 slowdown.

If your bounds checks are in a tight arithmetic loop that does saturate the core, you take on the above risk and have a direct execution penalty of roughly half a cycle per bounds check.
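A common mitigation for such loops, sketched below on the assumption that the optimiser cooperates (the function is a made-up example, not a guarantee): re-slice once up front so there is a single hoistable check per slice rather than one per iteration, which also makes autovectorisation more likely.

```rust
// The up-front re-slice lets the compiler prove that every access in the
// loop body is in bounds, so the hot loop itself carries no checks.
fn saxpy(a: f32, xs: &[f32], ys: &mut [f32]) {
    let n = xs.len().min(ys.len());
    let (xs, ys) = (&xs[..n], &mut ys[..n]); // one check per slice, done once
    for i in 0..n {
        ys[i] = a * xs[i] + ys[i]; // provably in bounds after the re-slice
    }
}
```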

If your code is large enough to stress the instruction cache, then you need to worry about the impact of the checks on code size. This is normally modest, but its runtime impact is particularly hard to measure.

Peter Cordes adds some further points in the comments below. First, bounds checks imply loads and stores, so you will be running a mixed workload, which is most likely to bottleneck on issue/rename throughput. Second, even branches that are predicted correctly and executed in parallel consume resources in the predictor, which can cause other branches to be predicted worse.

This might seem intimidating, and it is. That is why it's important to measure and understand your performance at the level that is relevant for you and your code.

It is also the case that, since Rust was "born" with bounds checking, it has grown means of reducing their cost, such as pervasive zero-cost references, iterators (which absorb, but don't actually remove, bounds checks), and an unusually rich set of utility functions. If you find yourself hitting a pathological case, Rust also offers unsafe escape hatches.
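For illustration only (`dot` and `gather` are invented examples, not anything prescribed by the answer), a sketch of both routes:

```rust
// Iterators fold the bounds check into the loop condition itself.
fn dot(xs: &[f64], ys: &[f64]) -> f64 {
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}

// The unsafe escape hatch, for when the compiler cannot prove the indices
// are in range (here, a gather through an index array).
//
// SAFETY contract: every value in `indices` must be < `data.len()`.
unsafe fn gather(data: &[u64], indices: &[usize]) -> u64 {
    indices.iter().map(|&i| unsafe { *data.get_unchecked(i) }).sum()
}
```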

Veedrac
• "saturate the core's functional units": usually you bottleneck on the front-end's ability to decode instructions and feed them to the out-of-order core: "issue/rename" throughput is usually the narrowest part of the pipe, and the bottleneck you're most likely to hit in code with a mix of loads, stores, and ALU. (And if you need bounds checks, there must be some loads and/or stores.) I guess you could bottleneck on load-port throughput in code with lots of loads that also need bounds checks, but hit in L1D cache. – Peter Cordes Nov 29 '17 at 03:32
• More branches (even ones that always go the same way) may have a small negative impact on branch-prediction rates. They dilute the branch history for the interesting branches. Anyway, +1, nice answer. As always, the answer to performance questions is "it depends on context" :P (And the target hardware. Modern x86 is fantastic at plowing through fluff, especially off the critical path, but low-power CPUs have narrower pipes. Even Xeon Phi / Silvermont could see more impact from bounds checking.) – Peter Cordes Nov 29 '17 at 03:34
• @cmaster Rust uses `usize` indices, so you don't need to check for `>= 0`, and the bounds to check against are going to be loaded already in almost every throughput-bound situation (because you're probably in a tight loop, and the load was hoisted). – Veedrac Nov 29 '17 at 10:32
• Oh, that's an interesting point. If you ever compile for obsolete 32-bit x86, the extra register pressure from array bounds is problematic (7 GP regs + stack pointer). Fortunately x86 has fast L1D caches (2 loads per clock on big-core CPUs, not Atom/Xeon Phi), so comparing a register against memory is almost exactly as cheap as reg,reg (you can copy the bound to the stack outside the tight loop so you don't need a pointer to the container control-block in a register inside the loop). But register pressure is non-trivial on x86-64 (15 GP regs + RSP) and ARM32 (14 GP regs + stack + PC). – Peter Cordes Nov 29 '17 at 16:26
• What does it mean to "absorb" bounds checks without removing them? If I stick to built-in iteration methods like `map`, `filter`, etc., do I avoid checks? – Joseph Garvin Dec 04 '20 at 21:35
• @JosephGarvin The iterator's `next()` is itself a bounds check. A loop `for i in 0..x.len() { x[i]; }` in effect does a bounds check twice: once when checking `i < x.len()` in the loop condition, and again with `x[i]`, checking the same `i < x.len()`. An iterator combines these two checks into one. – Veedrac Dec 05 '20 at 12:06
• The other issue is that LLVM is a giant black box, and code changes can have unpredictable downstream effects on optimizations. I've seen cases where bounds checks somehow had *negative* cost, that is, the code was faster than the same code without the bounds checks. – Antimony Aug 07 '21 at 02:15