
I'm writing a small library in Rust as a 'collision detection' module. While trying to make a line-segment/line-segment intersection test as fast as possible, I've run into a strange discrepancy where seemingly useless code makes the function run faster.

My goal is to optimise this method as much as possible, and I imagine that knowing why the extra code changes performance as significantly as it does would help me achieve that, whether it turns out to be a quirk of the Rust compiler or something else. If anyone is knowledgeable enough to understand this, here's what I have so far:

The two functions:

  fn line_line(a1x: f64, a1y: f64, a2x: f64, a2y: f64, b1x: f64, b1y: f64, b2x: f64, b2y: f64) -> bool {
     let dax = a2x - a1x;
     let day = a2y - a1y;
     let dbx = b2x - b1x;
     let dby = b2y - b1y;

     let dot = dax * dby - dbx * day; // 2D cross product of the two direction vectors
     let dd = dot * dot;

     let nd1x = a1x - b1x;
     let nd1y = a1y - b1y;
     // t and u are the intersection parameters scaled by dot * dot, so they can be
     // range-checked against dd instead of being divided by dot
     let t = (dax * nd1y - day * nd1x) * dot;
     let u = (dbx * nd1y - dby * nd1x) * dot;
     u >= 0.0 && u <= dd && t >= 0.0 && t <= dd
  }
  
  fn line_line_if(a1x: f64, a1y: f64, a2x: f64, a2y: f64, b1x: f64, b1y: f64, b2x: f64, b2y: f64) -> bool {
     let dax = a2x - a1x;
     let day = a2y - a1y;
     let dbx = b2x - b1x;
     let dby = b2y - b1y;

     let dot = dax * dby - dbx * day;
     if dot == 0.0 { return false } // useless branch if dot isn't 0 ?
     let dd = dot * dot;

     let nd1x = a1x - b1x;
     let nd1y = a1y - b1y;
     let t = (dax * nd1y - day * nd1x) * dot;
     let u = (dbx * nd1y - dby * nd1x) * dot;
     u >= 0.0 && u <= dd && t >= 0.0 && t <= dd
  }
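
For context, the multiply-by-dot comparisons above are how I avoid a division: since dd = dot * dot is positive whenever dot is non-zero, checking 0 <= num * dot <= dd should be equivalent to checking 0 <= num / dot <= 1. Here is a division-based sketch of what the test is meant to compute (the line_line_div name is just for illustration, and it assumes dot != 0):

   fn line_line_div(a1x: f64, a1y: f64, a2x: f64, a2y: f64, b1x: f64, b1y: f64, b2x: f64, b2y: f64) -> bool {
      let dax = a2x - a1x;
      let day = a2y - a1y;
      let dbx = b2x - b1x;
      let dby = b2y - b1y;

      // 2D cross product of the direction vectors; zero means parallel/collinear
      let dot = dax * dby - dbx * day;

      let nd1x = a1x - b1x;
      let nd1y = a1y - b1y;
      // the divisions here are what the multiply-by-dot comparisons above avoid
      let t = (dax * nd1y - day * nd1x) / dot;
      let u = (dbx * nd1y - dby * nd1x) / dot;
      u >= 0.0 && u <= 1.0 && t >= 0.0 && t <= 1.0
   }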

Criterion-rs bench code:

   use criterion::{black_box, criterion_group, criterion_main, Criterion};

   fn criterion_benchmark(c: &mut Criterion) {
      c.bench_function("line-line", |b| b.iter(|| line_line(
         black_box(0.0), black_box(1.0), black_box(2.0), black_box(0.0),
         black_box(1.0), black_box(0.0), black_box(2.0), black_box(4.0))));
      c.bench_function("line-line-if", |b| b.iter(|| line_line_if(
         black_box(0.0), black_box(1.0), black_box(2.0), black_box(0.0),
         black_box(1.0), black_box(0.0), black_box(2.0), black_box(4.0))));
   }

   criterion_group!(benches, criterion_benchmark);
   criterion_main!(benches);

Note that with these inputs the `dot == 0.0` condition is false, so the early return is never taken.

   assert_eq!(line_line_if(0.0, 1.0, 2.0, 0.0, 1.0, 0.0, 2.0, 4.0), true); // does not panic

Criterion-rs bench results:

line-line               time:   [5.6718 ns 5.7203 ns 5.7640 ns]
line-line-if            time:   [5.1215 ns 5.1791 ns 5.2312 ns]

Godbolt.org Rust compilation results for these two pieces of code.

Unfortunately I cannot read asm as well as I'd like. I also don't know how to ask rustc, within the website, to compile this without inlining the entire calculation, but if you can do that, I hope it may be of help.
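
From what I've read, marking the functions pub and adding #[inline(never)] and #[no_mangle] should keep rustc from folding the whole calculation into the caller on Godbolt and keep the symbol names readable; a sketch (the line_line_asm name is only for illustration):

   #[inline(never)] // ask rustc/LLVM not to inline the body into callers
   #[no_mangle]     // keep the symbol name readable in the asm output
   pub fn line_line_asm(a1x: f64, a1y: f64, a2x: f64, a2y: f64, b1x: f64, b1y: f64, b2x: f64, b2y: f64) -> bool {
      let dax = a2x - a1x;
      let day = a2y - a1y;
      let dbx = b2x - b1x;
      let dby = b2y - b1y;

      let dot = dax * dby - dbx * day;
      let dd = dot * dot;

      let nd1x = a1x - b1x;
      let nd1y = a1y - b1y;
      let t = (dax * nd1y - day * nd1x) * dot;
      let u = (dbx * nd1y - dby * nd1x) * dot;
      u >= 0.0 && u <= dd && t >= 0.0 && t <= dd
   }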

My current environment is:
- Windows 10 (latest standard)
- Ryzen 3200G CPU
- Standard Rust install (no nightly/beta), compiling for znver1 (last I checked)

Testing is being done with the cargo bench command, which I assume builds with optimizations, but I haven't found confirmation of this anywhere. Criterion-rs is doing all of the benchmark handling.

Thank you in advance if you're able to answer this.

EDIT:

Comparing the code in the Godbolt link with -C opt-level=3 --target x86_64-pc-windows-msvc (the toolchain my installation of rustc currently uses) reveals that the compiler has inserted many unpcklpd and unpckhpd instructions into the function, which may be slowing the code down rather than optimizing it. Setting opt-level to 2 removes the extra unpck instructions without otherwise significantly changing the compilation result. However, after adding opt-level = 2 under [profile.release], which did cause a recompilation, the backwards performance remained; the same happens at opt-level = 1. I do not have target-cpu=native specified.
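
For reference, this is my understanding of the profile override. If I'm reading the Cargo docs right, cargo bench builds with the bench profile, which inherits its settings from release, so the setting may also need to be placed under [profile.bench] explicitly:

   # Cargo.toml (sketch)
   [profile.release]
   opt-level = 2

   # cargo bench builds with the bench profile; as far as I can tell it
   # inherits from release, but this makes the override explicit
   [profile.bench]
   opt-level = 2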

  • IDK why, but Chrome is choking on opening that Godbolt link. Did you remember to enable optimization? If not (or *possibly* even with optimization if the compiler had to store/reload on the critical path for latency), [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) is the usual suspect when extra work speeds something up on an Intel CPU. Speaking of which, what hardware did you test on? e.g. a CPU model number like i7-6700k; OS and compiler versions / options might also be relevant. – Peter Cordes Feb 27 '21 at 12:18
  • I don't understand why you think your control statement is useless – Stargateur Feb 27 '21 at 12:34
  • @PeterCordes I have updated the question with what I hope to be all the relevant information (last paragraphs), and the link should hopefully be fixed. – sfbea Feb 27 '21 at 12:46
  • @Stargateur Technically it might return a result faster if the two lines are collinear. Yes, that can still count as an intersection if they're close enough, and I might want to detect that case as well, but I would still want to know how to preserve the perf gain if I were to change it. This question is not about the implementation, however; this code might have a bug in it for all I know, but I'd still like to understand the perf anomaly. – sfbea Feb 27 '21 at 12:50
  • This doesn't answer my question. Why do you think the `if` should not improve performance? You give us two pieces of code that **don't have the same behavior**, so it's not possible to answer you. You compare apples and bananas. – Stargateur Feb 27 '21 at 12:52
  • Because it performs an extra comparison and creates a branch (which can be mispredicted, potentially causing a substantial loss in performance) while the other doesn't; otherwise the two should be identical when executed with inputs that intersect as expected. (Response to the analogy: this is an apple and an apple with some salt on it, but I didn't expect the apple with salt on it to taste better.) – sfbea Feb 27 '21 at 12:56
  • The only really notable thing about the assembly that I see is the branch itself -- the rest of it looks pretty much the same (I didn't work through the algorithm). Based on that, I don't think we're going to find the answer in the assembly code, which means it's either the CPU itself doing something clever (instruction reordering, maybe? I notice the branch version uses 1 additional register compared to the branchless one) or the data isn't actually what you expect. Can you create a [mre] that demonstrates the difference in performance? – trent Feb 27 '21 at 13:50
  • @trentcl Indeed, the branch appears to be the only main difference, and thus I'd expect it to run slightly slower where the lines aren't collinear. As to your guess of instruction reordering, I don't imagine that would create less work, unless it's removing instructions entirely. As for a minimal reproducible example, while more or less all the code is here, pastebin.com/4GUrZjAT is the entire bench.rs. [edit: incorrect statement was here] – sfbea Feb 27 '21 at 14:38
  • Here is an example with a simpler algorithm: https://pastebin.com/911PyLca. Once again, the function with an if statement is faster. – sfbea Feb 27 '21 at 14:41
  • As a further note, adding a second if statement to the AABB test code reduced performance to roughly that of the function without the if statement (the second was ```if y == 0 { return false }```, which also will not evaluate to true given the test parameters). – sfbea Feb 27 '21 at 15:06
  • I tested your most minimal version on my machine and the version with `if` is **slower** for me by about 15%, so it's got to be something more specific. Do you get the same results without `target-cpu=native`? Maybe someone else with a Windows machine can try to replicate them. Or maybe it's some Ryzen-specific feature. – trent Feb 27 '21 at 15:21
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229294/discussion-between-trentcl-and-sfb-dragon). – trent Feb 27 '21 at 15:26
  • It could be something like the version with the `if` getting better code alignment. Can you reproduce a performance change outside of a microbenchmark? ~6ns is pretty fast and subject to noise from things like this. – Tavian Barnes Feb 27 '21 at 21:18
  • See the paste in another comment for a different example (AABB-point test) with the same results. I have run these benchmarks repeatedly with the same results, and re-ordered them for good measure. The one version is definitely slower on my machine, but another user has tested it on theirs with different results, so the issue is likely dependent on toolchain and CPU. – sfbea Feb 27 '21 at 23:10
