
I just came across this article: Compute the minimum or maximum of two integers without branching

It starts with "[o]n some rare machines where branching is expensive...".
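
For concreteness, the hack the article describes looks roughly like this when ported to Java (my quick port of one common variant; note that it is only correct when `x - y` doesn't overflow):

```java
// Branchless minimum: one common variant of the article's hack.
// Caveat: only correct when x - y does not overflow.
static int branchlessMin(int x, int y) {
    int diff = x - y;
    int mask = diff >> 31;    // all ones if x < y, all zeros otherwise
    return y + (diff & mask); // y + (x - y) == x when x < y, else y
}
```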

I used to think that branching is always expensive as it often forces the processor to clear and restart its execution pipeline (e.g. see Why is it faster to process a sorted array than an unsorted array?).

This leaves me with a couple of questions:

  • Did the writer of the article get that part wrong? Or was this article maybe written in a time before branching was an issue (I can't find a date on it).
  • Do modern processors have a way to complete minimal branches like the one in (x < y) ? x : y without performance degradation?
  • Or do all modern compilers simply implement this hack automatically? Specifically, what does Java do? Especially since its Math.min(...) function is just that ternary statement...

1 Answer


Did the writer of the article get that part wrong? Or was this article maybe written in a time before branching was an issue (I can't find a date on it).

The oldest comment is five years old, so this is no hot news. However, unpredictable branching was expensive five years ago and it still is. If anything, it has gotten worse: modern CPUs can do much more per cycle, so a mispredicted branch wastes more work.

But in a sense, the writer is right: the majority of CPUs are not found in our PCs and servers but in embedded devices, where the situation differs.

Do modern processors have a way to complete minimal branches like the one in (x < y) ? x : y without performance degradation?

Yes and no. AFAIK Math.max always gets translated to a conditional move, which means no branching. Your own max may or may not use one, depending on the statistics the JVM has collected.

There's no silver bullet. With predictable outcomes, branching is faster. Finding out exactly what patterns the CPU recognizes is hard. The JVM simply looks at how often a branch gets taken and uses a magic threshold of about 18%. See my own question and answer for details.
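
If you want to check this on your own machine, a micro-benchmark along these lines works (a sketch, with names of my own invention; it assumes JMH is on the classpath):

```java
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class MaxBench {
    int[] left = new int[65536];
    int[] rightPredictable = new int[65536]; // always smaller: trivial for the predictor
    int[] rightRandom = new int[65536];      // larger about half the time: worst case for a branch

    @Setup
    public void setup() {
        Random r = new Random(42);
        for (int i = 0; i < left.length; i++) {
            left[i] = (1 << 20) + r.nextInt(1 << 20); // in [2^20, 2^21)
            rightPredictable[i] = r.nextInt(1 << 20); // always < left[i]
            rightRandom[i] = r.nextInt(1 << 22);      // ~50% chance of being larger
        }
    }

    // A verbatim copy of Math.max: being a copy, it gets no intrinsic
    // treatment, so the JIT compiles it from the profile it collected.
    static int myMax(int a, int b) {
        return (a >= b) ? a : b;
    }

    @Benchmark
    public long intrinsicRandom() {
        long sum = 0;
        for (int i = 0; i < left.length; i++) sum += Math.max(left[i], rightRandom[i]);
        return sum;
    }

    @Benchmark
    public long copiedRandom() {
        long sum = 0;
        for (int i = 0; i < left.length; i++) sum += myMax(left[i], rightRandom[i]);
        return sum;
    }

    @Benchmark
    public long copiedPredictable() {
        long sum = 0;
        for (int i = 0; i < left.length; i++) sum += myMax(left[i], rightPredictable[i]);
        return sum;
    }
}
```

To see what the JIT actually emitted, run with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly` (this needs the hsdis disassembler installed).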

Or do all modern compilers simply implement this hack automatically? Specifically, what does Java do? Especially since its Math.min(...) function is just that ternary statement...

It's actually a compiler intrinsic: whenever the JIT compiler sees this very method called, it handles it specially. When you copy the method, the copy gets no special treatment.

In this case, the intrinsic is not very useful, as it's something that gets heavily optimized anyway. For methods like Long#numberOfLeadingZeros, the intrinsic is essential: the pure-Java code is rather long and slow, while modern CPUs can do it in a single cycle.
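
For a sense of scale, the pure-Java fallback is a cascade of shift-and-test steps along these lines (a Hacker's Delight style sketch, close to but not necessarily identical to the JDK source), all of which the intrinsic collapses into a single instruction such as `lzcnt` on x86 or `clz` on ARM:

```java
// Roughly the pure-Java fallback (Hacker's Delight style); the intrinsic
// replaces this entire cascade with one machine instruction.
static int leadingZeros(long i) {
    if (i == 0) return 64;
    int n = 1;
    int x = (int) (i >>> 32);
    if (x == 0) { n += 32; x = (int) i; }
    if (x >>> 16 == 0) { n += 16; x <<= 16; }
    if (x >>> 24 == 0) { n += 8;  x <<= 8; }
    if (x >>> 28 == 0) { n += 4;  x <<= 4; }
    if (x >>> 30 == 0) { n += 2;  x <<= 2; }
    n -= x >>> 31;
    return n;
}
```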

  • Good catch on the embedded devices. It also looks like the author simply copied the sentence from the Stanford page that he links to at the end. And **that** article not only seems to have been written in 1997 (again, no clear date, just a copyright range), but it also qualifies the statement further as "... rare machines where branching is very expensive **and no condition[al] move instructions exist** ...". – Markus A. Oct 01 '14 at 18:23
  • I just noticed. Why is `numberOfLeadingZeros` implemented using a number of branches, while it can be expressed as `64-highestBit(n)` which can be calculated as `bitCount(highestOneBit(n) - 1) + 1` both of which are series of binary operations without any branch. ...? – Mark Jeronimus Oct 02 '14 at 19:44
  • @MarkJeronimus No idea, but maybe the branches could be replaced by conditional moves? And maybe just nobody cared, as on modern CPUs it gets replaced by `lzcnt` anyway and on weak CPUs the branches may be cheaper. You could copy the methods (so that the intrinsics don't get used) and benchmark it (but please use JMH or Caliper, otherwise it's just random noise). I'd be interested, but I really should do other things. – maaartinus Oct 02 '14 at 19:55
  • The penalty of each branch may amount to more operations as the machines become more aggressive, but that's not really an increase in penalty since the alternative is simply not performing that extra work anyway. On the other hand, branch predictors *have* improved significantly over time. – Leeor Oct 05 '14 at 16:47
  • @Leeor I disagree. An alternative in one sense is to predict correctly. In another sense, an alternative is to use a conditional move (not always applicable, but it matches the question). But yes, predictors have improved and pipelines have gotten shorter (depending on what we compare), so it has gotten both better and worse. I'm curious how it develops in the future. – maaartinus Oct 05 '14 at 17:33
  • Replace "alternative" with "baseline", then. I'm just saying that you're not worse off just because you're running ahead faster (even if you could do better yet with conditional moves). – Leeor Oct 06 '14 at 00:15