
Disclaimer: I don't have much experience with reverse-engineering byte-code, so please don't be too harsh on me if that could "easily" answer my question.

On modern processors, branching can be extremely expensive if prediction fails (see Why is it faster to process a sorted array than an unsorted array?).

Let's say I have some short-circuit evaluation in Java like this:

if (condition && (list!=null) && (list.size()>0)) /* Do something */ ;

Is that basically equivalent to a bunch of branches like this:

if (condition) {
    if (list!=null) {
        if (list.size()>0) {
            // Do something
        }
    }
}

or does Java have some other way to do the short-circuiting more cleverly?
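
My naive expectation is that the `&&` chain compiles down to one conditional jump per operand, something like this schematic bytecode (illustrative only, not verbatim `javap -c` output; the local-variable slots and the label are made up):

iload_1                                  // push 'condition'
ifeq SKIP                                // false? jump past the body
aload_2                                  // push 'list'
ifnull SKIP                              // null? jump past the body
aload_2                                  // push 'list' again
invokeinterface java/util/List.size ()I  // call list.size()
ifle SKIP                                // size() <= 0? jump past the body
...                                      // "Do something"
SKIP: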

In other words, would it be better to avoid at least one branching by rewriting the line like this:

if ((condition & (list!=null)) && (list.size()>0)) /* Do something */ ;

since the simple `list!=null` check is much less expensive than a potentially ill-predicted branch?

(Clearly I can't get rid of the second && without risking a NullPointerException.)
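
To spell out the semantics I'm relying on: `&` always evaluates both of its operands, while `&&` skips the right-hand side when the left is false. Here is a minimal sketch of the difference (the class and method names are just for illustration):

import java.util.Arrays;
import java.util.List;

public class ShortCircuitDemo {
    static void check(boolean condition, List<String> list) {
        // '&' evaluates BOTH operands; safe here, because 'list != null'
        // never dereferences list.
        boolean left = condition & (list != null);

        // '&&' short-circuits: list.size() only runs when 'left' is true,
        // and 'left' can only be true when list != null.
        if (left && list.size() > 0) {
            System.out.println("Do something");
        }

        // By contrast, 'left & (list.size() > 0)' would always call size()
        // and throw a NullPointerException when list is null.
    }

    public static void main(String[] args) {
        check(true, null);               // prints nothing, throws nothing
        check(true, Arrays.asList("x")); // prints "Do something"
    }
}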

Now, before I get ripped to shreds with statements like "premature optimization is the root of all evil!", please keep in mind that this is a choice between general coding habits (always use short-circuiting vs. never use short-circuiting unless required) that will affect pretty much all of my code, so making sure I pick the right habit here is definitely worth spending some thought on.

Markus A.
  • 1) premature optimization *IS* the root of all evil! 2) use of `&&` is a Very Good habit to get into. – paulsm4 Jun 17 '16 at 22:25
  • @paulsm4 1) See above... 2) Are you SURE? I used to think so, too, until this thought hit me. Do you actually KNOW how Java implements the short-circuiting and how it affects performance? – Markus A. Jun 17 '16 at 22:26
  • If `condition` is false then why do you want to check if the list is null? – Jonny Henly Jun 17 '16 at 22:28
  • @JonnyHenly Because if the short-circuiting implies a conditional goto to skip over the null-check, that jump in the execution pointer (if not predicted correctly by the processor) flushes out the processor's execution pipeline and wastes many tens of clock cycles to recover (see link in the question). So the potentially unnecessary null-check might be significantly cheaper on average. – Markus A. Jun 17 '16 at 22:30
  • 4
    1) Yes, I do know how how Java implements short circuiting, and 2) I sincerely believe that worrying about extremely low-level, highly CPU-specific behavior in a language that does everything possible to abstract away the underly platform is ... to be charitable ... "misguided". Pointless. A waste of time. Almost certain to vary from situation to situation. If in doubt, I encourage you to try out a few benchmarks and post the results :) – paulsm4 Jun 17 '16 at 22:34
  • See also this link: [Why is branch prediction faster than no branch at all?](http://stackoverflow.com/questions/16879071/why-is-branch-prediction-faster-than-no-branch-at-all) – paulsm4 Jun 17 '16 at 22:38
  • @paulsm4 1) How does it do it? :) Could you post this as an answer? 2) Amen... And that's why most modern software runs only marginally faster on a ten-core 3.4GHz CPU than the equivalent piece of classic software did on a 486 with 33MHz... – Markus A. Jun 17 '16 at 22:38
  • The conditional will involve a condition evaluation and a branch, no matter what. It's not clear how doing multiple condition evaluations and a branch is ever going to be faster than that. – pvg Jun 17 '16 at 23:02
  • @pvg If the processor predicts `condition` to be `false` when in fact it is `true`, for example, and pre-loads its pipeline with the code following the entire statement, it needs to then flush the pipeline to include the `list!=null` check and continue from there. The cost of a branch mis-prediction is independent of whether it's a false positive or false negative (if I understand correctly). – Markus A. Jun 17 '16 at 23:07
  • Well, for one thing I can't actually tell the difference between your two expressions, they seem identical. For another, it's pretty hard to reason about what the CPU does by staring at Java code. CPUs execute speculatively and do all sorts of things. I'd be very surprised if you can actually come up with a non-short-circuit Java expression that's faster than the short-circuit one. – pvg Jun 17 '16 at 23:10
  • @pvg The difference is that the second expression involves only two branches (one conditional upon `condition & (list!=null)` and the other conditional upon `list.size()>0`) while the first expression involves a branch after every one of the parts of the expression, i.e. 3 branches. Each one of them may be (independently) mis-predicted. Giving the processor one less chance to screw up sounds like a good idea to me, especially since it might actually be FREE if the added null-check consumes only one clock cycle, just like the branch we just eliminated would have. – Markus A. Jun 17 '16 at 23:14
  • 2
    You have now spent how much time researching and chatting about this, when in all likelihood it makes no *noticeable* difference to your code? Measurable maybe, but noticeable? The concept of *premature optimization* is about spending your very valuable time where it matters, so by wasting time on something that very likely *won't* matter, you'll have that much less time to spend on the code that really matters, before your deadline. Sure, if you have spare time, you can try to optimize, but how to you *know* that you attempts are helping and not hurting performance?? – Andreas Jun 17 '16 at 23:19
  • 2
    Oh I see, your & is not a typo. You will, no matter what, have a branch. Your theory seems to be that a _possible_ branch mis-prediction is necessarily more expensive than whatever other operation. That's quite unlikely, in the general case. You can try to find out but it would be pretty hard, from a high level language. I think you're vastly overestimating the misprediction frequency and cost too - the answer you link is a synthetic benchmark that compares optimal vs a suboptimal case. – pvg Jun 17 '16 at 23:28
  • @pvg I'm not saying it's *necessarily* more expensive. I'm just trying to understand *if* and *when* it matters... And, yes, you will, no matter what, have a branch. But I want to know whether it really is just one branch, or maybe two or three. – Markus A. Jun 18 '16 at 00:23
  • 3
    That really depends on whether the branches are likely to be correctly predicted or not which is entirely unlike the synthetic benchmark. You can try to measure an at least mildly realistic version of this where you put two operations of identical cost on your &&, make one of them false and other one variable. Swap around to see if you get a measurable effect. But again, fiddling with this in Java is likely a wild goose chase. – pvg Jun 18 '16 at 00:47
  • 1
    There is a good question in there somewhere, but it's up to you to make it one. First, as others have pointed out, while everyone knows it's better if your memory access is sequential and your branches predictable, it's hard to control for low-level CPU specifics from a high level languge. These effects can be exposed in synthetic benchmarks but it's harder to reason about them from realistic code. Second, speaking of homework, you ought to do yours and read up on branch prediction so you have a basis against which to evaluate your assumptions. – pvg Jun 18 '16 at 04:17
  • 2
    And lastly, and most importantly, performance analysis is only meaningful when coupled with measurement. Can you turn your question into something you have measured that others can try and help explain? Doing those things can help keep SO a site that's more like the one you want it to be rather than the one you're grumpy about. – pvg Jun 18 '16 at 04:18
  • I think the question in its current form suggests a certain misunderstanding - branch prediction is fundamentally statistical, the very point of the example you link. What would be a meaningful answer to your question as stated now? I don't mean a correct answer, just simply a logically consistent one. – pvg Jun 18 '16 at 04:46
  • 1
    @pvg I was really just looking for a slightly verbose yes or no answer. Something like: "I know how to do byte-code analysis, so I took a look. Here's what your condition is compiled into: ... As you can clearly see Java does need to use 3 conditional gotos to express the short-circuit evaluation." or " ... As you can clearly see, Java does XYZ clever thing instead, so you actually do end up with only one conditional goto." or "... As you can see, the two are actually identical, because Java realizes that the null-check is so cheap that it's not worth risking a mis-predicted branch instead." – Markus A. Jun 18 '16 at 04:52
  • 4
    Bytecode does not map 1-1 to CPU instructions. The same bytecode can be run interpreted, and JIT'ed with different degrees of optimization, that is, running different CPU instructions over the course of its lifetime. Hence all the 'you can't easily reason about this from Java code' talk. – pvg Jun 18 '16 at 04:59
  • 1
    You don't agree with it? Great! Write up a question about the things you've measured that substantiate your disagreement. If you want a sense of how far removed high-level language code from machine instructions dispatched by the CPU is, this can be an informative read: https://webkit.org/blog/3362/introducing-the-webkit-ftl-jit/. Or if you're short on time, you can just ask yourself why 'HotSpot' is named 'HotSpot'. – pvg Jun 18 '16 at 05:40
  • 1
    @pvg ... Which, of course, usually isn't feasible... But, for completeness, I will write up a benchmark for this next week when I have some time, although I'm pretty sure I can already tell you now what it'll look like. – Markus A. Jun 18 '16 at 06:24
  • 2
    It makes perfect sense if you consider the magnitude of the effects in question. If you're building a bike, you might want to try putting in , say, different ball-bearings to make it faster. You'd probably run some tests and measurements to find out. You'd have a hard time telling whether your bike is better or worse at being slowed down by Solar neutrinos, though. – pvg Jun 18 '16 at 06:41

2 Answers


Nothing here mentions any kind of branching. This is mere expression evaluation; a `&&` in an `if` statement is no different from a `&&` in any other expression. Would you ask the same question about the following code?

boolean isValid = condition && (list!=null) && (list.size()>0);
if (isValid) {
    ...
}

This is basically what happens: the expression gets evaluated, and then the branching happens.

Dici
  • True, it's not mentioned there. But the skip over the short-circuited conditions DOES have to be implemented somehow, and I can't really think of a way to do it other than with branching (at least in the general case). So, yes, I would ask the same question about the code you added. There, the branching would happen in the `boolean isValid = ...` line. – Markus A. Jun 17 '16 at 22:56
  • 2
    Fair point. However, I kind of agree with the others, you can't really assume that the branch prediction is going to be bad while it's supposed to be an optimization. Just because it makes things worse on a particular example does not mean you should worry about it. The very definition of *premature optimization* is to try to optimize before you know you need to optimize. This is not a good process. In general, you'll find your code runs fast enough and when it's not the case, you'll look for the bottlenecks. I bet you in 99% of the cases your bottleneck won't have anything to do with branches – Dici Jun 17 '16 at 23:05
  • Is it really "supposed to be an optimization"? Or is it just another semantic tool like ternary operations that basically just saves you some typing? If it's just a semantic tool, I think it makes perfect sense to try to understand what it stands for and when and how to use it efficiently. – Markus A. Jun 17 '16 at 23:16
  • 2
    I do a lot of mobile development where every clock cycle is not only precious in terms of your app running fast on slow devices, but also in terms of battery life consumed, even on the fastest devices available. So, "in general, you'll will find your code runs fast enough" is not a statement I am willing to subscribe to. No code is ever fast enough. I'm sure it was people with this *premature optimization*-phobia that wrote InkScape, for example. I'd be embarrassed to release software that runs so sluggish on a Core-i7 if Corel Draw could do the exact same things in real time on a 486! – Markus A. Jun 17 '16 at 23:54
  • 4
    Then you should probably code in binary, and I'm not even joking. There are so many things you do not control in a high-level language such as Java that I can't take your argument of "every clock cycle matters". If that was true, you would be worried about JIT compilation, garbage collection, all these things that happen outside of your control, and you'd end up using another language. Without seeing any of your code, I really think reducing the number of branches (imaginary or not) is not impactful at all compared to other possibilities of optimization. – Dici Jun 18 '16 at 01:59
  • ??? Just because there might be a few things I can't control means I shouldn't worry about the things I can control? Strange conclusion... That's like telling a drug company "Rather than worry about finding remedies for a cold (which doesn't matter much to begin with), you should instead try to clone a new species that is different enough from current humans that none of the current illnesses can infect it."... I realize, most people don't have time to worry about this problem, and that's fine... I'm not trying to claim this is a huge issue that everyone should care about... But I do care. – Markus A. Jun 18 '16 at 03:04
  • @MarkusA. What people are trying to explain is that you should NOT care, for your own sanity. And as someone said, if you're not satisfied by this answer you can always benchmark your code and confirm that this makes no difference in most cases. There are many other things that you cannot control and have far greater impact on performance. – Dici Jun 18 '16 at 11:17

I made a simple test:

import java.util.Random;

public class ShortCircuitTest
{
    public static void main(String[] args)
    {
        int sum = 0;
        Random rnd = new Random(1); // fixed seed, so every run sees the same data

        int[] a = new int[1000];
        int[] b = new int[1000];
        for (int i = 0; i < 1000; ++i)
        {
            a[i] = rnd.nextInt(100);
            b[i] = rnd.nextInt(100);
        }

        long started = System.nanoTime();

        for (int i = 0; i < 1000000; ++i)
        {
            for (int j = 0; j < 1000; ++j)
            {
                if (a[j] < 50 && b[j] < 5) // change "&& b[j] < 5"
                {
                    sum++;
                }
            }
        }

        long ended = System.nanoTime();
        // elapsed milliseconds, plus the sum so the loop cannot be optimized away
        System.out.println((ended - started) / 1000000 + "  " + sum);
    }
}

The results are quite random (times in milliseconds):

            &&      &
b[j] < 5    1450    1360
b[j] < 50   1330    1610
b[j] < 500  1310    1450
j < 50      2200    920
j < 500     1410    1730
j < 5000    1180    1040
j < i       1180    2050
i < j       2290    1450

Those are the lowest values from multiple runs, and I made sure they were repeatable. The actual times vary from run to run. As a rule of thumb, I would avoid being too fancy, stick to the `&&`, and hope that the optimizations "down there" do their best. There is a nice video about optimizations.

EDIT

As Dici pointed out, JVM warm-up should be done. It seems the first and second calls of the function differ from the rest. This held consistently even when I increased the loop count. So I retested and... got another mess. About 2x faster on average, but a mess again. And the optimizations were more unstable; there were often not one but two typical time values, sometimes differing by 50%. I looked at JMH, a nice framework. I could probably measure whether `&` is faster than `&&` if I really tried (on different systems, different hardware and so on, a lot of work). But that is not the question. The question is: if I replace `&&` with `&` in my program, can I expect it to be faster or slower? And the answer is: you cannot expect anything, you have to measure it.
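
For anyone who wants to redo this with JMH, a minimal sketch could look like the following (the class and method names are mine; JMH runs warm-up iterations by default, which addresses exactly the problem I ran into above):

import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class ShortCircuitBenchmark
{
    int[] a;
    int[] b;

    @Setup
    public void setup()
    {
        Random rnd = new Random(1); // fixed seed, same data in both benchmarks
        a = new int[1000];
        b = new int[1000];
        for (int i = 0; i < 1000; ++i)
        {
            a[i] = rnd.nextInt(100);
            b[i] = rnd.nextInt(100);
        }
    }

    @Benchmark
    public int shortCircuit()
    {
        int sum = 0;
        for (int j = 0; j < 1000; ++j)
        {
            if (a[j] < 50 && b[j] < 5) { sum++; }
        }
        return sum; // returning the result keeps the JIT from eliminating the loop
    }

    @Benchmark
    public int eager()
    {
        int sum = 0;
        for (int j = 0; j < 1000; ++j)
        {
            if (a[j] < 50 & b[j] < 5) { sum++; }
        }
        return sum;
    }
}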

EDIT2

I see it as a waste of time in this case, but to uphold my credibility I measured standard deviations (aka one sigma, shown in parentheses below). The values behaved well enough in a few runs and they behave well in thousands of runs, no surprise there (I did not show results after JVM warm-up, as they did not behave well and the statistics would be questionable). The funny thing is that all results are about 7% faster than those I measured previously, which is beyond five sigma for most values. From a few tries it seems that open web browser tabs affect the whole system's speed, but no, I am not going to gather statistics to confirm my observation.

            &&          &
b[j] < 5    1333(14)    1265(25)
b[j] < 50   1231(11)    1514(13)
b[j] < 500  1223(9)     1360(13)
j < 50      2069(74)    842(11)
j < 500     1294(12)    1631(17)
j < 5000    1089(9)     957(8)
j < i       1086(8)     1907(23)
i < j       2164(16)    1357(14)
Antonín Lejsek
  • Wow. What a mess... The results, that is, not your code, of course :)... So, it seems in most cases it's irrelevant, but sometimes it matters a lot, and it can actually go either way... But there must be something else going on, since some of the numbers within each column don't make any sense relative to each other. For example, how can it be that `j<50`, which returns `true` 5% of the time very predictably and quickly (`j` should just be available immediately), is much SLOWER than `b[j]<5`, which also is `true` 5% of the time, but unpredictably and requiring array indexing?!? – Markus A. Jun 20 '16 at 04:36
  • @MarkusA. I agree, it does not make sense. I tried to modify it with more randomness and tested it in C# and C++ too. The result: it still does not make sense. C++ seems closest to what one expects, at least. I have to repeat the idea from the video here - the only way to tell what is faster is to measure it. – Antonín Lejsek Jun 21 '16 at 02:22
  • 2
    No wonder why the results are random: this benchmark literally relies on randomness. A real benchmark should **1)** warm the JVM up, **2)** make sure the input will always be the same, for example by picking a specific seed for the `Random` instance, **3)** run each test several times to get statistical accuracy of the results (variance, confidence interval etc). If you want to do a real benchmark I recommend using a benchmark framework rather than doing it manually and get meaningless results – Dici Jun 25 '16 at 21:41
  • 1
    @Dici 2) What I did. 3) What I did. Applying statistics is a bit tricky, interval for example is quite useless, one value off due to GC or whatever ruins the result completely. 1) I did not do that, I expected many times repeated run would be enough, seemingly I was wrong. I would try to repair it. – Antonín Lejsek Jun 25 '16 at 22:47
  • 2
    Hem, statistical analysis is not pointless because of random effects such as GC, it is RELEVANT for these very reasons. The more you reproduce the same experiment, the more you eliminate the noise around your measure and are able to get significant results. We have no idea of how good are you measures because we have no confidence interval. It could be the case that the confidence intervals are huge, which would make the benchmark unsignificant. Also, I don't know how I missed the `new Random(1)`. You did that well, my bad :) – Dici Jun 26 '16 at 00:46
  • @Dici I agree it is not pointless, it just has its own problems. Let's say we measure 200ms ninety-nine times and 1000ms once. Standard statistics would tell you that the result is 208ms with an 80ms deviation. That is true, but we would rather like to know that the typical time is close to the best, and that it is 200ms. – Antonín Lejsek Jun 26 '16 at 14:18