Strange performance drop after innocent changes to a trivial program

Question

Imagine you want to count how many non-ASCII chars a given char[] contains. Imagine, the performance really matters, so we can skip our favorite slogan.

The simplest way is obviously

int simpleCount() {
    int result = 0;
    for (int i = 0; i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}

Then you think that many inputs are pure ASCII and that it could be a good idea to deal with them separately. For simplicity assume you write just this

private int skip(int i) {
    for (; i < string.length; i++) {
        if (string[i] >= 128) break;
    }
    return i;
}

Such a trivial method could be useful for more complicated processing and here it can't do no harm, right? So let's continue with

int smartCount() {
    int result = 0;
    for (int i = skip(0); i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}

It's the same as simpleCount. I'm calling it "smart" as the actual work to be done is more complicated, so skipping over ASCII quickly makes sense. If there's no or a very short ASCII prefix, it can costs a few cycles more, but that's all, right?

Maybe you want to rewrite it like this, it's the same, just possibly more reusable, right?

int smarterCount() {
    return finish(skip(0));
}

int finish(int i) {
    int result = 0;
    for (; i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}

And then you ran a benchmark on some very long random string and get this The parameters determine the ASCII to non-ASCII ratio and the average length of a non-ASCII sequence, but as you can see they don't matter. Trying different seeds and whatever doesn't matter. The benchmark uses caliper, so the usual gotchas don't apply. The results are fairly repeatable, the tiny black bars at the end denote the minimum and maximum times.

Does anybody have an idea what's going on here? Can anybody reproduce it?

FYI, Java microbenchmarks are difficult to get right: http://www.ibm.com/developerworks/java/library/j-jtp12214/ — Paul Draper, Nov 03 '13 at 09:43
I don't understand how the smart algorithm is smarter than the simple one. It simply uses two loops instead of one, and each loop does basically the same thing: comparing every character with 256. So the algorithm is more complex, and thus less optimizable. — JB Nizet, Nov 03 '13 at 09:59
@JB Nizet: But these are separate methods, so they can be optimized independently. And they're also very trivial, even when everything gets inlined. Actually, the original problem is more complicated and special-casing ASCII would help. — maaartinus, Nov 03 '13 at 10:08
You're doing it the right way: you measure, using good tools, the performance of every algorithm. So just pick the one that is the fastest. Trying to guess how the JIT optimizes the code is (at least to me, but I guess to almost everyone) an impossible task. — JB Nizet, Nov 03 '13 at 10:46
As an aside, ASCII ends at 127, not 255. In unicode, the 128-255 range is the Latin-1 Supplement block. — Tom Anderson, Nov 03 '13 at 10:50
You don't have to guess how the JIT optimises the code - you can use the [-XX:+PrintAssembly](https://wikis.oracle.com/display/HotSpotInternals/PrintAssembly) flag to find out exactly what is being executed. I suspect the real challenge will be understanding why some particular bit of compiled code is fast or slow. — Tom Anderson, Nov 03 '13 at 11:05
Could it be that Java is smart enough to optimize array access not to include boundary checks in simpleCount? — Sami Korhonen, Nov 03 '13 at 11:15
Please separately benchmark `finish` and `skip` and show where the time is going. Is `skip` really as fast as you think it is? — Alex D, Nov 03 '13 at 14:55
@Alex D: The point is that `skip` bails out after a few chars (the probability that let's say all first 1000 chars are ASCII is negligible) and there is 1000000 chars in total. So whatever `skip` does and however long it takes, it can't matter. — maaartinus, Nov 03 '13 at 15:02
How.long are the very long Strings? Do they still fit in CPU cache? — jboi, Nov 03 '13 at 15:04
I can see only one difference. `timeSmart` subtracts `-1` in every loop. That would explain the performance difference to `timeSimple` but than, why is `timeSmarter` even slower? — jboi, Nov 03 '13 at 15:12
@jboi: They're 1000000 chars, see the linked [benchmark](https://dl.dropboxusercontent.com/u/4971686/published/maaartin/so/NonAsciiCounterBenchmark.java). You mean probably `skip` rather then `timeSmart` and the `-1` doesn't belong there, fixed (no different result). — maaartinus, Nov 03 '13 at 15:21
@maaartinus, you may be right, but when you are analyzing the performance of a program, and especially when the results are not as you expect, it is *never* a good idea to assume. Check everything, assume nothing. Even if you are right this time, this frame of mind will serve you well in the future when troubleshooting things. — Alex D, Nov 03 '13 at 17:48
Try this article: https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination — Sami Korhonen, Nov 04 '13 at 13:48
Won't `for ( int result = 0, i = 0; i < string.length ; i++ ) { result += ... }` do `string.length` additions and assignments? Since this _is_ a micro-optimization, it seems like it would make sense to avoid that extra work and just do `if ( ... ) result++;`. Then you only do O(`result`) additions and assignments. — Joshua Taylor, Nov 11 '13 at 04:19
@Joshua Taylor: The real program does it, this was a simplified example.... and actually the problem cause, see the accepted answer. The JVM is free to do such optimizations (it's not volatile) and usually does it, but not this time. — maaartinus, Nov 11 '13 at 04:36

score 3 · Answer 1 · edited Nov 03 '13 at 14:50

3

My tentative guess would be that this is about branch prediction.

This loop:

for (int i = 0; i < string.length; i++) {
    result += string[i] >= 128 ? 1 : 0;
}

Contains exactly one branch, the backward edge of the loop, and it is highly predictable. A modern processor will be able to accurately predict this, and so fill its whole pipeline with instructions. The sequence of loads is also highly predictable, so it will be able to pre-fetch everything the pipelined instructions need. High performance results.

This loop:

for (; i < string.length - 1; i++) {
    if (string[i] >= 128) break;
}

Has a dirty great data-dependent conditional branch sitting in the middle of it. That is much harder for the processor to predict accurately.

Now, that doesn't entirely make sense, because (a) the processor will surely quickly learn that the break branch will usually not be taken, (b) the loads are still predictable, and so just as pre-fetchable, and (c) after that loop exits, the code goes into a loop which is identical to the loop which goes fast. So i wouldn't expect this to make all that much difference.

edited Nov 03 '13 at 14:50

maaartinus

44,714
32
161
320

answered Nov 03 '13 at 11:00

Tom Anderson

46,189
17
92
133

That's good point... except for that the second loop gets terminated after a few chars and thus should play no role. – maaartinus Nov 03 '13 at 11:13
Hmm, yes. Have you experimentally verified the skip loop really does only run for a few characters? – Tom Anderson Nov 03 '13 at 21:24
Yes, I have. Did you consider that `string[i] >= 128 ? 1 : 0` can (but needn't) be translated using a branch? – maaartinus Nov 08 '13 at 09:50
Even if it was compiled as a branch, i would guess (this is all guesswork i'm afraid!) that it would be a branch that would not pose much of a problem for the processor: it would be a forward branch with a short distance to the rejoining of the two arms, so the processor should be able to speculate down both sides. – Tom Anderson Nov 08 '13 at 11:19
We don't need to speculate, I was curious and [investigated it here](http://stackoverflow.com/questions/19689214/strange-branching-performance). The JVM quite often fails to show its best. – maaartinus Nov 08 '13 at 11:24

score 3 · Accepted Answer · edited Nov 08 '13 at 11:24

3

Got it.

The difference is in the possibility for the optimizer/CPU to predict the number of loops in for. If it is able to predict the number of repeats up front, it can skip the actual check of i < string.length. Therefore the optimizer needs to know up front how often the condition in the for-loop will succeed and therefore it must know the value of string.length and i.

I made a simple test, by replacing string.length with a local variable, that is set once in the setup method. Result: smarterCount has runtime of about simpleCount. Before the change smarterCount took about 50% longer then simpleCount. smartCount did not change.

It looks like the optimizer looses the information of how many loops it will have to do when a call to another method occurs. That's the reason why finish() immediately ran faster with the constant set, but not smartCount(), as smartCount() has no clue about what i will be after the skip() step. So I did a second test, where I copied the loop from skip() into smartCount().

And voilà, all three methods return within the same time (800-900 ms).

Different runtimes

edited Nov 08 '13 at 11:24

Tom Anderson

46,189
17
92
133

answered Nov 04 '13 at 14:34

jboi

11,324
4
36
43

By "local variable, that is set once in the setup method" you mean a field, right? I extended [the benchmark](https://dl.dropboxusercontent.com/u/4971686/published/maaartin/so/NonAsciiCounterBenchmark.java) and got [some results](https://microbenchmarks.appspot.com/runs/218eb3e6-be40-4f38-9d93-4f78dca52360) showing that you're probably right. Agreed? – maaartinus Nov 08 '13 at 09:47
Field: Yes, that's what I mean. Do you also have the source code for `timeSmarter2` ? – jboi Nov 08 '13 at 10:35
Sure, see the [first link](https://dl.dropboxusercontent.com/u/4971686/published/maaartin/so/NonAsciiCounterBenchmark.java) in my comment above. I cached it (and the `string` too) in a local variable instead; this is something the JVM should do itself! – maaartinus Nov 08 '13 at 10:42
Ok, found it. Looks like the implementation of my changes, agreed. Really interesting, how loops (especially the condition) is optimizied. – jboi Nov 08 '13 at 10:54

Strange performance drop after innocent changes to a trivial program

2 Answers2