
In this question, I'd like to challenge the common lore on how to measure the performance of Java code. The usual approach works along these lines:

long start = System.nanoTime();

for( int i=0; i<SOME_VERY_LARGE_NUMBER; i++) {
    ...do something...
}

long duration = System.nanoTime() - start;
System.out.println( "Performance: " 
    + new BigDecimal( duration ).divide( 
      new BigDecimal( SOME_VERY_LARGE_NUMBER ), 3, RoundingMode.HALF_UP ) );

"Optimized" versions move the calls to System.nanoTime() into the loop, growing the error margin since System.nanoTime() takes much longer (and is way less predictable in the runtime behavior) than i ++ and the comparison.

My criticism is:

This gives me the average runtime, but that value depends on factors I'm not really interested in, like the system load while the test loop was running or jumps when the JIT or GC kicks in.

Wouldn't this approach be (much) better in most cases?

  1. Run the code to be measured often enough to force JIT compilation.
  2. Run the code in a loop and measure the execution times. Remember the smallest value and abort the loop when this value stabilizes.

My rationale is that I usually want to know how fast some code can be (a lower bound). Any code can be made arbitrarily slow by external events (mouse movements, interrupts from the graphics card because you have an analog clock on the desktop, swapping, network packets, ...), but most of the time I just want to know how fast my code can be under perfect circumstances.

It would also make performance measurements much faster since I wouldn't have to run the code for seconds or minutes to cancel out unwanted effects.
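
To make this concrete, here is a minimal sketch of what I have in mind; the warm-up count and the 100-run stabilization window are arbitrary numbers for illustration, not recommendations:

public class BestTimeBenchmark {

    static final int WARM_UP_RUNS = 20000;

    // Placeholder for the code whose lower bound we want to estimate.
    // A real benchmark must also make sure the JIT cannot eliminate it as dead code.
    static void codeUnderTest() {
        Math.sqrt(12345.678);
    }

    public static void main(String[] args) {
        // Step 1: run often enough to (hopefully) trigger JIT compilation.
        for (int i = 0; i < WARM_UP_RUNS; i++) {
            codeUnderTest();
        }

        // Step 2: measure single runs, remember the smallest value and stop
        // once it hasn't improved for a while.
        long best = Long.MAX_VALUE;
        int runsSinceImprovement = 0;
        while (runsSinceImprovement < 100) {
            long start = System.nanoTime();
            codeUnderTest();
            long duration = System.nanoTime() - start;
            if (duration < best) {
                best = duration;
                runsSinceImprovement = 0;
            } else {
                runsSinceImprovement++;
            }
        }
        System.out.println("Best observed time: " + best + " ns");
    }
}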

Can someone confirm/debunk this?

Aaron Digulla
  • I often do something like that, but manually. I repeat the test many times and go with the lowest number. I have the same reasoning as you. Of course, if your alternative solutions differ in the amount of pressure on the GC, that should NOT be factored out. – Marko Topolnik May 11 '12 at 10:09
  • Why would an "optimized" version move calls *into* the loop? No compiler would do this. – Ira Baxter May 11 '12 at 10:10
  • @IraBaxter: I meant human "optimization" to get a better "precision" (since the new measurement wouldn't count the loop). But as often, the intuition is wrong here. – Aaron Digulla May 11 '12 at 10:22
  • @MarkoTopolnik: I agree with keeping the GC in the picture when it makes sense but is there a way to tell this to a performance test? – Aaron Digulla May 11 '12 at 10:23
  • No, but it invalidates the "best time" approach. It is indeed hard to predict the effects of GC in production, but ignoring it means failing to differentiate between a solution with perhaps a slightly worse best time, but better in terms of less GC'ing. – Marko Topolnik May 11 '12 at 10:27
  • You want to warm up the JIT compiler, run the code in a loop, but want to make performance measurements faster? Conflicting goals, imho. To factor out external programs, you can, on unixoid systems, use `time`, which measures the time consumed by your thread. The not-so-new VisualVM tool has such a thing too. I usually parameterize my performance measurements to increment the sample size, because it oscillates on the low end, and the size where this happens differs. Having big samples is usually more reliable, with runtimes of several seconds. However: the samples have to be in the range of real life – user unknown May 11 '12 at 10:28
  • @userunknown: Re JIT: Why do I measure performance? Usually, I want to know if version A is faster than B. So the solution should be able to reduce unwanted factors like JIT. My gut feeling is that statistical methods should be able to determine the sample size automatically. There is no point in running the tests 1 million times when 10'000 is enough, or running them for 10 minutes if the result stabilizes after 10 seconds. – Aaron Digulla May 11 '12 at 10:34
  • If you know that your code is only run on sample sizes of 10 000, alright. Bigger sizes might lead to cache misses above a certain point, decreasing the speed while it seemed stable before. But me, I have a tool to print little automated graphs, mainly to see where it stabilizes, and whether it stabilizes from some point on. I don't understand why JIT is an unwanted factor. – user unknown May 11 '12 at 10:41
  • Basically, it's damned difficult to measure the performance of, say, a small segment of C code, and, between GC and JIT and various delayed initializations, nearly impossible to measure the performance of a small segment of Java code. You CAN measure the performance of an *application* fairly well (including all disk and GC activity), but even then you must studiously control the test environment. – Hot Licks May 11 '12 at 11:58

3 Answers


I think what you're proposing is pretty reasonable, with some tweaks:

1) I would report the median -- or a bunch of percentiles -- rather than the minimum. If your code places a lot of strain on the garbage collector, simply taking the minimum can easily fail to pick up on that (all it takes is for a single iteration to fit between two consecutive GC pauses).

2) In many cases it makes sense to measure CPU time rather than wall-clock time. This takes care of some of the impact of having other code running on the same box.

3) Some benchmarking tools use two levels of loops: the inner loop repeatedly performs the operation, and the outer loop looks at the clock before and after the inner loop. The observations are then aggregated across the iterations of the outer loop.
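
A rough sketch of points 2) and 3) combined; the batch size, batch count and the choice of ThreadMXBean for CPU time are arbitrary assumptions:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

// Time batches of the operation using per-thread CPU time and report percentiles.
// BATCHES and BATCH_SIZE are arbitrary values chosen for illustration.
public class TwoLevelLoopTimer {

    static final int BATCHES = 200;      // outer loop: one observation per batch
    static final int BATCH_SIZE = 10000; // inner loop: repetitions per observation

    static void operation() {
        Math.sqrt(12345.678);            // placeholder for the code under test
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isCurrentThreadCpuTimeSupported()) {
            throw new IllegalStateException("CPU time measurement not supported on this JVM");
        }
        long[] nsPerOp = new long[BATCHES];

        for (int b = 0; b < BATCHES; b++) {
            long start = mx.getCurrentThreadCpuTime();   // CPU time, not wall-clock
            for (int i = 0; i < BATCH_SIZE; i++) {
                operation();
            }
            nsPerOp[b] = (mx.getCurrentThreadCpuTime() - start) / BATCH_SIZE;
        }

        Arrays.sort(nsPerOp);
        System.out.println("median ns/op: " + nsPerOp[BATCHES / 2]);
        System.out.println("90th percentile ns/op: " + nsPerOp[(int) (BATCHES * 0.9)]);
    }
}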

Finally, the following gives a very good overview of JVM-specific issues to be aware of: How do I write a correct micro-benchmark in Java?

NPE

You can use the -XX:CompileThreshold JVM option to specify when JIT kicks in. Then you can "warm up" your test by running a loop greater than the CompileThreshold before running the timed loop.
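
For example (a sketch only; the threshold of 1000 and the loop counts are made-up values, and the default threshold differs between client and server VMs):

// Sketch: assumes the JVM is started with -XX:CompileThreshold=1000
// (1000 is only an example value).
public class WarmUpThenMeasure {

    static void codeUnderTest() {
        Math.sqrt(12345.678); // placeholder workload
    }

    public static void main(String[] args) {
        final int compileThreshold = 1000; // should match the -XX flag
        final int measuredRuns = 100000;

        // Warm up well past the compile threshold so the method gets JIT-compiled.
        for (int i = 0; i < compileThreshold * 10; i++) {
            codeUnderTest();
        }

        // Only now run the timed loop.
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            codeUnderTest();
        }
        long duration = System.nanoTime() - start;
        System.out.println("avg ns/run: " + duration / (double) measuredRuns);
    }
}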

Edward Samson

I'd run the SOME_VERY_LARGE_NUMBER loop 50 times and calculate the average of the best-performing loop. This is what's usually done in other benchmarks, not just code micro-benchmarks.

I'd also argue that GC-induced performance problems are often part of the code's behavior. You probably shouldn't take the GC out of the equation, because a routine that allocates a lot of memory should pay a certain price for that in the benchmark. The proposed approach factors in an average GC cost per call if you choose SOME_VERY_LARGE_NUMBER large enough.

About your suggestion: all timers have limited precision, so a short routine may well complete within zero clock ticks. Your algorithm would then conclude that the routine runs in zero time, which is obviously not correct.
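
One way around this is the kind of calibration mentioned in the comment below: repeat the routine until a single measurement is well above the timer resolution. A sketch, where the 10,000 ns floor and the doubling strategy are assumptions for illustration:

// Double the repetition count until one measured block is comfortably above
// the timer resolution, then report the time per run.
public class CalibratedTimer {

    static void codeUnderTest() {
        Math.sqrt(12345.678); // placeholder workload
    }

    public static void main(String[] args) {
        long reps = 1;
        long duration;
        do {
            reps *= 2;
            long start = System.nanoTime();
            for (long i = 0; i < reps; i++) {
                codeUnderTest();
            }
            duration = System.nanoTime() - start;
        } while (duration < 10000); // keep doubling until the block is measurable

        System.out.println(reps + " repetitions took " + duration + " ns ("
                + duration / (double) reps + " ns/run)");
    }
}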

Simon
  • Re timer precision: After warm-up, I run the measured method a couple of times until I find out how often I have to run it to get a runtime > 10'000 ns. – Aaron Digulla May 11 '12 at 13:24