
I have the following code:

import java.util.stream.IntStream;

public class BenchMark {
    public static void main(String[] args) {
        doLinear();
        doLinear();
        doLinear();
        doLinear();
    }

    private static void doParallel() {
        IntStream range = IntStream.range(1, 6).parallel();

        long startTime = System.nanoTime();
        int reduce = range
                .reduce((a, item) -> a * item).getAsInt();
        long endTime = System.nanoTime();
        System.out.println("parallel: " + reduce + " -- Time: " + (endTime - startTime));
    }

    private static void doLinear() {
        IntStream range = IntStream.range(1, 6);

        long startTime = System.nanoTime();
        int reduce = range
                .reduce((a, item) -> a * item).getAsInt();
        long endTime = System.nanoTime();
        System.out.println("linear: " + reduce + " -- Time: " + (endTime - startTime));
    }
}

I was trying to benchmark streams, but I noticed the execution time steadily decreasing as I called the same function again and again.

Output:

linear: 120 -- Time: 57008226
linear: 120 -- Time: 23202
linear: 120 -- Time: 17192
linear: 120 -- Time: 17802

Process finished with exit code 0

There is a huge difference between the first and second execution times.

I'm sure the JVM is doing some tricks behind the scenes, but can anybody help me understand what's really going on there?

Is there any way to avoid this optimization so I can benchmark the true execution time?

Siva
  • Look at [this question](http://stackoverflow.com/q/504103/2670892) for discussions of micro-benchmarking – greg-449 Aug 16 '15 at 08:36
  • @greg-449 Pretty good guide to benchmarking. But I still have trouble understanding the cause; any docs regarding this optimisation would help. – Siva Aug 16 '15 at 10:31
  • And also, when I increased the range to a very large number, I don't see a huge execution time difference. Seems this happens only for small ranges. – Siva Aug 16 '15 at 10:34

4 Answers


I'm sure the JVM is doing some tricks behind the scenes, but can anybody help me understand what's really going on there?

  1. The massive latency of the first invocation is due to the initialization of the complete lambda runtime subsystem. You pay this only once for the whole application.

  2. The first time your code reaches any given lambda expression, you pay for the linkage of that lambda (initialization of the invokedynamic call site).

  3. After some iterations you'll see additional speedup due to the JIT compiler optimizing your reduction code.
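Points 1–3 can be observed directly by timing the very first invocation of a lambda-using method against later ones. Below is a minimal standalone sketch (the class and method names are mine, not from the question):

```java
import java.util.stream.IntStream;

public class WarmupDemo {

    // Times one run of the same reduction the question benchmarks.
    static long timeOnce() {
        long start = System.nanoTime();
        int product = IntStream.range(1, 6).reduce((a, b) -> a * b).getAsInt();
        long elapsed = System.nanoTime() - start;
        if (product != 120) throw new AssertionError("unexpected result: " + product);
        return elapsed;
    }

    public static void main(String[] args) {
        // The first call pays lambda linkage and stream-pipeline setup.
        long first = timeOnce();

        // Repeated calls let the JIT compiler kick in.
        long warmed = 0;
        for (int i = 0; i < 10_000; i++) {
            warmed = timeOnce();
        }
        System.out.println("first: " + first + " ns, warmed: " + warmed + " ns");
    }
}
```

On a typical JVM the first measurement comes out orders of magnitude larger than the warmed one, matching the output in the question.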

Is there any way to avoid this optimization so I can benchmark the true execution time?

You are asking for a contradiction here: the "true" execution time is the one you get after warmup, when all optimizations have been applied. This is the runtime an actual application would experience. The latency of the first few runs is not relevant to the wider picture, unless you are interested in single-shot performance.

For the sake of exploration, you can see how your code behaves with JIT compilation disabled: pass `-Xint` to the `java` command. There are many more flags that disable various aspects of optimization.

Marko Topolnik
  • Don't you always have to compile your code to run it? If I build my project into a runnable jar, would this be considered an 'actual application'? And even with all optimizations applied, an application would still have to pay the 'linkage' cost you talk of, correct? – user2651804 Jul 05 '17 at 22:56
  • 1. You have to compile to bytecode, not machine code. 2. The 'actual application' is what you actually live with, not some out-of-context benchmark code. 3. Lambda linkage happens the first time your application executes a part of code containing the lambda expression. JIT compiler optimizations are applied after linkage is done. – Marko Topolnik Jul 06 '17 at 05:28

UPDATE: Refer to @Marko's answer for an explanation of the initial latency due to lambda linkage.


The higher execution time for the first call is probably a result of the JIT effect. In short, JIT compilation of the bytecode into native machine code occurs the first time your method is called. The JVM then attempts further optimization by identifying frequently called ("hot") methods and regenerating their code for higher performance.

Is there anyway to avoid this optimization so I can benchmark true execution time ?

You can certainly account for the JVM's initial warm-up by excluding the first few results. Then increase the number of repeated calls to your method to tens of thousands of iterations in a loop, and average the results.

There are a few more options that you might want to consider adding to your execution to help reduce noise, as discussed in this post. There are also some good tips in this post.
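A minimal sketch of that approach, discarding warm-up iterations and averaging the rest (the class name and iteration counts are illustrative, not from the answer):

```java
import java.util.stream.IntStream;

public class AveragedBenchmark {

    // One timed run of the reduction from the question.
    static long timeOnce() {
        long start = System.nanoTime();
        int product = IntStream.range(1, 6).reduce((a, b) -> a * b).getAsInt();
        long end = System.nanoTime();
        if (product != 120) throw new AssertionError();
        return end - start;
    }

    public static void main(String[] args) {
        final int warmup = 10_000;    // runs to discard
        final int measured = 50_000;  // runs to average

        for (int i = 0; i < warmup; i++) {
            timeOnce(); // let lambda linkage and JIT compilation settle
        }

        long total = 0;
        for (int i = 0; i < measured; i++) {
            total += timeOnce();
        }
        System.out.println("average after warm-up: " + (total / measured) + " ns");
    }
}
```

For serious measurements, a harness such as JMH handles warm-up, forking, and statistical reporting for you instead of hand-rolled loops like this.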

ivan.sim
  • When I increased the upper bound of the range from 6 to a huge number, there is no optimization effect. Seems this happens only for small ranges. – Siva Aug 16 '15 at 10:39
  • No, the first run is slower due to lambda linkage (initialization of the `invokedynamic` call site) plus the initialization of the complete `LambdaMetaFactory` subsystem. Later speedups are due to JITting. There is no JIT compilation of code entered for the first time. – Marko Topolnik Aug 16 '15 at 11:36
  • Also keep in mind that JIT compilation is an asynchronous process, thus not contributing to the latency of a method invocation. – Marko Topolnik Aug 16 '15 at 11:47
  • @MarkoTopolnik Thanks for the feedback. I have updated my answer to refer to yours. – ivan.sim Aug 16 '15 at 14:26

true execution time

There's no such thing as "true execution time". If you need to perform this task only once, the true execution time would be the time of the first test (along with the time to start up the JVM itself). In general, the time spent executing a given piece of code depends on many things:

  • Whether this piece of code is interpreted, or JIT-compiled by the C1 or C2 compiler. Note that these are not just three options: if you call one method from another, one of them might be interpreted while the other is C2-compiled.

  • For the C2 compiler: how this code was executed previously, i.e. what is in the branch and type profiles. A polluted type profile can drastically reduce performance.

  • Garbage collector state: whether it interrupts execution or not.

  • Compilation queue: whether the JIT compiler is compiling other code simultaneously (which may slow down execution of the current code).

  • Memory layout: how objects are located in memory, and how many cache lines must be loaded to access all the necessary data.

  • CPU branch-predictor state, which depends on previous code execution and may increase or decrease the number of branch mispredictions.

And so on and so forth. So even if you measure something in an isolated benchmark, this does not mean that the speed of the same code in production will be the same. It may differ by an order of magnitude. So before measuring something, ask yourself why you want to measure it. Usually you don't care how long some part of your program takes to execute. What you usually care about is the latency and throughput of the whole program. So profile the whole program and optimize the slowest parts; the thing you are measuring is probably not the slowest.
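To make the "polluted type profile" point concrete, here is a sketch of the difference between a monomorphic and a megamorphic call site (all identifiers are illustrative; timing code is omitted because the absolute numbers vary by machine):

```java
import java.util.function.IntUnaryOperator;

public class ProfilePollution {

    // A single call site: op.applyAsInt(i). How well the JIT optimizes it
    // depends on how many distinct implementations ever reach it.
    static long run(IntUnaryOperator op, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += op.applyAsInt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;

        // Monomorphic so far: only one lambda has reached run()'s call site,
        // so the JIT can devirtualize and inline it aggressively.
        long mono = run(x -> x + 1, n);

        // Feeding several distinct implementations through the same call
        // site pollutes its type profile; once it goes megamorphic, the
        // JIT typically falls back to a slower dispatched call.
        long a = run(x -> x * 2, n);
        long b = run(x -> x + 2, n);
        long c = run(x -> x - 1, n);

        System.out.println(mono + " " + a + " " + b + " " + c);
    }
}
```

The results are identical either way; only the machine code the JIT emits for `run` differs once the call site has seen multiple types.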

Tagir Valeev

The Java VM loads a class into memory the first time the class is used, so the difference between the 1st and 2nd runs may be caused by class loading.
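This is easy to observe by timing the first versus the second lookup of a class that hasn't been touched yet (the class chosen below is arbitrary; any not-yet-loaded class shows the same effect):

```java
public class ClassLoadCost {

    public static void main(String[] args) throws Exception {
        // The first reference triggers loading, linking and initialization.
        long t0 = System.nanoTime();
        Class<?> first = Class.forName("java.util.zip.Deflater");
        long firstLookup = System.nanoTime() - t0;

        // The class is already loaded now, so the lookup is much cheaper.
        long t1 = System.nanoTime();
        Class<?> second = Class.forName("java.util.zip.Deflater");
        long secondLookup = System.nanoTime() - t1;

        System.out.println("first: " + firstLookup + " ns, second: " + secondLookup + " ns");
        System.out.println(first == second); // the same Class object both times
    }
}
```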

tkob