Why is performance of a simple check encapsulated in a function two times more efficient than a direct check?

Question

Below are two versions of code calculating the number of triples in an array adding up to zero. One uses a function call to make the actual test, and the other performs the test in the body of the function.

It exhibits an interesting behaviour in terms of performance time. The variant using function call performs two times faster. Why?

/**
 * Find triples of integers, which add up to zero
 */
public class SumCheck {

    public static void main(String[] args) {

        int a = 1000000;
        int b = 3000;
        int[] input = new int[b];
        for (int i = 0; i < b; i++) {
            input[i] = StdRandom.uniform(-a, a);
        }

        double startTime2 = System.currentTimeMillis() / 1000.0;
        int counter2 = count2(input);
        double endTime2 = System.currentTimeMillis() / 1000.0;
        System.out.printf("%d +(%.0f seconds)\n", counter2, endTime2 - startTime2);


        double startTime = System.currentTimeMillis() / 1000.0;
        int counter = count(input);
        double endTime = System.currentTimeMillis() / 1000.0;
        System.out.printf("%d +(%.0f seconds)\n", counter, endTime - startTime);

    }

    private static int count(int[] a) {
        int counter = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {
                for (int k = j + 1; k < a.length; k++) {
                    if (a[i] + a[j] + a[k] == 0)
                        counter++;
                }
            }
        }
        return counter;
    }

    // same as count function but comparison is being done through a function call
    private static int count2(int[] a) {
        int counter = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {
                for (int k = j + 1; k < a.length; k++) {
                    counter = counter + check(a, i, j, k);
                }
            }
        }
        return counter;
    }

    private static int check(int[] a, int i, int j, int k) {
        if (a[i] + a[j] + a[k] == 0) {
            return 1;
        }
        return 0;
    }
}

In particular, one of the runs yields the following times: 12 seconds, 33 seconds.

It is a bit unpredictable what the JIT (just in time) compiler is doing. Maybe it has compiled the method since it was used so often. — Henry, Jul 06 '20 at 08:21
https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java - read this on how to write micro-benchmarks. — Amongalen, Jul 06 '20 at 08:32
If you call the `check` method enough, it will get inlined so it's unlikely you will find a diff between the 2 `count` method with a proper benchmark. It would also help to monitor which method gets compiled and when. — assylias, Jul 06 '20 at 08:38
Thank you! I know, this is not a perfect benchmark, but just simple swapping of these two executions shows that the twofold difference still persists. So, the problem must probably lie with the way the code is optimized by Java. — Pavlo Maistrenko, Jul 06 '20 at 09:23
It seems strange. Where in C you have to use `inline` to have a function optimized (by having its body embedded into the calling function's body,) this is just behaves to the contrary. — Pavlo Maistrenko, Jul 06 '20 at 09:27
@PavloMaistrenko not really, in C everything is compiled down to machine code so inlining saves the function call overhead. In Java it may well be that the big method is interpreted and just the little often used method gets compiled. The JIT trades compilation time against run time. If something is executed just once there is not much to gain compiling it. — Henry, Jul 06 '20 at 09:51
The difference seems to disappear if `check` returns a boolean instead of an int. Check [this](https://stackoverflow.com/q/62412194/13149581) question for a probably similar case with a detailed answer. — , Jul 09 '20 at 16:44

score 1 · Answer 1 · answered Jul 17 '20 at 15:40

It's not.

If you modify

if(a[i] + a[j] + a[k] == 0) {
  counter++
}

to

counter = counter + (a[i]+b[i]+c[i] == 0 ? 1 : 0)

i.e. inline check into count2 by hand to create count, both variants take exactly the same amount of time.

So, why might adding 1 or 0 be faster than if and increment?

This question has a similar case and a very detailed answer. It claims that java is transforming these ternary expressions of two integers into a special operation on the cpu, CMOV. CMOV wins compared to a normal if - else construct if the jump can't be predicted properly. We can check this thesis by making the job easier for our branch predictor: Instead of using a=1000000, we use a=0. Now, the if will be taken all the time, and both methods are equally fast.

However, if we use an even larger a, a=800_000_000, there are single-digit cases where a[i] + a[j] + a[k] == 0 holds, but if increment is still about factor 2 slower. That's very surprising to me because if only 10 out of 3000**3 branches are taken, the branch predictor should be quite good at predicting these as not-taken and 10 outliers can't slow the process down that much. However, I don't know enough about java and java benchmarks to investigate that further.

score 0 · Answer 2 · answered Jul 25 '20 at 21:02

I tried following the same steps of an already linked question, which lead me to the conclusion that we are dealing with a different situation. Since I am new to JMH and this case seems to require a more careful handling, I'll try to post enough code to make it easier to spot eventual flaws. The most important and relevant bit of information I came across in the process is that you should avoid using loops in benchmarks unless you are measuring the loop itself.

Here's an attempt at a better benchmark, I added an inlined version and the base method for a rough estimate of the random number generation overhead:

    package org.sample;                                   
    import java.util.*;
    import java.util.concurrent.*;
    import org.openjdk.jmh.annotations.*;
     
    @State(Scope.Thread)
    public class MyBenchmark {
        public static int counter1=0;
        public static int counter2=0;
        public static int counter3=0;
          
        @Benchmark
        public static int[] base() {
            int[] input=new int[3];
            for(int i=0;i<3;i++){
                input[i]=ThreadLocalRandom.current().nextInt(1000000*2) - 1000000;
            }
            return input;
        }
    
    
        @Benchmark
        public static int ifInc() {
            int[] input=new int[3];
            for(int i=0;i<3;i++){
                input[i]=ThreadLocalRandom.current().nextInt(1000000*2) - 1000000;
            }
            if (input[0] + input[1] + input[2] == 0){
                counter1++;
            }
            return counter1;
        }
    
        @Benchmark
        public static int method() {
            int[] input=new int[3];
            for(int i=0;i<3;i++){
                input[i]=ThreadLocalRandom.current().nextInt(1000000*2) - 1000000;
            }
            counter2 = counter2 + check(input, 0, 1, 2);
            return counter2;
        }
    
        @Benchmark
        public static int inline() {
            int[] input=new int[3];
            for(int i=0;i<3;i++){
                input[i]=ThreadLocalRandom.current().nextInt(1000000*2) - 1000000;
            }
            counter3 = counter3 + (input[0]+input[1]+input[2] == 0 ? 1 : 0);
            return counter3;
        }
    
    
        public static int check(int[] a, int i, int j, int k) {
            if (a[i] + a[j] + a[k] == 0) {
                return 1;
            }
            return 0;
        }
    }

    Benchmark           Mode  Cnt   Score   Error  Units
    MyBenchmark.base    avgt   25  19,567 ? 0,077  ns/op
    MyBenchmark.ifInc   avgt   25  20,461 ? 0,087  ns/op
    MyBenchmark.inline  avgt   25  22,265 ? 0,614  ns/op
    MyBenchmark.method  avgt   25  22,018 ? 0,143  ns/op

And this is an attempt at getting rid of the overhead, while still avoiding loop optimisations, with somewhat consistent results:

    package org.sample;
    import java.util.*;
    import java.util.concurrent.*;
    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;
    
    @State(Scope.Thread)
    public class MyLoopBenchmark {
        Random random=new Random();
        static int a = 1000000;
        static int b = 300;
        static int counter1 = 0;
        static int counter2 = 0;
        static int counter3 = 0;
        static int[] input = new int[b];
        @Setup(Level.Iteration)
        public void prepare() {
            for (int q = 0; q < b; q++) {
                input[q] = a-random.nextInt(a*2);
            }
        }
        public static int ifInc(int i,int j,int k) {
            if (i+j+k == 0){
                counter1++;
            }
            return counter1;
        }
    
        public static int method(int i,int j,int k) {
            counter2 = counter2 + check(i, j, k);
            return counter2;
        }
    
        public static int inline(int i,int j,int k) {
            counter3 = counter3 + (i+j+k == 0 ? 1 : 0);
            return counter3;
        }
    
    
        public static int check(int i, int j, int k) {
           if (i + j + k == 0) {
                return 1;
            }
            return 0;
        }
        @Benchmark
        public void measureIf(Blackhole bh) {
            for (int i = 0; i < input.length; i++) {
                for (int j = 0; j < input.length; j++) {
                    for (int k = 0; k < input.length; k++) {
                        bh.consume(ifInc(input[i],input[j],input[k]));
                    }
                }
            }
        }
        @Benchmark
        public void measureMethod(Blackhole bh) {
            for (int i = 0; i < input.length; i++) {
                for (int j = 0; j < input.length; j++) {
                    for (int k = 0; k < input.length; k++) {
                        bh.consume(method(input[i],input[j],input[k]));
                    }
                }
            }
        }
        @Benchmark
        public void measureInline(Blackhole bh) {
            for (int i = 0; i < input.length; i++) {
                for (int j = 0; j < input.length; j++) {
                    for (int k = 0; k < input.length; k++) {
                        bh.consume(inline(input[i],input[j],input[k]));
                    }
                }
            }
        }
    }

    Benchmark                      Mode  Cnt    Score   Error  Units
    MyLoopBenchmark.measureIf      avgt   25  123,262 ? 0,660  ms/op
    MyLoopBenchmark.measureInline  avgt   25  139,877 ? 0,447  ms/op
    MyLoopBenchmark.measureMethod  avgt   25  140,355 ? 0,482  ms/op

Some attempts at measuring a whole loop showed the ifInc method containing loop being slower in the first iterations, until an impressive and unpredictable optimisation took place, while the other two loops took a more constant time. The results where quite sensible to seemingly unimportant changes and even a perfectly designed benchmark of this kind would be misleading without some clearly defined conditions.

Why is performance of a simple check encapsulated in a function two times more efficient than a direct check?

2 Answers2