
The performance comparison of the code below on Java 8 is counter-intuitive.

import java.util.Arrays;

class Main {
    interface Dgemv {
        void dgemv(int n, double[] a, double[] x, double[] y);
    }

    static final class Dgemv1 implements Dgemv {
        public void dgemv(int n, double[] a, double[] x, double[] y) {
            Arrays.fill(y, 0.0);
            for (int j = 0; j < n; ++j)
                dgemvImpl(x[j], j * n, n, a, y);
        }

        private void dgemvImpl(final double xj, final int aoff,
                final int n, double[] a, double[] y) {
            for (int i = 0; i < n; ++i)
                y[i] += xj * a[i + aoff];
        }
    }

    static final class Dgemv2 implements Dgemv {
        public void dgemv(int n, double[] a, double[] x, double[] y) {
            Arrays.fill(y, 0.0);
            for (int j = 0; j < n; ++j)
                new DgemvImpl(x[j], j * n).dgemvImpl(n, a, y);
        }

        private static final class DgemvImpl {
            private final double xj;
            private final int aoff;

            DgemvImpl(double xj, int aoff) {
                this.xj = xj;
                this.aoff = aoff;
            }

            void dgemvImpl(final int n, double[] a, double[] y) {
                for (int i = 0; i < n; ++i)
                    y[i] += xj * a[i + aoff];
            }
        }
    }

    static long runDgemv(long niter, int n, Dgemv op) {
        double[] a = new double[n * n];
        double[] x = new double[n];
        double[] y = new double[n];
        long start = System.currentTimeMillis();
        for (long i = 0; i < niter; ++i) {
            op.dgemv(n, a, x, y);
        }
        return System.currentTimeMillis() - start;
    }

    static void testDgemv(long niter, int n, int mode) {
        Dgemv op = null;
        switch (mode) {
        case 1: op = new Dgemv1(); break;
        case 2: op = new Dgemv2(); break;
        }
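        // the first call warms up the JIT; only the second call is timed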
        runDgemv(niter, n, op);
        double sec = runDgemv(niter, n, op) * 1e-3;
        double gflps = (2.0 * n * n) / sec * niter  * 1e-9;
        System.out.format("mode=%d,N=%d,%f sec,%f GFLPS\n", mode, n, sec, gflps);
    }

    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);
        long niter = ((long) 1L << 32) / (long) (2 * n * n);
        testDgemv(niter, n, 1);
        testDgemv(niter, n, 2);
    }
}
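
For reference, with N=500 the driver fixes the total floating-point work per timed run at roughly 2^32 operations, so the printed GFLPS values are directly comparable:

niter = 2^32 / (2 * 500 * 500) = 8589   (integer division)
flops = niter * 2 * 500 * 500 ≈ 4.29e9
GFLPS = flops * 1e-9 / sec, e.g. 4.29 / 1.239 ≈ 3.47 for the first result below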

The result on Java 8 (1.8.0_60) and a Core i5 4570 (3.2 GHz) is:

$ java -server Main 500
mode=1,N=500,1.239000 sec,3.466102 GFLPS
mode=2,N=500,1.100000 sec,3.904091 GFLPS

The result of the same calculation on Java 7 (1.7.0_80) is:

mode=1,N=500,1.291000 sec,3.326491 GFLPS
mode=2,N=500,1.491000 sec,2.880282 GFLPS

It seems as if HotSpot optimizes functors more eagerly than static methods, despite the additional complexity.

Could anyone explain why Dgemv2 runs faster?

Edit:

More precise benchmark stats, measured with OpenJDK's JMH. (Thank you Kayaman for your comment.)
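
The benchmark class is roughly the following sketch; the exact state setup and annotation values are a reconstruction of my configuration rather than the verbatim code, and it assumes the Dgemv classes from the question are visible to it (e.g. declared public or placed in the same package):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(10)                                                        // the "10 forks" below
@Warmup(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
public class MyBenchmark {
    // assumes the Dgemv classes from the question are visible here
    // (e.g. declared public or compiled into the same package)
    static final int N = 500;

    final Main.Dgemv op1 = new Main.Dgemv1();
    final Main.Dgemv op2 = new Main.Dgemv2();
    double[] a, x, y;

    @Setup
    public void setup() {
        a = new double[N * N];
        x = new double[N];
        y = new double[N];
    }

    @Benchmark
    public double[] runDgemv1() {
        op1.dgemv(N, a, x, y);
        return y;                    // return the result so JMH keeps it alive
    }

    @Benchmark
    public double[] runDgemv2() {
        op2.dgemv(N, a, x, y);
        return y;
    }
}

The warmup and measurement iteration counts were changed between the two runs below (they can also be overridden with the JMH runner's -wi and -i options).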

N=500 / 1 sec x 20 warmup iterations / 1 sec x 20 measurement iterations (10 forks)

Java 8 (1.8.0_60)

Benchmark               Mode  Cnt     Score   Error  Units
MyBenchmark.runDgemv1  thrpt  200  6965.459 ± 2.186  ops/s
MyBenchmark.runDgemv2  thrpt  200  7329.138 ± 1.598  ops/s

Java 7 (1.7.0_80)

Benchmark               Mode  Cnt     Score   Error  Units
MyBenchmark.runDgemv1  thrpt  200  7344.570 ± 1.994  ops/s
MyBenchmark.runDgemv2  thrpt  200  7358.988 ± 2.189  ops/s

From these stats, it seems Java 8's HotSpot does not optimize the static-method version as well as Java 7's does. One more thing I noticed is that the throughput is about 10% better during part of the warmup phase. Picking the extreme case:

N=500 / 1 sec x 8 warmup iterations / 1 sec x 8 measurement iterations (10 forks)

Java 8 (1.8.0_60)

Benchmark               Mode  Cnt     Score    Error  Units
MyBenchmark.runDgemv1  thrpt   80  6952.315 ± 11.483  ops/s
MyBenchmark.runDgemv2  thrpt   80  7719.843 ? 66.773  ops/s

The Dgemv2 iterations between 9 sec and 15 sec consistently outperform the long-run average by about 5%. It seems HotSpot does not always produce faster code as the optimization procedure goes on.

My current guess is that the functor object in Dgemv2 actually disturbs the HotSpot optimization procedure, resulting in faster code than the 'fully optimized' code.

Still, I am not at all clear on why this happens. Any answers and comments are welcome.

  • First you'll need to make a proper benchmark. See [Java Microbenchmarking Harness](http://openjdk.java.net/projects/code-tools/jmh/). – Kayaman Aug 26 '15 at 12:37
  • I don't see much of a wrong benchmark, except the usage of `System.currentTimeMillis()` which must be changed to `System.nanoTime()`. You don't necessarily need to use jmh. – SpaceTrucker Aug 26 '15 at 12:48
  • On my machine I see a longer warmup phase on JDK 1.7.0_80 than on JDK 1.8.0_51. After warmup, performance is nearly identical on both JVMs. I just repeated the contents of main 10 times to see that effect. However, the comparisons are similar, with mode 2 being a bit faster than mode 1. – SpaceTrucker Aug 26 '15 at 13:08
  • Thanks for the comments. Dgemv1 performance seems to deteriorate after a longer warmup run on JDK 7... I will take a benchmark with jmh. – tatami goza Aug 26 '15 at 13:12
  • Seems like the same issue as [here](http://stackoverflow.com/q/30454904/3448419). `-XX:LoopUnrollLimit=100` makes difference. – apangin Aug 26 '15 at 21:56
  • Thanks apangin. It might be the answer. I will examine it with jmh again. – tatami goza Aug 27 '15 at 02:30

1 Answer


As apangin's comment suggests, this comes from HotSpot's loop unrolling optimization.

Running the same benchmark while varying the LoopUnrollLimit parameter via the JVM option '-XX:LoopUnrollLimit=??' (the default value is 60 on 64-bit x86), the benchmark above appears to sit right on the edge of the loop unrolling decision.
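
The flag can be passed to the forked benchmark JVMs with the JMH runner's -jvmArgs option; for a quick standalone check, the Main class from the question can be run the same way, for example:

$ java -server -XX:LoopUnrollLimit=90 Main 500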

Java 8 (1.8.0_60)

LoopUnrollLimit |       30|       60|       90|      120|      240|
-------------------------------------------------------------------
Dgemv1 (ops/s)  | 6322.156| 6967.511| 7632.563| 8811.307| 8811.552|
Dgemv2 (ops/s)  | 5631.505| 7328.529| 7689.380| 8774.118| 8780.631|

Java 7 (1.7.0_80)

LoopUnrollLimit |       30|       60|       90|      120|      240|
-------------------------------------------------------------------
Dgemv1 (ops/s)  | 6309.920| 7345.467| 7611.573| 8805.658| 8795.785|
Dgemv2 (ops/s)  | 5632.850| 7345.467| 7658.165| 8758.698| 8762.981|

In OpenJDK's HotSpot source code, the LoopUnrollLimit parameter is compared against the size of the node list, that is, the internal representation of the code inside the target loop (as in here).

Considering that JVM 7 and JVM 8 show no difference once the loop is fully unrolled (>= 120), the cause may be that the way the internal nodes are emitted changed in JDK 8, enlarging the node list of the static method enough to cross the threshold around LoopUnrollLimit=60.
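
Whether the inner loop actually gets unrolled at a given limit can also be checked directly from the JIT output, assuming the hsdis disassembler library is installed, for example:

$ java -server -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:LoopUnrollLimit=60 Main 500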