The performance comparison of the code below on Java 8 is counter-intuitive.
import java.util.Arrays;

class Main {
    interface Dgemv {
        void dgemv(int n, double[] a, double[] x, double[] y);
    }

    // Variant 1: the per-column kernel is a plain private method.
    static final class Dgemv1 implements Dgemv {
        public void dgemv(int n, double[] a, double[] x, double[] y) {
            Arrays.fill(y, 0.0);
            for (int j = 0; j < n; ++j)
                dgemvImpl(x[j], j * n, n, a, y);
        }

        private void dgemvImpl(final double xj, final int aoff,
                               final int n, double[] a, double[] y) {
            for (int i = 0; i < n; ++i)
                y[i] += xj * a[i + aoff];
        }
    }

    // Variant 2: the per-column kernel lives in a short-lived functor object.
    static final class Dgemv2 implements Dgemv {
        public void dgemv(int n, double[] a, double[] x, double[] y) {
            Arrays.fill(y, 0.0);
            for (int j = 0; j < n; ++j)
                new DgemvImpl(x[j], j * n).dgemvImpl(n, a, y);
        }

        private static final class DgemvImpl {
            private final double xj;
            private final int aoff;

            DgemvImpl(double xj, int aoff) {
                this.xj = xj;
                this.aoff = aoff;
            }

            void dgemvImpl(final int n, double[] a, double[] y) {
                for (int i = 0; i < n; ++i)
                    y[i] += xj * a[i + aoff];
            }
        }
    }

    static long runDgemv(long niter, int n, Dgemv op) {
        double[] a = new double[n * n];
        double[] x = new double[n];
        double[] y = new double[n];
        long start = System.currentTimeMillis();
        for (long i = 0; i < niter; ++i) {
            op.dgemv(n, a, x, y);
        }
        return System.currentTimeMillis() - start;
    }

    static void testDgemv(long niter, int n, int mode) {
        Dgemv op = null;
        switch (mode) {
            case 1: op = new Dgemv1(); break;
            case 2: op = new Dgemv2(); break;
        }
        runDgemv(niter, n, op); // warm-up run, result discarded
        double sec = runDgemv(niter, n, op) * 1e-3;
        double gflps = (2.0 * n * n) / sec * niter * 1e-9;
        System.out.format("mode=%d,N=%d,%f sec,%f GFLPS\n", mode, n, sec, gflps);
    }

    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);
        long niter = (1L << 32) / (long) (2 * n * n);
        testDgemv(niter, n, 1);
        testDgemv(niter, n, 2);
    }
}
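As a sanity check (not part of the original benchmark), both variants perform the exact same sequence of floating-point operations, so they should agree bit-for-bit. A minimal standalone sketch, with the two kernels re-implemented inline:

```java
import java.util.Arrays;
import java.util.Random;

// Standalone sanity check: the static-style kernel and the functor-based
// kernel accumulate the same products in the same order, so the results
// must be identical, not merely close.
class DgemvCheck {
    // Variant 1 style: plain method with explicit xj/aoff parameters.
    static void dgemv1(int n, double[] a, double[] x, double[] y) {
        Arrays.fill(y, 0.0);
        for (int j = 0; j < n; ++j) {
            final double xj = x[j];
            final int aoff = j * n;
            for (int i = 0; i < n; ++i)
                y[i] += xj * a[i + aoff];
        }
    }

    // Variant 2 style: xj/aoff captured in a short-lived functor object.
    static final class Col {
        final double xj;
        final int aoff;
        Col(double xj, int aoff) { this.xj = xj; this.aoff = aoff; }
        void apply(int n, double[] a, double[] y) {
            for (int i = 0; i < n; ++i)
                y[i] += xj * a[i + aoff];
        }
    }

    static void dgemv2(int n, double[] a, double[] x, double[] y) {
        Arrays.fill(y, 0.0);
        for (int j = 0; j < n; ++j)
            new Col(x[j], j * n).apply(n, a, y);
    }

    public static void main(String[] args) {
        int n = 64;
        Random rnd = new Random(42);
        double[] a = new double[n * n], x = new double[n];
        for (int i = 0; i < a.length; ++i) a[i] = rnd.nextDouble();
        for (int i = 0; i < n; ++i) x[i] = rnd.nextDouble();
        double[] y1 = new double[n], y2 = new double[n];
        dgemv1(n, a, x, y1);
        dgemv2(n, a, x, y2);
        System.out.println(Arrays.equals(y1, y2) ? "MATCH" : "MISMATCH");
    }
}
```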
The result on Java 8 (1.8.0_60) and a Core i5-4570 (3.2 GHz) is:
$ java -server Main 500
mode=1,N=500,1.239000 sec,3.466102 GFLPS
mode=2,N=500,1.100000 sec,3.904091 GFLPS
The result of the same calculation on Java 7 (1.7.0_80) is:
mode=1,N=500,1.291000 sec,3.326491 GFLPS
mode=2,N=500,1.491000 sec,2.880282 GFLPS
It seems as if HotSpot optimizes the functor version more eagerly than the plain-method version, despite the additional object allocation.
Could anyone explain why Dgemv2 runs faster?
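For anyone who wants to dig in: one experiment I have not run here (so treat it as a suggestion, not a result) is to compare HotSpot's compilation and inlining decisions for the two variants using the standard diagnostic flags:

```shell
# Log which methods get JIT-compiled (and at what tier) for each variant.
java -server -XX:+PrintCompilation Main 500 > compilation.log

# Log inlining decisions; PrintInlining is a diagnostic flag and must be
# unlocked first.
java -server -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining Main 500 > inlining.log
```

Comparing the entries for dgemvImpl between the two logs should show whether the functor's constructor and method are inlined (and the allocation eliminated), which would make the two variants compile to similar loops.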
Edit:
More precise benchmark statistics from openjdk/jmh (thank you, Kayaman, for your comment).
N=500 / 1 sec x 20 warm-up iterations / 1 sec x 20 measurement iterations (10 sets)
Java 8 (1.8.0_60)
Benchmark Mode Cnt Score Error Units
MyBenchmark.runDgemv1 thrpt 200 6965.459 ± 2.186 ops/s
MyBenchmark.runDgemv2 thrpt 200 7329.138 ± 1.598 ops/s
Java 7 (1.7.0_80)
Benchmark Mode Cnt Score Error Units
MyBenchmark.runDgemv1 thrpt 200 7344.570 ± 1.994 ops/s
MyBenchmark.runDgemv2 thrpt 200 7358.988 ± 2.189 ops/s
From these statistics, it seems that Java 8 HotSpot does not optimize the plain-method version as well. One more thing I noticed is that the performance is about 10% better during some warm-up sections. Picking the extreme case:
N=500 / 1 sec x 8 warm-up iterations / 1 sec x 8 measurement iterations (10 sets)
Java 8 (1.8.0_60)
Benchmark Mode Cnt Score Error Units
MyBenchmark.runDgemv1 thrpt 80 6952.315 ± 11.483 ops/s
MyBenchmark.runDgemv2 thrpt 80 7719.843 ± 66.773 ops/s
Iterations between 9 sec and 15 sec of Dgemv2 consistently outperform the long-run average by about 5%. It seems HotSpot does not always produce faster code as the optimization procedure goes on.
My current guess is that the functor object in Dgemv2 actually disturbs HotSpot's optimization procedure, resulting in code that runs faster than the 'fully optimized' code.
Still, I am not at all clear on why this happens. Any answers and comments are welcome.
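One follow-up experiment that might narrow this down (again, a suggestion I have not run, not a result): since the fast window shows up during warm-up, it would be worth checking whether the intermediate tiered-compilation code is the faster one by controlling the compilation tiers directly:

```shell
# Disable tiered compilation entirely: only the final C2-compiled code runs,
# so the warm-up anomaly should disappear if tiering is the cause.
java -server -XX:-TieredCompilation Main 500

# For comparison, stop at tier 1 (C1 only) to measure the less-optimized
# compiled code on its own.
java -server -XX:TieredStopAtLevel=1 Main 500
```

If the -XX:-TieredCompilation numbers match the slower long-run JMH average, that would support the idea that the 'fully optimized' C2 code is genuinely slower here than one of the intermediate compilations.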