Java 2D array fill - innocent optimization caused terrible slowdown

Question

I've tried to optimize a filling of square two-dimensional Java array with sums of indices at each element by computing each sum once for two elements, opposite relative to the main diagonal. But instead of speedup or, at least, comparable performance, I've got 23(!) times slower code.

My code:

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(ArrayFill.N * ArrayFill.N)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ArrayFill {
    public static final int N = 8189;
    public int[][] g;

    @Setup
    public void setup() { g = new int[N][N]; }

    @GenerateMicroBenchmark
    public int simple(ArrayFill state) {
        int[][] g = state.g;
        for(int i = 0; i < g.length; i++) {
            for(int j = 0; j < g[i].length; j++) {
                g[i][j] = i + j;
            }
        }
        return g[g.length - 1][g[g.length - 1].length - 1];
    }

    @GenerateMicroBenchmark
    public int optimized(ArrayFill state) {
        int[][] g = state.g;
        for(int i = 0; i < g.length; i++) {
            for(int j = 0; j <= i; j++) {
                g[j][i] = g[i][j] = i + j;
            }
        }
        return g[g.length - 1][g[g.length - 1].length - 1];
    }
}

Benchmark results:

Benchmark               Mode     Mean   Mean error    Units
ArrayFill.simple        avgt    0.907        0.008    ns/op
ArrayFill.optimized     avgt   21.188        0.049    ns/op

The question:
How could so tremendous performance drop be explained?

_{P. S. Java version is 1.8.0-ea-b124, 64-bit 3.2 GHz AMD processor, benchmarks were executed in a single thread.}

You're might wanna read this: http://stackoverflow.com/questions/16699247/what-is-cache-friendly-code — Mysticial, Feb 07 '14 at 23:08
@Mysticial I don't believe cache contention could cause x23 slowdown — leventov, Feb 07 '14 at 23:10
How does the performance change if you push that `8189` further away from `8192`? — Mysticial, Feb 07 '14 at 23:22
In your "optimized" version you reduce the amount of time that you compare and iterate, but you increase (by a lot) your array accesses, wich, out of the 3 is the most expensive. See: http://stackoverflow.com/questions/1428928/array-access-optimization — pedromss, Feb 07 '14 at 23:54
@Mysticial I wanted inner arrays to take exactly 8 pages, but apparently I forgot to apply `-XX:+UseCompressedOops`, and they took 8 pages + 4 bytes :( — leventov, Feb 08 '14 at 00:47

score 13 · Answer 1 · edited Jun 20 '20 at 09:12

A side note: Your "optimized" version mightn't be faster at all, even when we leave all possible problems aside. There are multiple resources in a modern CPU and saturating one of them may stop you from any improvements. What I mean: The speed may be memory-bound, and trying to write twice as fast may in one iteration may change nothing at all.

I can see three possible reasons:

Your access pattern may enforce bound checks. In the "simple" loop they can be obviously eliminated, in the "optimized" only if the array is a square. It is, but this information is available only outside of the method (moreover a different piece of code could change it!).
The memory locality in your "optimized" loop is bad. It accesses essentially random memory locations as there's nothing like a 2D array in Java (only an array of arrays for which new int[N][N] is a shortcut). When iterating column-wise, you use only a single int from each loaded cacheline, i.e., 4 bytes out of 64.
The memory prefetcher can have a problem with your access pattern. The array with its 8189 * 8189 * 4 bytes is too big to fit in any cache. Modern CPUs have a prefetcher allowing to load a cache line during in advance, when it spots a regular access pattern. The capabilities of the prefetchers vary a lot. This might be irrelevant here, as you're only writing, but I'm not sure if it's possible to write into a cache-line which hasn't been fetched.

I guess the memory locality is the main culprit:

I added a method "reversed" which works juts like simple, but with

g[j][i] = i + j;

instead of

g[i][j] = i + j;

This "innocuous" change is a performance desaster:

Benchmark                                Mode   Samples         Mean   Mean error    Units
o.o.j.s.ArrayFillBenchmark.optimized     avgt        20       10.484        0.048    ns/op
o.o.j.s.ArrayFillBenchmark.reversed      avgt        20       20.989        0.294    ns/op
o.o.j.s.ArrayFillBenchmark.simple        avgt        20        0.693        0.003    ns/op

Note, I don't read anything from "the big" array memory in the tight loop, therefore your 2nd and 3rd points are irrelevant. I'm not right? — leventov, Feb 08 '14 at 00:40
@leventov: You don't, but the CPU possibly has to. AFAIK all communication between the cache and memory uses cache-lines as the smallest unit. The CPU can request a specific address and get the corresponding part of the cache-line *first*, but it always gets a whole line. I guess, writing isn't more flexible. — maaartinus, Feb 08 '14 at 00:45
I might have missed the point of your comment. You do access `g[j][i]` this would ideally mean some address like `4 * (8189 * i + j)`. This is bad enough, but as I wrote, there are no 2D arrays in Java, so you basically access a random location. — maaartinus, Feb 08 '14 at 00:47
maaartinus is correct, a "store" to memory is essentially an enhanced read operation in (almost) all cases - you fetch the line, plus guarantee ownership to retain cache coherency. — Leeor, Feb 09 '14 at 22:01

score 2 · Answer 2 · answered Feb 08 '14 at 08:37

I wrote version that works faster than "simple". But, I don't know why it is faster (. Here is the code:

class A {
  public static void main(String[] args) {
    int n = 8009;

    long st, en;

    // one
    int gg[][] = new int[n][n];
    st = System.nanoTime();
    for(int i = 0; i < n; i++) {
      for(int j = 0; j < n; j++) {
        gg[i][j] = i + j; 
      }
    }
    en = System.nanoTime();

    System.out.println("\nOne time " + (en - st)/1000000.d + " msc");

    // two
    int g[][] = new int[n][n];
    st = System.nanoTime();
    int odd = (n%2), l=n-odd;
    for(int i = 0; i < l; ++i) {
      int t0, t1;   
      int a0[] = g[t0 = i];
      int a1[] = g[t1 = ++i];
      for(int j = 0; j < n; ++j) {
        a0[j] = t0 + j;
        a1[j] = t1 + j;
      }
    }
    if(odd != 0)
    {
      int i = n-1;
      int a[] = g[i];
      for(int j = 0; j < n; ++j) {
        a[j] = i + j;
      }
    }
    en = System.nanoTime();
    System.out.println("\nOptimized time " + (en - st)/1000000.d + " msc");

    int r = g[0][0]
    //  + gg[0][0]
    ;
    System.out.println("\nZZZZ = " + r);

  }
}

The results are:

One time 165.177848 msc

Optimized time 99.536178 msc

ZZZZ = 0

Can someone explain me we why it is faster?

I see, you allocated a new array for the second run, but still, did you try changing the order of "unoptimized" and "optimized" runs? — fiktor, Feb 08 '14 at 15:41
Yes, it differs a little, but still the second version is better. — Chen Gupta, Feb 08 '14 at 15:53

VoidVolker · Answer 3 · 2014-02-08T15:21:10.360

http://www.learn-java-tutorial.com/Arrays.cfm#Multidimensional-Arrays-in-Memory

Picture: http://www.learn-java-tutorial.com/images/4715/Arrays03.gif

int[][] === array of arrays of values

int[] === array of values

class A {
    public static void main(String[] args) {
        int n = 5000;

        int g[][] = new int[n][n];
        long st, en;

        // one
        st = System.nanoTime();
        for(int i = 0; i < n; i++) {
            for(int j = 0; j < n; j++) {
                g[i][j] = 10; 
            }
        }
        en = System.nanoTime();
        System.out.println("\nTwo time " + (en - st)/1000000.d + " msc");

        // two
        st = System.nanoTime();
        for(int i = 0; i < n; i++) {
            g[i][i] =  20;
            for(int j = 0; j < i; j++) {
                g[j][i] = g[i][j] = 20; 
            }
        }
        en = System.nanoTime();
        System.out.println("\nTwo time " + (en - st)/1000000.d + " msc");

        // 3
        int arrLen = n*n;
        int[] arr = new int[arrLen];
        st = System.nanoTime();
        for(int i : arr) {
            arr[i] = 30;
        }
        en = System.nanoTime();
        System.out.println("\n3   time " + (en - st)/1000000.d + " msc");

        // 4
        st = System.nanoTime();
        int i, j;
        for(i = 0; i < n; i++) {
            for(j = 0; j < n; j++) {
                arr[i*n+j] = 40;
            }
        }
        en = System.nanoTime();
        System.out.println("\n4   time " + (en - st)/1000000.d + " msc");
    }
}

Two time 71.998012 msc

Two time 551.664166 msc

3 time 63.74851 msc

4 time 57.215167 msc

P.S. I'am not a java spec =)

score 0 · Answer 4 · answered Feb 08 '14 at 16:24

I see, you allocated a new array for the second run, but still, did you try changing the order of "unoptimized" and "optimized" runs? – fikto

I changed order of them and optimized it a little bit:

class A {
  public static void main(String[] args) {
    int n = 8009;
    double q1, q2;
    long st, en;

    // two
    int g[][] = new int[n][n];
    st = System.nanoTime();
    int odd = (n%2), l=n-odd;
    for(int i = 0; i < l; ++i) {
      int t0, t1;   
      int a0[] = g[t0 = i];
      int a1[] = g[t1 = ++i];
      for(int j = 0; j < n; ++j, ++t0, ++t1) {
        a0[j] = t0;
        a1[j] = t1;
      }
    }
    if(odd != 0)
    {
      int i = n-1;
      int a[] = g[i];
      for(int j = 0; j < n; ++j, ++i) {
        a[j] = i;
      }
    }
    en = System.nanoTime();
    System.out.println("Optimized time " + (q1=(en - st)/1000000.d) + " msc");

    // one
    int gg[][] = new int[n][n];
    st = System.nanoTime();
    for(int i = 0; i < n; i++) {
      for(int j = 0; j < n; j++) {
        gg[i][j] = i + j; 
      }
    }
    en = System.nanoTime();

    System.out.println("One time " + (q2=(en - st)/1000000.d) + " msc");

    System.out.println("1 - T1/T2 = " + (1 - q1/q2));

  }
}

And results are:

Optimized time 99.360293 msc
One time 162.23607 msc
1 - T1/T2 = 0.3875573231033026

It's better to edit the original answer and notify fictor using mention in comments: @fictor — leventov, Feb 08 '14 at 16:27

Java 2D array fill - innocent optimization caused terrible slowdown

4 Answers4

I guess the memory locality is the main culprit: