The TL;DR version, for those who don't want the background, is the following specific question:
Question
Why doesn't Java have an implementation of true multidimensional arrays? Is there a solid technical reason? What am I missing here?
Background
Java has multidimensional arrays at the syntax level, in that one can declare

```java
int[][] arr = new int[10][10];
```

but it seems that this is really not what one might have expected. Rather than having the JVM allocate a contiguous block of RAM big enough to store 100 `int`s, it comes out as an array of arrays of `int`s: so each layer is a contiguous block of RAM, but the thing as a whole is not. Accessing `arr[i][j]` is thus rather slow: the JVM has to

- find the `int[]` stored at `arr[i]`;
- index into this to find the `int` stored at `arr[i][j]`.
This involves querying an object to go from one layer to the next, which is rather expensive.
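Spelled out in Java (my own illustration of what the access desugars to, not anything from the spec), `arr[i][j]` amounts to two dependent loads:

```java
public class TwoStepLookup {
    // Equivalent of arr[i][j]: first load the row reference, then index into it.
    public static int lookup(int[][] arr, int i, int j) {
        int[] row = arr[i]; // first load: the reference stored at arr[i]
        return row[j];      // second load: the int stored at row[j]
    }

    public static void main(String[] args) {
        int[][] arr = new int[10][10];
        arr[3][7] = 42;
        System.out.println(lookup(arr, 3, 7)); // prints 42
    }
}
```

The second load cannot begin until the first completes, which is what makes the access pattern hostile to both the optimiser and the cache.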
Why Java does this
At one level, it's not hard to see why this can't be optimised to a simple scale-and-add lookup even if it were all allocated in one fixed block. The problem is that `arr[3]` is a reference all of its own, and it can be changed. So although arrays are of fixed size, we could easily write

```java
arr[3] = new int[11];
```

and now the scale-and-add is screwed because this layer has grown. You'd need to know at runtime whether everything is still the same size as it used to be. In addition, of course, this will then get allocated somewhere else in RAM (it'll have to be, since it's bigger than what it's replacing), so it's not even in the right place for scale-and-add.
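To make this concrete, here is a small demo (my own illustration): the row reference really can be swapped for a longer one, and the language doesn't even require the rows to start out the same length — "rectangular" is a property of a particular program, not of the type.

```java
public class RaggedDemo {
    public static void main(String[] args) {
        int[][] arr = new int[10][10];
        arr[3] = new int[11];              // legal: the row reference is mutable
        System.out.println(arr[3].length); // prints 11 -- the "rectangle" is gone

        // Raggedness is allowed at allocation too: only the first dimension is required.
        int[][] triangle = new int[3][];
        for (int i = 0; i < 3; i++)
            triangle[i] = new int[i + 1];
        System.out.println(triangle[2].length); // prints 3
    }
}
```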
What's problematic about it
It seems to me that this is not ideal, for two reasons.
For one, it's slow. A test I ran with the methods below, summing the contents of a single-dimensional or multidimensional array, took nearly twice as long (714 seconds vs 371 seconds) for the multidimensional case (an `int[1000000]` and an `int[100][100][100]` respectively, filled with random `int` values, run 1000000 times with warm cache).
```java
public static long sumSingle(int[] arr) {
    long total = 0;
    for (int i = 0; i < arr.length; i++)
        total += arr[i];
    return total;
}

public static long sumMulti(int[][][] arr) {
    long total = 0;
    for (int i = 0; i < arr.length; i++)
        for (int j = 0; j < arr[0].length; j++)
            for (int k = 0; k < arr[0][0].length; k++)
                total += arr[i][j][k];
    return total;
}
```
Secondly, because it's slow, it thereby encourages obscure coding. If you encounter something performance-critical that would naturally be done with a multidimensional array, you have an incentive to write it as a flat array instead, even if that makes the code unnatural and hard to read. You're left with an unpalatable choice: obscure code or slow code.
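For the record, the flat-array workaround usually looks something like this (my own sketch, not any standard API): store the 2-D data row-major in one contiguous `int[]`, and compute the index by scale-and-add yourself.

```java
public class FlatMatrix {
    private final int cols;
    private final int[] data; // one contiguous block, row-major

    public FlatMatrix(int rows, int cols) {
        this.cols = cols;
        this.data = new int[rows * cols];
    }

    // Scale-and-add by hand: one multiply, one add, no intermediate object lookup.
    public int get(int i, int j)         { return data[i * cols + j]; }
    public void set(int i, int j, int v) { data[i * cols + j] = v; }
}
```

Wrapped in a class it's tolerable; the obscurity arrives when the `i * cols + j` arithmetic leaks into the surrounding algorithm.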
What could be done about it
It seems to me that the basic problem could easily enough be fixed. The only reason, as we saw earlier, that it can't be optimised is that the structure might change. But Java already has a mechanism for making references unchangeable: declare them as `final`.
Now, just declaring it with

```java
final int[][] arr = new int[10][10];
```

isn't good enough, because it's only `arr` that is `final` here: `arr[3]` still isn't, and could be changed, so the structure might still change. But if we had a way of declaring things so that they were `final` throughout, except at the bottom layer where the `int` values are stored, then we'd have an entirely immutable structure, and it could all be allocated as one block and indexed with scale-and-add.
How it would look syntactically, I'm not sure (I'm not a language designer). Maybe

```java
final int[final][] arr = new int[10][10];
```

although admittedly that looks a bit weird. This would mean: `final` at the top layer; `final` at the next layer; not `final` at the bottom layer (else the `int` values themselves would be immutable).
Finality throughout would enable the JIT compiler to optimise this to give performance equal to that of a single-dimensional array, which would then take away the temptation to write flat-array code just to get around the slowness of multidimensional arrays.
(I hear a rumour that C# does something like this, although I also hear another rumour that the CLR implementation is so bad that it's not worth having... perhaps they're just rumours...)
Question
So why doesn't Java have an implementation of true multidimensional arrays? Is there a solid technical reason? What am I missing here?
Update
A bizarre side note: the difference in timings drops away to only a few percent if you use an `int` for the running total rather than a `long`. Why would there be such a small difference with an `int`, and such a big difference with a `long`?
Benchmarking code
Code I used for benchmarking, in case anyone wants to try to reproduce these results:
```java
import java.util.Random;

public class Multidimensional {
    public static long sumSingle(final int[] arr) {
        long total = 0;
        for (int i = 0; i < arr.length; i++)
            total += arr[i];
        return total;
    }

    public static long sumMulti(final int[][][] arr) {
        long total = 0;
        for (int i = 0; i < arr.length; i++)
            for (int j = 0; j < arr[0].length; j++)
                for (int k = 0; k < arr[0][0].length; k++)
                    total += arr[i][j][k];
        return total;
    }

    public static void main(String[] args) {
        final int iterations = 1000000;
        Random r = new Random();

        int[] arr = new int[1000000];
        for (int i = 0; i < arr.length; i++)
            arr[i] = r.nextInt();
        long total = 0;
        System.out.println(sumSingle(arr));
        long time = System.nanoTime();
        for (int i = 0; i < iterations; i++)
            total = sumSingle(arr);
        time = System.nanoTime() - time;
        // total is printed too, so the JIT can't eliminate the summing as dead code
        System.out.printf("Took %d ms for single dimension (total %d)%n", time / 1000000, total);

        int[][][] arrMulti = new int[100][100][100];
        for (int i = 0; i < arrMulti.length; i++)
            for (int j = 0; j < arrMulti[i].length; j++)
                for (int k = 0; k < arrMulti[i][j].length; k++)
                    arrMulti[i][j][k] = r.nextInt();
        System.out.println(sumMulti(arrMulti));
        time = System.nanoTime();
        for (int i = 0; i < iterations; i++)
            total = sumMulti(arrMulti);
        time = System.nanoTime() - time;
        System.out.printf("Took %d ms for multi dimension (total %d)%n", time / 1000000, total);
    }
}
```