92

Recently I noticed declaring an array containing 64 elements is a lot faster (>1000 fold) than declaring the same type of array with 65 elements.

Here is the code I used to test this:

public class Tests{
    public static void main(String args[]){
        long start = System.nanoTime(); // System.nanoTime() returns a long
        int job = 100000000; // 100 million iterations
        for(int i = 0; i < job; i++){
            double[] test = new double[64];
        }
        long end = System.nanoTime();
        System.out.println("Total runtime = " + (end-start)/1000000 + " ms");
    }
}

This runs in approximately 6 ms; if I replace new double[64] with new double[65], it takes approximately 7 seconds. The problem becomes exponentially more severe as the job is spread across more and more threads, which is where my problem originates from.

This problem also occurs with different array types such as int[65] or String[65]. It does not occur with string literals, even long ones (String test = "many characters";), but does start occurring when this is changed to String test = i + "";
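
For reference, the two String variants being compared would look like this inside the same loop (a sketch of what the question describes, not code from the original post):

for(int i = 0; i < job; i++){
    String test = "many characters"; // string literal: no new object per iteration
}
for(int i = 0; i < job; i++){
    String test = i + ""; // builds a new String each iteration: triggers the slowdown
}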

I was wondering why this is the case and if it is possible to circumvent this problem.

Sipko
  • http://en.wikipedia.org/wiki/CPU_cache - 64 elements will cause fewer cache invalidations than 65 elements (due to cache line sizes). – Oded Sep 15 '13 at 08:40
  • Off-note: `System.nanoTime()` should be preferred over `System.currentTimeMillis()` for benchmarking. – rocketboy Sep 15 '13 at 08:41
  • @Oded: 64 doubles will not fit in a cache line. Besides which, what needs to be cached in the OP's code? – Oliver Charlesworth Sep 15 '13 at 08:41
  • @OliCharlesworth - Still, the issue here is one of cache invalidation, no? – Oded Sep 15 '13 at 08:42
  • @Oded: From the performance behaviour, it sort of sounds like it (although 1000x is a very, very big discrepancy). But I can't see anything that actually requires caching: memory is being allocated, but nothing is actually done with it. – Oliver Charlesworth Sep 15 '13 at 08:42
  • I am just curious: are you under Linux? Does the behaviour change with the OS? – bsd Sep 15 '13 at 08:44
  • I can confirm this behavior on a Windows 7 64-bit machine running Java 7u40: 8 ms for 64 elements (7 ms for 63), but 29 seconds (29016 ms) when declaring 65 doubles. Timed using System.nanoTime(). – initramfs Sep 15 '13 at 08:48
  • How on earth did this question get a downvote? – Rohit Jain Sep 15 '13 at 08:48
  • @bsd I have tested this in both Windows 7 and Ubuntu, and the problem occurs in both. – Sipko Sep 15 '13 at 08:49
  • FWIW, I see similar performance discrepancies if I run this code with `byte` instead of `double`. – Oliver Charlesworth Sep 15 '13 at 08:51
  • @rocketboy I edited the code to use System.nanoTime(). – Sipko Sep 15 '13 at 08:52
  • I downvoted, because the benchmarking method is flawed. If I use a benchmarking tool like Caliper, there is just a 2 ns difference between allocating these arrays. – Thomas Jungblut Sep 15 '13 at 08:53
  • @ThomasJungblut: So what explains the discrepancy in the OP's experiment? – Oliver Charlesworth Sep 15 '13 at 08:54
  • What about the garbage collector? It is possible that it is in play here, but only when using 65 elements? – RaphMclee Sep 15 '13 at 08:56
  • GC? My guess is that it has different behaviour in the memory *allocator*. – Karoly Horvath Sep 15 '13 at 08:58
  • @OliCharlesworth Look at the output of `-XX:+PrintCompilation`; it is immediately visible that one optimization was done a lot later than in the other case ;) – Thomas Jungblut Sep 15 '13 at 09:00
  • @ThomasJungblut: Sure, but what is causing that behavioural difference in the JVM? – Oliver Charlesworth Sep 15 '13 at 09:08
  • After adding `test[0] = Math.random();`, I see no difference. – Prince John Wesley Sep 15 '13 at 09:25
  • @OliCharlesworth Difficult to know, as we know nothing about the hotspot code. However, I think that the 64*8-byte array allocation is some kind of threshold in the "worthiness" function for removal. – Thomas Jungblut Sep 15 '13 at 09:26
  • @RaphMclee It seems you are correct: when I use double[65] it calls the garbage collector approximately 18 times (I used -verbose:gc), whereas it never does this when using double[64]. There may be an issue here, as calling the garbage collector 18 times should not take 5 seconds. – Sipko Sep 15 '13 at 10:49
  • It seems like in Windows x86 both versions take ~7 seconds, while in Windows x64 one takes 6 ms and the other takes 7 seconds. – Alon Gubkin Sep 15 '13 at 11:03
  • Also, the assembly generated by both versions is identical (`-XX:+PrintAssembly`). – Alon Gubkin Sep 15 '13 at 11:05
  • @RohitJain - Probably because it got far too many upvotes for such a poorly-constructed benchmark. – Hot Licks Sep 15 '13 at 13:27
  • "There may be an issue here as calling the garbage collector 18 times should not take 5 seconds." Why not? It needs to collect 100 million objects. – Hot Licks Sep 15 '13 at 13:28
  • Minor point: *Declaring* an array takes essentially zero time. *Allocating* an array (or any Java object) is relatively expensive. – Hot Licks Sep 15 '13 at 14:15
  • @HotLicks If I add up the times in the print statements from the garbage collector, they do not add up to more than 100 ms. – Sipko Sep 15 '13 at 14:53
  • I have my doubts whether you did your arithmetic correctly. – Hot Licks Sep 15 '13 at 18:58

2 Answers

89

You are observing behavior caused by the optimizations done by the JIT compiler of your Java VM. This behavior is reproducibly triggered with scalar arrays of up to 64 elements, and is not triggered with arrays larger than 64 elements.

Before going into details, let's take a closer look at the body of the loop:

double[] test = new double[64];

The body has no effect (no _observable behavior_). That means it makes no difference outside of the program execution whether this statement is executed or not. The same is true for the whole loop. So the code optimizer may translate the loop into something (or nothing) with the same functional behavior but different timing behavior.

For benchmarks you should at least adhere to the following two guidelines. If you had done so, the difference would have been significantly smaller. A corrected version is sketched after the list.

  • Warm up the JIT compiler (and optimizer) by executing the benchmark several times.
  • Use the result of every expression and print it at the end of the benchmark.
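
Such a corrected benchmark might look like the following. This is a minimal sketch following those two guidelines; the method name run and the warm-up count are illustrative choices, not code from the question:

public class Tests{
    public static void main(String args[]){
        for(int i = 0; i < 10; i++){
            run(); // warm-up: let the JIT compile and optimize the loop
        }
        long start = System.nanoTime();
        double result = run();
        long end = System.nanoTime();
        // print the result so the work cannot be eliminated as dead code
        System.out.println("Result = " + result);
        System.out.println("Total runtime = " + (end-start)/1000000 + " ms");
    }

    static double run(){
        double sum = 0;
        for(int i = 0; i < 100000000; i++){
            double[] test = new double[64];
            test[0] = i;    // use the array ...
            sum += test[0]; // ... and feed every iteration's value into the result
        }
        return sum;
    }
}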

Now let's go into details. Not surprisingly, there is an optimization that is triggered for scalar arrays no larger than 64 elements. The optimization is part of escape analysis: it puts small objects and small arrays onto the stack instead of allocating them on the heap - or, better still, optimizes them away entirely. You can find some information about it in Brian Goetz's 2005 article *Urban performance legends, revisited*.
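
To get a feel for what "escaping" means here, compare the following two methods (an illustrative sketch; the method names are made up): in the first, the array reference never leaves the method, so the allocation is a candidate for elimination; in the second, the reference escapes to the caller and must stay on the heap.

static double noEscape(){
    double[] a = new double[64]; // reference never leaves this method:
    a[0] = 1.0;                  // a candidate for scalar replacement / elimination
    return a[0];
}

static double[] escapes(){
    return new double[64]; // reference is handed to the caller: must be heap-allocated
}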

The optimization can be disabled with the command line option -XX:-DoEscapeAnalysis. The magic value 64 for scalar arrays can also be changed on the command line. If you execute your program as follows, there will be no difference between arrays with 64 and 65 elements:

java -XX:EliminateAllocationArraySizeLimit=65 Tests
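
Conversely, running with escape analysis disabled (using the option mentioned above) should make the 64-element version just as slow as the 65-element one, which is a quick way to confirm that this optimization is responsible:

java -XX:-DoEscapeAnalysis Tests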

Having said that, I strongly discourage using such command line options. I doubt they make a huge difference in a realistic application. I would only use them if I were absolutely convinced of the necessity - and not based on the results of some pseudo-benchmark.

nosid
  • But why does the optimizer detect that the array of size 64 is removable, but not the one of size 65? – ug_ Sep 15 '13 at 09:30
  • The same way you could ask why does adding a temporary reference (`test`) prevent it? Is the optimizer *that* dumb? – Karoly Horvath Sep 15 '13 at 09:34
  • @ns47731: Why do you care? The important part is that this behavior will not show up in actual code. – nosid Sep 15 '13 at 09:35
  • From what I saw in the output of `-XX:+LogCompilation`: escape analysis triggered and `allocation type 715` was eliminated. I'm no hotspot engineer, so maybe someone can come up with a "layman" explanation of it. – Thomas Jungblut Sep 15 '13 at 09:38
  • @nosid: Whilst the OP's code may not be realistic, it is clearly triggering an interesting/unexpected behaviour in the JVM, which may have implications in other situations. I think it's valid to ask why this is happening. – Oliver Charlesworth Sep 15 '13 at 09:46
  • Note that the JVM must initialize the primitive array. It could be the case that with 65, some cache gets invalidated. It is likely that the behaviour differs on different hardware with different cache sizes. – Ingo Sep 15 '13 at 10:23
  • @Ingo The loop gets removed, that's clear. There is no cache exhaustion involved here. – Thomas Jungblut Sep 15 '13 at 10:29
  • @nosid You are using an array of arrays here, which creates the issue on its own (putting double[][] test = new double[0][0]; inside the loop will be slow as well). If you add "test[0]++;" to the loop I created, it cannot be optimized away, yet the issue remains the same. – Sipko Sep 15 '13 at 10:32
  • @ThomasJungblut I don't think the loop gets removed. You can add "int total" outside the loop and add "total += test[0];" to the example above. Then, printing the result, you will see that total = 100 million and it still runs in less than a second. – Sipko Sep 15 '13 at 10:35
  • @Sipko Then please paste the generated assembly code into your question (`-XX:+PrintAssembly`); otherwise this is just a guessing game in the dark. Or at least the hotspot compilation log (`-XX:+LogCompilation`). – Thomas Jungblut Sep 15 '13 at 11:29
  • @nosid Thank you for your clarifying answer. Concerning your comment on a real-world application: this test is derived from a script I wrote that analyses files containing >100 million lines, where each block of lines is sent off to a different thread for analysis, and the problem becomes exponentially bigger when using more threads. Each thread needs a number of variables that are declared for each line. A solution is to declare the variables outside all loops/functions and reuse them (especially when using functions that you did not write yourself), but I was hoping there would be a simpler solution. – Sipko Sep 15 '13 at 12:34
  • On-stack replacement is about replacing interpreted code with compiled code on the fly, not about replacing heap allocation with stack allocation. EliminateAllocationArraySizeLimit is the limit size of arrays that are considered scalar-replaceable in the escape analysis. So the main point, that the effect is due to compiler optimization, is correct, but it is due not to stack allocation but to the escape analysis phase failing to notice that the allocation is not needed. – kiheru Sep 15 '13 at 13:44
  • @Sipko: You are writing that the application does not scale with the number of threads. That's an indication that the problem is not related to the micro-optimizations you are asking about. I recommend looking at the big picture instead of the small parts. – nosid Sep 15 '13 at 13:45
  • @nosid Since allocating most of my objects outside the loops drastically decreased the issue in my actual code, I assumed the two are related. My initial question was a multi-threaded example: http://stackoverflow.com/questions/18786705/multi-threaded-object-creation-slower-then-in-a-single-thread, but this got people to focus on other issues that I do not think were the root of the problem. I slightly edited it so it is more in line with this example now. – Sipko Sep 15 '13 at 15:08
  • @nosid "The body has no effect. That means it makes no difference whether this statement is executed or not. The same is true for the whole loop." Why do you say that the array allocation has no effect? Can you please explain? – Geek Sep 19 '13 at 03:31
  • @Geek: I have tweaked the wording: the statement has no _observable behavior_, and *not* executing the statement does not change the observable behavior of the program. – nosid Sep 19 '13 at 17:45
2

There are any number of ways that object size can make a difference here.

As nosid stated, the JITC may be (most likely is) allocating small "local" objects on the stack, and the size cutoff for "small" arrays may be at 64 elements.

Allocating on the stack is significantly faster than allocating on the heap and, more to the point, the stack does not need to be garbage collected, so GC overhead is greatly reduced. (And for this test case, GC overhead is likely 80-90% of the total execution time.)
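
One way to observe the GC component directly is the flag the OP used in the comments; with double[65] it should log a long series of collections, while with double[64] (allocation eliminated) it should log none:

java -verbose:gc Tests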

Further, once the value is stack-allocated, the JITC can perform "dead code elimination": determine that the result of the new is never used anywhere and, after assuring there are no side effects that would be lost, eliminate the entire new operation and then the (now empty) loop itself.
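
Conceptually, the end result of that chain of optimizations looks like this (a sketch of the effect, not actual HotSpot output):

// what the source says:
for(int i = 0; i < job; i++){
    double[] test = new double[64]; // never read, never escapes
}
// what the JIT can effectively execute after scalar replacement and
// dead code elimination: nothing - the allocation is removed, the body
// becomes empty, and the empty loop itself is eliminated.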

Even if the JITC does not do stack allocation, it's entirely possible for objects smaller than a certain size to be allocated in the heap differently (e.g., from a different "space") than larger objects. (Normally this would not produce quite so dramatic timing differences, though.)

Hot Licks
  • Late to this thread. Why is allocating on the stack faster than allocating on the heap? According to a few articles, allocating on the heap takes ~12 instructions. There isn't much room for improvement. – Vortex Sep 26 '16 at 20:36
  • @Vortex - Allocating on the stack takes 1-2 instructions, and that's to allocate an entire stack frame. The stack frame must be allocated anyway to provide a register save area for the routine, so any other variables allocated at the same time are "free". And as I said, the stack requires no GC. The GC overhead for a heap item is far larger than the cost of the heap allocation operation itself. – Hot Licks Sep 26 '16 at 23:03