
Question

Why is the use of fBuffer1 in the attached code example (SELECT_QUICK = true) twice as fast as the variant where fBuffer2 is resized only once at the beginning (SELECT_QUICK = false)?

The code paths are absolutely identical, but even after 10 minutes the throughput with fBuffer2 does not rise to the level of fBuffer1.

Background:

We have a generic data processing framework that collects thousands of Java primitive values in different subclasses (one subclass per primitive type). These values are stored internally in arrays, which we originally sized sufficiently large. To save heap memory, we have now switched these arrays to dynamic resizing (the arrays grow only if needed). As expected, this change has massively reduced the heap usage. On the other hand, however, performance has unfortunately degraded significantly: our processing jobs now take 2-3 times longer than before (e.g. 6 min instead of 2 min).
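As a rough illustration of the growth strategy described above, here is a minimal sketch of a dynamically growing primitive buffer; the class and method names are illustrative and not the framework's actual API:

```java
import java.util.Arrays;

// Minimal sketch of the growth strategy described above; the names are
// illustrative, not the framework's actual API.
final class GrowableLongBuffer {

    private long[] data = new long[16]; // deliberately small initial capacity
    private int size = 0;

    void add(long value) {
        if (size == data.length) {
            // Grow geometrically (like ArrayList) so adds stay amortized O(1);
            // copyOf preserves the existing contents.
            data = Arrays.copyOf(data, data.length << 1);
        }
        data[size++] = value;
    }

    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException(String.valueOf(index));
        }
        return data[index];
    }

    int size() { return size; }
}
```

This trades the constant capacity check on every add for a much smaller steady-state footprint, which is exactly the trade-off that triggered the slowdown described here.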

I have reduced our problem to a minimal working example, attached below. With SELECT_QUICK you can choose which buffer is used. I see the same effect with jdk-1.8.0_202-x64 as well as with openjdk-17.0.1-x64.

Buffer 1 (is not resized) shows the following numbers:

duration buf1: 8,890.551ms (8.9s)
duration buf1: 8,339.755ms (8.3s)
duration buf1: 8,620.633ms (8.6s)
duration buf1: 8,682.809ms (8.7s)
...

Buffer 2 (is resized exactly 1 time at the beginning) shows the following numbers:

make buffer 2 larger
duration buf2 (resized): 19,542.750ms (19.5s)
duration buf2 (resized): 22,423.529ms (22.4s)
duration buf2 (resized): 22,413.364ms (22.4s)
duration buf2 (resized): 22,219.383ms (22.2s)
...

I would really appreciate some hints on how I can change the code so that fBuffer2 (after resizing) works as fast as fBuffer1. The other way round (making fBuffer1 as slow as fBuffer2) is pretty easy. ;-) Since this problem sits in a framework-like component, I would prefer to change the code instead of tuning HotSpot with external arguments. But of course, suggestions in both directions are very welcome.

Source Code

import java.util.Locale;

public final class Collector {

    private static final boolean SELECT_QUICK = true;

    private static final long LOOP_COUNT = 50_000L;
    private static final int VALUE_COUNT = 150_000;
    private static final int BUFFER_LENGTH = 100_000;

    private final Buffer fBuffer = new Buffer();

    public void reset() {fBuffer.reset();}
    public void addValueBuf1(long val) {fBuffer.add1(val);}
    public void addValueBuf2(long val) {fBuffer.add2(val);}

    public static final class Buffer {

        private int fIdx = 0;
        private long[] fBuffer1 = new long[BUFFER_LENGTH * 2];
        private long[] fBuffer2 = new long[BUFFER_LENGTH];

        public void reset() {fIdx = 0;}

        public void add1(long value) {
            ensureRemainingBuf1(1);
            fBuffer1[fIdx++] = value;
        }

        public void add2(long value) {
            ensureRemainingBuf2(1);
            fBuffer2[fIdx++] = value;
        }

        private void ensureRemainingBuf1(int remaining) {
            if (remaining > fBuffer1.length - fIdx) {
                System.out.println("make buffer 1 larger");
                fBuffer1 = new long[(fIdx + remaining) << 1];
            }
        }

        private void ensureRemainingBuf2(int remaining) {
            if (remaining > fBuffer2.length - fIdx) {
                System.out.println("make buffer 2 larger");
                fBuffer2 = new long[(fIdx + remaining) << 1];
            }
        }

    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.ENGLISH);
        final Collector collector = new Collector();
        if (SELECT_QUICK) {
            while (true) {
                final long start = System.nanoTime();
                for (long j = 0L; j < LOOP_COUNT; j++) {
                    collector.reset();
                    for (int k = 0; k < VALUE_COUNT; k++) {
                        collector.addValueBuf1(k);
                    }
                }
                final long nanos = System.nanoTime() - start;
                System.out.printf("duration buf1: %1$,.3fms (%2$,.1fs)%n",
                    nanos / 1_000_000d, nanos / 1_000_000_000d);
            }
        } else {
            while (true) {
                final long start = System.nanoTime();
                for (long j = 0L; j < LOOP_COUNT; j++) {
                    collector.reset();
                    for (int k = 0; k < VALUE_COUNT; k++) {
                        collector.addValueBuf2(k);
                    }
                }
                final long nanos = System.nanoTime() - start;
                System.out.printf("duration buf2 (resized): %1$,.3fms (%2$,.1fs)%n",
                    nanos / 1_000_000d, nanos / 1_000_000_000d);
            }
        }
    }

}
foosmate
  • Is `addValueBuf1` the same as `add1`? – Andy Turner Feb 04 '22 at 13:12
  • Is it important that `collector.reset()` only resets `fIdx`, not the size of the array? – Andy Turner Feb 04 '22 at 13:20
  • Final point, have you considered saving valuable development time and used an existing primitive collection library which guarantees efficient performance? – Kayaman Feb 04 '22 at 13:24
  • The source code is not meant to do something useful. It is just a highly simplified minimal example to reproduce the problem. The original problem is located in our complex processing pipeline which is an integral part of one of our software components for geospatial traffic processing. Which of course we decided to develop because it doesn't exist like this. :-) – foosmate Feb 04 '22 at 14:29
  • But apart from the fact that we suffer strongly from this performance degradation, the observed effect seems weird: why does HotSpot not optimize the variant with ```fBuffer2``` (even after 1h the throughput is only half that of ```fBuffer1```)? – foosmate Feb 04 '22 at 14:30
  • @AndyTurner If you pull the code of ```addX()``` one level higher, then both variants perform equally bad. The reset only applies to ```fIdx``` since this can be seen as length indicator for the buffer. Furthermore the buffer itself should only grow and never shrink (except it is explicitly trimmed). It's a similar strategy as performed by ```ArrayList``` for instance. – foosmate Feb 04 '22 at 14:32
  • My guess is that the code gets heavily inlined, and the compiler can only eliminate array bounds checks when it can prove that the array length is guaranteed to be constant. – boneill Feb 04 '22 at 15:46
  • That's a good point. So you mean that as long as the array instance is not replaced, the JIT compiler may simplify the array access by skipping all further bounds checks. – foosmate Feb 04 '22 at 16:00
  • I did some further tests and re-assigned the ```fBuffer1``` field in line 56 with a variable length, but the massive performance difference is still there. ```collector.fBuffer.fBuffer1 = new long[BUFFER_LENGTH * 2 + (int) (System.currentTimeMillis() % 2)];``` – foosmate Feb 04 '22 at 16:08

1 Answer

JIT compilation in the HotSpot JVM 1) is based on runtime profile data, and 2) uses speculative optimizations.

Once a method is compiled at the maximum optimization level, HotSpot stops profiling this code, so it is never recompiled afterwards, no matter how long the code runs. (The exception is when the method needs to be deoptimized or unloaded, but that is not the case here.)

In the first case (SELECT_QUICK == true), the condition remaining > fBuffer1.length - fIdx is never met, and HotSpot JVM is aware of that from profiling data collected at lower tiers. So it speculatively hoists the check out of the loop, and compiles the loop body under the assumption that the array index is always within bounds. After the optimization, the loop is compiled like this (in pseudocode):

if (VALUE_COUNT > collector.fBuffer.fBuffer1.length) {
    uncommon_trap();
}
for (int k = 0; k < VALUE_COUNT; k++) {
    collector.fBuffer.fBuffer1[k] = k;  // no bounds check
}

In the second case (SELECT_QUICK == false), on the contrary, HotSpot knows that condition remaining > fBuffer2.length - fIdx is sometimes met, so it cannot eliminate the check.

Since fIdx is not the loop counter, HotSpot is apparently not smart enough to split the loop into two parts (with and without bounds check). However, you can help JIT compiler by splitting the loop manually:

for (long j = 0L; j < LOOP_COUNT; j++) {
    collector.reset();

    int fastCount = Math.min(collector.fBuffer.fBuffer2.length, VALUE_COUNT);
    for (int k = 0; k < fastCount; k++) {
        collector.addValueBuf2Fast(k);
    }

    for (int k = fastCount; k < VALUE_COUNT; k++) {
        collector.addValueBuf2(k);
    }
}

where addValueBuf2Fast inserts a value without bounds check:

    public void addValueBuf2Fast(long val) {fBuffer.add2Fast(val);}

    public static final class Buffer {
        ...
        public void add2Fast(long value) {
            fBuffer2[fIdx++] = value;
        }
    }
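For reference, the fragments above can be assembled into a self-contained sketch of the split-loop idea. Note one deliberate deviation: unlike the stripped-down benchmark, the resize here copies the old contents (Arrays.copyOf), so the result can be verified; the name SplitLoopDemo and the fillAndSum helper are illustrative additions:

```java
import java.util.Arrays;

// Self-contained sketch of the split-loop approach, assembled from the
// answer's fragments. Names outside the fragments are illustrative.
final class SplitLoopDemo {

    static final class Buffer {
        int fIdx = 0;
        long[] fBuffer2 = new long[100_000];

        // Checked path: grows the array when the capacity is exhausted.
        // (Unlike the benchmark, copyOf preserves the existing contents.)
        void add2(long value) {
            if (fIdx >= fBuffer2.length) {
                fBuffer2 = Arrays.copyOf(fBuffer2, (fIdx + 1) << 1);
            }
            fBuffer2[fIdx++] = value;
        }

        // Fast path: the caller guarantees remaining capacity, so no check.
        void add2Fast(long value) {
            fBuffer2[fIdx++] = value;
        }
    }

    // Fills the buffer with 0..valueCount-1 using the split loop, then
    // returns the sum of the stored values as a correctness check.
    static long fillAndSum(int valueCount) {
        Buffer buf = new Buffer();
        int fastCount = Math.min(buf.fBuffer2.length, valueCount);
        for (int k = 0; k < fastCount; k++) buf.add2Fast(k);      // no check
        for (int k = fastCount; k < valueCount; k++) buf.add2(k); // checked
        long sum = 0;
        for (int k = 0; k < valueCount; k++) sum += buf.fBuffer2[k];
        return sum;
    }

    public static void main(String[] args) {
        long expected = 150_000L * 149_999L / 2; // sum of 0..149,999
        if (fillAndSum(150_000) != expected) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The fast loop gives the JIT compiler a body it can prove free of resizes, while the checked loop handles the (rare) overflow beyond the current capacity.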

This should dramatically improve the performance of the loop:

make buffer 2 larger
duration buf2 (resized): 5,537.681ms (5.5s)
duration buf2 (resized): 5,461.519ms (5.5s)
duration buf2 (resized): 5,450.445ms (5.5s)
apangin
  • Thanks for this great answer. I'm sitting here thinking about how I can change the code according to your feedback. Unfortunately the submitting loop is just "simulating" the client code, which in fact is not under our control. Sometimes the incoming data is wildly distributed over different buffers and each buffer gets a submission of a single value. In other words, there is no "loop" at all. According to your feedback we need a separate hot path for value submission deep in the buffer where the size condition is never triggered. Do you see any other "API" modification ignoring the loop? – foosmate Feb 04 '22 at 16:52
  • @Mythos In this particular example the large performance difference is a result of the loop optimizations. If there is no similar loop in the production code, I can only conclude that the given benchmark does not reflect the actual behavior of the real code. – apangin Feb 04 '22 at 17:05
  • I updated the loop a little to submit a non-constant value, to prevent HotSpot from simply "filling" the array. Even if each value has to be submitted individually, the huge performance difference is still there. Maybe we can help HotSpot do a better job by splitting the methods and using an additional count-down threshold that switches between two paths, where only one performs the final length check? Or does it make sense to have some kind of "array holder" that can be switched between a checked and an unchecked version? I'm just brainstorming ... – foosmate Feb 04 '22 at 17:18
  • I guess permanently skipping the check but catching the IndexOutOfBoundsException (which in turn triggers the array resize) is a bad idea. – foosmate Feb 04 '22 at 17:25
  • @Mythos I've pointed out the performance issue with your posted benchmark and suggested a way to speed it up. If this approach cannot be applied to your original problem, it means your benchmark is *not* about the original problem, so any further speculation about this benchmark will be likely wrong. – apangin Feb 04 '22 at 18:54
  • I just did some review of the client code where the performance difference heavily occurs. In fact there is always some kind of loop that pushes the data into one (or more) pipelines. So I am very sure that the benchmark matches the problem and you also explained the reason very well. But since we don't have access to the client code (to change the loop), the question is whether we can somehow make the API more "loop-friendly" to assist Hotspot better. – foosmate Feb 04 '22 at 19:11
  • BTW: The shown benchmark was mainly created by just deleting existing code from our original source code until the problem was minimized. So there was only glue code to be implemented "in addition". My first ideas (I mentioned above Counter, Exception, Holder ...) I have just tried and unfortunately could not achieve any effect. So if you have another starting point, how we can offer Hotspot a fast path on the API, then I would be very happy. In any case, a big thank you for the assistance and quick reaction. – foosmate Feb 04 '22 at 19:17