
I ran into this scenario in the process of developing a lightweight library for asynchronous messaging. Trying to get an idea of the cost of creating lots of medium-sized objects with short lifetimes, I wrote the following test:

import java.nio.ByteBuffer;
import java.util.Random;


public class MemPressureTest {
    static final int SIZE = 4096;
    static final class Bigish {
        final ByteBuffer b;


        public Bigish() {
            this(ByteBuffer.allocate(SIZE));
        }

        public Bigish(ByteBuffer b) {
            this.b = b;
        }

        public void fill(byte bt) {
            b.clear();
            for (int i = 0; i < SIZE; ++i) {
                b.put(bt);
            }
        }
    }


    public static void main(String[] args) {
        Random random = new Random(1);
        Bigish tmp = new Bigish();
        for (int i = 0; i < 3e7; ++i) {
            tmp.fill((byte)random.nextInt(255));
        }
    }
}

On my laptop, with default GC settings, it runs in about 95 seconds:

/tmp$ time java -Xlog:gc MemPressureTest
[0.006s][info][gc] Using G1

real    1m35.552s
user    1m33.658s
sys 0m0.428s

This is where things get strange. I tweaked the program to allocate a new object for each iteration:

...
        Random random = new Random(1);
        for (int i = 0; i < 3e7; ++i) {
            Bigish tmp = new Bigish();
            tmp.fill((byte)random.nextInt(255));
        }
...

In theory, this should add some small overhead, but none of the objects should ever be promoted out of Eden. At best, I'd expect the runtimes to be close to identical. However, this test completes in ~17 seconds:

/tmp$ time java -Xlog:gc MemPressureTest
[0.007s][info][gc] Using G1
[0.090s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 23M->1M(130M) 1.304ms
[0.181s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(130M) 0.870ms
[0.247s][info][gc] GC(2) Pause Young (Normal) (G1 Evacuation Pause) 76M->0M(130M) 0.844ms
[0.317s][info][gc] GC(3) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.793ms
[0.381s][info][gc] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.859ms
[lots of similar GC pauses, snipped for brevity]
[16.608s][info][gc] GC(482) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.765ms
[16.643s][info][gc] GC(483) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.580ms
[16.676s][info][gc] GC(484) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.841ms

real    0m16.766s
user    0m16.578s
sys 0m0.576s

I ran both versions several times, with near identical results to the above. I feel like I must be missing something very obvious. Am I going insane? What could explain this difference in performance?

=== EDIT ===

I rewrote the test using JMH as per apangin and dan1st's suggestions:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.nio.ByteBuffer;
import java.util.Random;


public class MemPressureTest {
    static final int SIZE = 4096;

    @State(Scope.Benchmark)
    public static class Bigish {
        final ByteBuffer b;
        private Blackhole blackhole;


        @Setup(Level.Trial)
        public void setup(Blackhole blackhole) {
            this.blackhole = blackhole;
        }

        public Bigish() {
            this.b = ByteBuffer.allocate(SIZE);
        }

        public void fill(byte bt) {
            b.clear();
            for (int i = 0; i < SIZE; ++i) {
                b.put(bt);
            }
            blackhole.consume(b);
        }
    }

    static Random random = new Random(1);


    @Benchmark
    public static void test1(Blackhole blackhole) {
        Bigish tmp = new Bigish();
        tmp.setup(blackhole);
        tmp.fill((byte)random.nextInt(255));
        blackhole.consume(tmp);
    }

    @Benchmark
    public static void test2(Bigish perm) {
        perm.fill((byte) random.nextInt(255));
    }
}

Still, the second test is much slower:

> Task :jmh
# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test1

# Run progress: 0.00% complete, ETA 00:16:40
# Fork: 1 of 5
# Warmup Iteration   1: 2183998.556 ops/s
# Warmup Iteration   2: 2281885.941 ops/s
# Warmup Iteration   3: 2239644.018 ops/s
# Warmup Iteration   4: 1608047.994 ops/s
# Warmup Iteration   5: 1992314.001 ops/s
Iteration   1: 2053657.571 ops/s
Iteration   2: 2054957.773 ops/s
Iteration   3: 2051595.233 ops/s
Iteration   4: 2054878.239 ops/s
Iteration   5: 2031111.214 ops/s

# Run progress: 10.00% complete, ETA 00:15:04
# Fork: 2 of 5
# Warmup Iteration   1: 2228594.345 ops/s
# Warmup Iteration   2: 2257983.355 ops/s
# Warmup Iteration   3: 2063130.244 ops/s
# Warmup Iteration   4: 1629084.669 ops/s
# Warmup Iteration   5: 2063018.496 ops/s
Iteration   1: 1939260.937 ops/s
Iteration   2: 1791414.018 ops/s
Iteration   3: 1914987.221 ops/s
Iteration   4: 1969484.898 ops/s
Iteration   5: 1891440.624 ops/s

# Run progress: 20.00% complete, ETA 00:13:23
# Fork: 3 of 5
# Warmup Iteration   1: 2228664.719 ops/s
# Warmup Iteration   2: 2263677.403 ops/s
# Warmup Iteration   3: 2237032.464 ops/s
# Warmup Iteration   4: 2040040.243 ops/s
# Warmup Iteration   5: 2038848.530 ops/s
Iteration   1: 2023934.952 ops/s
Iteration   2: 2041874.437 ops/s
Iteration   3: 2002858.770 ops/s
Iteration   4: 2039727.607 ops/s
Iteration   5: 2045827.670 ops/s

# Run progress: 30.00% complete, ETA 00:11:43
# Fork: 4 of 5
# Warmup Iteration   1: 2105430.688 ops/s
# Warmup Iteration   2: 2279387.762 ops/s
# Warmup Iteration   3: 2228346.691 ops/s
# Warmup Iteration   4: 1438607.183 ops/s
# Warmup Iteration   5: 2059319.745 ops/s
Iteration   1: 1112543.932 ops/s
Iteration   2: 1977077.976 ops/s
Iteration   3: 2040147.355 ops/s
Iteration   4: 1975766.032 ops/s
Iteration   5: 2003532.092 ops/s

# Run progress: 40.00% complete, ETA 00:10:02
# Fork: 5 of 5
# Warmup Iteration   1: 2240203.848 ops/s
# Warmup Iteration   2: 2245673.994 ops/s
# Warmup Iteration   3: 2096257.864 ops/s
# Warmup Iteration   4: 2046527.740 ops/s
# Warmup Iteration   5: 2050379.941 ops/s
Iteration   1: 2050691.989 ops/s
Iteration   2: 2057803.100 ops/s
Iteration   3: 2058634.766 ops/s
Iteration   4: 2060596.595 ops/s
Iteration   5: 2061282.107 ops/s


Result "com.xxx.MemPressureTest.test1":
  1972203.484 ±(99.9%) 142904.698 ops/s [Average]
  (min, avg, max) = (1112543.932, 1972203.484, 2061282.107), stdev = 190773.683
  CI (99.9%): [1829298.786, 2115108.182] (assumes normal distribution)


# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test2

# Run progress: 50.00% complete, ETA 00:08:22
# Fork: 1 of 5
# Warmup Iteration   1: 282751.407 ops/s
# Warmup Iteration   2: 283333.984 ops/s
# Warmup Iteration   3: 293785.079 ops/s
# Warmup Iteration   4: 268403.105 ops/s
# Warmup Iteration   5: 280054.277 ops/s
Iteration   1: 279093.118 ops/s
Iteration   2: 282782.996 ops/s
Iteration   3: 282688.921 ops/s
Iteration   4: 291578.963 ops/s
Iteration   5: 294835.777 ops/s

# Run progress: 60.00% complete, ETA 00:06:41
# Fork: 2 of 5
# Warmup Iteration   1: 283735.550 ops/s
# Warmup Iteration   2: 283536.547 ops/s
# Warmup Iteration   3: 294403.173 ops/s
# Warmup Iteration   4: 284161.042 ops/s
# Warmup Iteration   5: 281719.077 ops/s
Iteration   1: 276838.416 ops/s
Iteration   2: 284063.117 ops/s
Iteration   3: 282361.985 ops/s
Iteration   4: 289125.092 ops/s
Iteration   5: 294236.625 ops/s

# Run progress: 70.00% complete, ETA 00:05:01
# Fork: 3 of 5
# Warmup Iteration   1: 284567.336 ops/s
# Warmup Iteration   2: 283548.713 ops/s
# Warmup Iteration   3: 294317.511 ops/s
# Warmup Iteration   4: 283501.873 ops/s
# Warmup Iteration   5: 283691.306 ops/s
Iteration   1: 283462.749 ops/s
Iteration   2: 284120.587 ops/s
Iteration   3: 264878.952 ops/s
Iteration   4: 292681.168 ops/s
Iteration   5: 295279.759 ops/s

# Run progress: 80.00% complete, ETA 00:03:20
# Fork: 4 of 5
# Warmup Iteration   1: 284823.519 ops/s
# Warmup Iteration   2: 283913.207 ops/s
# Warmup Iteration   3: 294401.483 ops/s
# Warmup Iteration   4: 283998.027 ops/s
# Warmup Iteration   5: 283987.408 ops/s
Iteration   1: 278014.618 ops/s
Iteration   2: 283431.491 ops/s
Iteration   3: 284465.945 ops/s
Iteration   4: 293202.934 ops/s
Iteration   5: 290059.807 ops/s

# Run progress: 90.00% complete, ETA 00:01:40
# Fork: 5 of 5
# Warmup Iteration   1: 285598.809 ops/s
# Warmup Iteration   2: 284434.916 ops/s
# Warmup Iteration   3: 294355.547 ops/s
# Warmup Iteration   4: 284307.860 ops/s
# Warmup Iteration   5: 284297.362 ops/s
Iteration   1: 283676.043 ops/s
Iteration   2: 283609.750 ops/s
Iteration   3: 284575.124 ops/s
Iteration   4: 293564.269 ops/s
Iteration   5: 216267.883 ops/s


Result "com.xxx.MemPressureTest.test2":
  282755.844 ±(99.9%) 11599.112 ops/s [Average]
  (min, avg, max) = (216267.883, 282755.844, 295279.759), stdev = 15484.483
  CI (99.9%): [271156.731, 294354.956] (assumes normal distribution)

The JMH Blackhole should prevent code removal, and the fact that JMH is now in charge of running separate iterations should prevent parallelization, right? Shouldn't the Blackhole also stop the object from being confined to the stack? Also, wouldn't there be more variation between warmup iterations if HotSpot were still doing a significant amount of optimization?

735Tesla
  • Every single time people write a Java benchmark manually, they end up measuring an OSR compilation. I already showed many times why this is wrong: [1](https://stackoverflow.com/a/54410675/3448419), [2](https://stackoverflow.com/a/33858960/3448419), [3](https://stackoverflow.com/a/42590152/3448419), [4](https://stackoverflow.com/a/46637432/3448419). Just [don't write](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) a microbenchmark from scratch. Use [JMH](https://github.com/openjdk/jmh). – apangin Aug 03 '22 at 19:26
  • Thanks for rewriting, now the benchmark makes sense. Replace `b.put(bt);` with `b.put(i, bt);` and the results become expected. I'll write a detailed answer a bit later. In short, JIT can't optimize the updates of ByteBuffer's internal state in the second case, because the creation of that ByteBuffer and its backing array is not in the scope of the compilation. – apangin Aug 04 '22 at 03:39
  • As a practical consequence, `ByteBuffer.put` with an absolute index is almost always preferable to relative `put` from the performance perspective. – apangin Aug 04 '22 at 03:42
  • @apangin thanks! do you have any books on the topic you'd recommend? – 735Tesla Aug 04 '22 at 04:00

3 Answers


Creating a new ByteBuffer right before filling it indeed helps the JIT compiler produce better-optimized code when you use relative put methods, and here is why.

  1. The JIT compilation unit is a method. The HotSpot JVM does not perform whole-program optimization, which is quite hard even in theory due to the dynamic nature of Java and the open-world runtime environment.
  2. When the JVM compiles the test1 method, the buffer instantiation appears in the same compilation scope as the filling:
Bigish tmp = new Bigish();
tmp.setup(blackhole);
tmp.fill((byte)random.nextInt(255));

The JVM knows everything about the created buffer: its exact size and its backing array; it knows the buffer has not been published yet, so no other thread can see it. Therefore, the JVM can aggressively optimize the fill loop: vectorize it using AVX instructions and unroll it to set 512 bytes at a time:

  0x000001cdf60c9ae0:   mov    %r9d,%r8d
  0x000001cdf60c9ae3:   movslq %r8d,%r9
  0x000001cdf60c9ae6:   add    %r11,%r9
  0x000001cdf60c9ae9:   vmovdqu %ymm0,0x10(%rcx,%r9,1)
  0x000001cdf60c9af0:   vmovdqu %ymm0,0x30(%rcx,%r9,1)
  0x000001cdf60c9af7:   vmovdqu %ymm0,0x50(%rcx,%r9,1)
  0x000001cdf60c9afe:   vmovdqu %ymm0,0x70(%rcx,%r9,1)
  0x000001cdf60c9b05:   vmovdqu %ymm0,0x90(%rcx,%r9,1)
  0x000001cdf60c9b0f:   vmovdqu %ymm0,0xb0(%rcx,%r9,1)
  0x000001cdf60c9b19:   vmovdqu %ymm0,0xd0(%rcx,%r9,1)
  0x000001cdf60c9b23:   vmovdqu %ymm0,0xf0(%rcx,%r9,1)
  0x000001cdf60c9b2d:   vmovdqu %ymm0,0x110(%rcx,%r9,1)
  0x000001cdf60c9b37:   vmovdqu %ymm0,0x130(%rcx,%r9,1)
  0x000001cdf60c9b41:   vmovdqu %ymm0,0x150(%rcx,%r9,1)
  0x000001cdf60c9b4b:   vmovdqu %ymm0,0x170(%rcx,%r9,1)
  0x000001cdf60c9b55:   vmovdqu %ymm0,0x190(%rcx,%r9,1)
  0x000001cdf60c9b5f:   vmovdqu %ymm0,0x1b0(%rcx,%r9,1)
  0x000001cdf60c9b69:   vmovdqu %ymm0,0x1d0(%rcx,%r9,1)
  0x000001cdf60c9b73:   vmovdqu %ymm0,0x1f0(%rcx,%r9,1)
  0x000001cdf60c9b7d:   mov    %r8d,%r9d
  0x000001cdf60c9b80:   add    $0x200,%r9d
  0x000001cdf60c9b87:   cmp    %r10d,%r9d
  0x000001cdf60c9b8a:   jl     0x000001cdf60c9ae0
  3. You use the relative put method. It not only sets a byte in the ByteBuffer, but also updates the position field. Note that the vectorized loop above does not update position in memory: the JVM sets it just once after the loop, which it is allowed to do as long as nobody can observe an inconsistent state of the buffer.
  4. Now try to publish the ByteBuffer before filling:
Bigish tmp = new Bigish();
volatileField = tmp;  // publish
tmp.setup(blackhole);
tmp.fill((byte)random.nextInt(255));

The loop optimization breaks; now the array bytes are filled one by one, and the position field is incremented accordingly.

  0x000001829b18ca5c:   nopl   0x0(%rax)
  0x000001829b18ca60:   cmp    %r11d,%esi
  0x000001829b18ca63:   jge    0x000001829b18ce34           ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.nio.Buffer::nextPutIndex@10 (line 721)
                                                            ; - java.nio.HeapByteBuffer::put@6 (line 209)
                                                            ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca69:   mov    %esi,%ecx
  0x000001829b18ca6b:   add    %edx,%ecx                    ;*getfield position {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.nio.Buffer::nextPutIndex@1 (line 720)
                                                            ; - java.nio.HeapByteBuffer::put@6 (line 209)
                                                            ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca6d:   mov    %esi,%eax
  0x000001829b18ca6f:   inc    %eax                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - bench.MemPressureTest$Bigish::fill@26 (line 32)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca71:   mov    %eax,0x18(%r10)              ;*putfield position {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.nio.Buffer::nextPutIndex@25 (line 723)
                                                            ; - java.nio.HeapByteBuffer::put@6 (line 209)
                                                            ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca75:   cmp    %r8d,%ecx
  0x000001829b18ca78:   jae    0x000001829b18ce14
  0x000001829b18ca7e:   movslq %esi,%r9
  0x000001829b18ca81:   add    %r14,%r9
  0x000001829b18ca84:   mov    %bl,0x10(%rdi,%r9,1)         ;*bastore {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.nio.HeapByteBuffer::put@13 (line 209)
                                                            ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca89:   cmp    $0x1000,%eax
  0x000001829b18ca8f:   jge    0x000001829b18ca95           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - bench.MemPressureTest$Bigish::fill@14 (line 32)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca91:   mov    %eax,%esi
  0x000001829b18ca93:   jmp    0x000001829b18ca5c

That's exactly what happens in test2. Since the ByteBuffer object is external to the compilation scope, the JIT can't optimize it as freely as a local, not-yet-published object.

  5. Is it possible at all to optimize the fill loop in the case of an external buffer?

The good news is that it is possible. Just use the absolute put method instead of the relative one. In this case, the position field remains unchanged, and the JIT can easily vectorize the loop without any risk of breaking ByteBuffer invariants.

for (int i = 0; i < SIZE; ++i) {
    b.put(i, bt);
}

With this change, the loop is vectorized in both cases. Even better, test2 now becomes a lot faster than test1, proving that object creation indeed has a performance overhead.

Benchmark               Mode  Cnt      Score     Error   Units
MemPressureTest.test1  thrpt   10   2447,370 ± 146,804  ops/ms
MemPressureTest.test2  thrpt   10  15677,575 ± 136,075  ops/ms

Conclusion

  1. The counterintuitive performance difference was caused by the JVM's inability to vectorize the fill loop when the ByteBuffer object creation is not in the compilation scope.
  2. Prefer absolute get/put methods to relative ones where possible. Absolute methods are usually much faster, since they do not update the internal state of the ByteBuffer, and the JIT can apply more aggressive optimizations (see the sketch below).
  3. Object creation indeed has an overhead, as the modified benchmark shows.
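
To make point 2 concrete, here is a rough, self-contained sketch of the two put variants in a fill loop. The class and method names are my own illustration, not code from the answer or the benchmark:

import java.nio.ByteBuffer;

public class PutVariants {
    static final int SIZE = 4096;

    // Relative put: each call writes at the current position and advances it,
    // so the JIT has to keep the position field consistent unless it can prove
    // the buffer is invisible to other code.
    static void fillRelative(ByteBuffer b, byte bt) {
        b.clear();
        for (int i = 0; i < SIZE; ++i) {
            b.put(bt);
        }
    }

    // Absolute put: writes at an explicit index and leaves position untouched,
    // which lets the JIT vectorize the loop even for an externally created buffer.
    static void fillAbsolute(ByteBuffer b, byte bt) {
        for (int i = 0; i < SIZE; ++i) {
            b.put(i, bt);
        }
    }

    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.allocate(SIZE);
        fillRelative(b, (byte) 1);
        fillAbsolute(b, (byte) 2);
        System.out.println(b.get(0) + " " + b.get(SIZE - 1));
    }
}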
apangin
  • Those `vmovdqu` instructions are setting 256 *bits*, not 512 bytes. You're thinking of AVX-512, but even that would be 512 *bits* per instruction. – Cole Tobin Aug 09 '22 at 18:49
  • @ColeTobin I can calculate bits, thanks. The above code snippet has 16 `vmovdqu` instructions in a row, which together set 16*32 = 512 bytes per iteration. – apangin Aug 09 '22 at 20:07
  • Ah. The wording tripped me up. – Cole Tobin Aug 09 '22 at 23:45
  • 1
    Since none of the `ByteBuffer`’s fields is `volatile` and there’s no synchronization primitive involved after the `volatileField = tmp;` assignment, there is no real reason not to optimize the published buffer the same way as the purely local buffer. If another thread observes an inconsistent state, it’s the programmer’s fault and a legal JVM behavior according to the JMM. It would be a different thing if we were talking about eliminating the initial zero filling which happens before the assignment. – Holger Aug 23 '22 at 07:45

Disclaimer

The following are just theories; they might be completely wrong. I am neither a JIT nor a GC expert.

Code removal

I think that the JIT just optimized away (some of) your code. If that is the case, it detected that you are not actually using the stored values and just removed the code allocating/filling the object. Things like JMH's Blackhole might help you with that.
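
For illustration, a minimal JMH benchmark along those lines might look roughly like this (a sketch assuming JMH is on the classpath; the class and method names are made up):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

import java.nio.ByteBuffer;

public class FillBenchmark {
    static final int SIZE = 4096;

    @Benchmark
    public void fillFreshBuffer(Blackhole bh) {
        ByteBuffer b = ByteBuffer.allocate(SIZE);
        for (int i = 0; i < SIZE; ++i) {
            b.put((byte) 1);
        }
        bh.consume(b); // consuming the buffer keeps the JIT from discarding the work
    }
}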

Parallelization

It could also be the case that it parallelized your code. Since different loop iterations are independent of each other, it is possible to execute multiple iterations in parallel.

Stack allocations

Another possibility is that it detected that the object is stack-confined, has a very narrow scope, and is discarded immediately. Because of that, it might have moved your object to the stack, where it can be allocated/pushed and deallocated/popped quickly.
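
One way to probe this theory (my own suggestion, not something verified in this answer) is to rerun the original program with escape analysis disabled, since stack allocation/scalar replacement depends on it, and compare the timings:

/tmp$ time java -XX:-DoEscapeAnalysis -Xlog:gc MemPressureTest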

Closing thoughts

The JIT may always do unexpected things. Don't optimize prematurely and don't guess where your bottlenecks are. Measure your performance before and after any change you make. Performance might not get lost where you expect it to. This is the case in other languages as well, but especially in Java.

And, as apangin mentioned in the comments, you should really use JMH.

dan1st

Your original question and the edited JMH version are actually slightly different.

In the edited version, as @apangin mentioned, it is the perm state object, created outside the compiled benchmark method, that prevents the code from being optimized.

In your original question, it's because you forgot to warm up. Here's a modified version:

    public static void main(String[] args) {
        var t1 = System.currentTimeMillis();
        var warmup = Integer.parseInt(args[0]);
        for (int i = 0; i < warmup; i++) { test(1); }  // magic!!!
        test(1000000);
        var t2 = System.currentTimeMillis();
        System.out.println(t2 - t1);
    }
    
    private static void test(int n) {
        Random random = new Random(1);
        Bigish tmp = new Bigish();
        for (int i = 0; i < n; ++i) {
            tmp.fill((byte) random.nextInt(255));
        }
    }

It takes an int argument, warmup, to help the JVM decide which methods should be inlined.

On my machine, which is OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS) on Windows, when warmup is 8000, the output is unpredictable. It usually takes ~2.7 seconds, but occasionally it only takes 110 milliseconds.

When warmup is set to 8500, it almost always completes in 110~120 milliseconds.


You may also run with the -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining options to see how the JVM inlines the methods (a full example command line is shown after the listing below). If everything is fully inlined, you should be able to see something like

  @ 24   A$Bigish::<init> (11 bytes)   inline (hot)
    @ 4   java.nio.ByteBuffer::allocate (20 bytes)   inline (hot)
      @ 16   java.nio.HeapByteBuffer::<init> (21 bytes)   inline (hot)
        @ 10   java.nio.ByteBuffer::<init> (47 bytes)   inline (hot)
          @ 8   java.nio.Buffer::<init> (105 bytes)   inline (hot)
            @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
            @ 39   java.nio.ByteBuffer::limit (6 bytes)   inline (hot)
              @ 2   java.nio.ByteBuffer::limit (8 bytes)   inline (hot)
                @ 2   java.nio.Buffer::limit (65 bytes)   inline (hot)
            @ 45   java.nio.ByteBuffer::position (6 bytes)   inline (hot)
              @ 2   java.nio.ByteBuffer::position (8 bytes)   inline (hot)
                @ 2   java.nio.Buffer::position (52 bytes)   inline (hot)
          @ 17   java.nio.ByteOrder::nativeOrder (4 bytes)   inline (hot)
    @ 7   A$Bigish::<init> (10 bytes)   inline (hot)
      @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)

close to the bottom of the output.
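
For reference, a full invocation might look like the following (assuming the test class above is compiled as A, as the inlining output suggests, and using the 8500 warmup value mentioned earlier):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining A 8500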

Notice that only when the constructors of Bigish and ByteBuffer are fully inlined can the JVM prove that the underlying buffer will never be visible to another thread, which allows writes to the buffer to be safely vectorized.


BTW, this is yet another case that shows how tricky benchmarking is. Without digging into the details, it's difficult to tell which part is the real performance bottleneck. Even JMH can be misleading.

yyyy