I have noticed many times that trivial, seemingly unrelated code changes can alter the performance characteristics of a piece of Java code, sometimes dramatically.

This happens in both JMH and hand-rolled benchmarks.

For example, in a class like this:

class Class<T> implements Interface {
    private final Type field;

    Class(Builder builder) {
        field = builder.getField();
    }

    @Override
    public void method() { /* ... */ }
}

I made this code change:

class Class<T> implements Interface {
    private static Class<?> instance;

    private final Type field;

    Class(Builder builder) {
        instance = this;
        field = builder.getField();
    }

    @Override
    public void method() { /* ... */ }
}

and performance changed dramatically.

This is just one example. There are other cases where I noticed the same thing.

I cannot determine what causes this. I searched the web, but found no information.

To me, it looks totally uncontrollable. Maybe it has to do with how the compiled code is laid out in memory?

I do not think it is due to false sharing (see below).


I'm developing a spinlock:

import jdk.internal.vm.annotation.Contended;

class SpinLock {
    // @Contended needs the compiler option --add-exports java.base/jdk.internal.vm.annotation=<module-name>
    // (if the project is not modular, <module-name> is 'ALL-UNNAMED') and the VM option -XX:-RestrictContended.
    @Contended
    private final AtomicBoolean state = new AtomicBoolean();

    void lock() {
        while (state.getAcquireAndSetPlain(true)) {
            while (state.getPlain()) { // With x86 PAUSE we avoid opaque load
                Thread.onSpinWait();
            }
        }
    }

    void unlock() {
        state.setRelease(false);
    }
}

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class AtomicBoolean {
    private static final VarHandle VALUE;

    static {
        try {
            VALUE = MethodHandles.lookup().findVarHandle(AtomicBoolean.class, "value", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private boolean value;

    public boolean getPlain() {
        return value;
    }

    public boolean getAcquireAndSetPlain(boolean value) {
        return (boolean) VALUE.getAndSetAcquire(this, value);
    }

    public void setRelease(boolean value) {
        VALUE.setRelease(this, value);
    }
}

My hand-rolled benchmark reported 171.26 ns ± 43%, and a JMH benchmark reported avgt 5 265.970 ± 27.712 ns/op.
When I change it like this:

class SpinLock {
    @Contended
    private final AtomicBoolean state = new AtomicBoolean();
    private final NoopBusyWaitStrategy busyWaitStrategy;

    SpinLock() {
        this(new NoopBusyWaitStrategy());
    }

    SpinLock(NoopBusyWaitStrategy busyWaitStrategy) {
        this.busyWaitStrategy = busyWaitStrategy;
    }

    void lock() {
        while (state.getAcquireAndSetPlain(true)) {
            busyWaitStrategy.reset(); // Will be inlined
            while (state.getPlain()) {
                Thread.onSpinWait();
                busyWaitStrategy.tick(); // Will be inlined
            }
        }
    }

    void unlock() {
        state.setRelease(false);
    }
}

class NoopBusyWaitStrategy {
    void reset() {}

    void tick() {}
}

My hand-rolled benchmark reported 184.24 ns ± 48%, and a JMH benchmark reported avgt 5 291.285 ± 20.860 ns/op. Even though the two benchmarks report different numbers, both show an increase.
This is the JMH benchmark:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

public class SpinLockBenchmark {
    @State(Scope.Benchmark)
    public static class BenchmarkState {
        final SpinLock lock = new SpinLock();
    }

    @Benchmark
    @Fork(value = 1, warmups = 1, jvmArgsAppend = {"-Xms8g", "-Xmx8g", "-XX:+AlwaysPreTouch", "-XX:+UnlockExperimentalVMOptions", "-XX:+UseEpsilonGC", "-XX:-RestrictContended"})
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    @Threads(6)
    public void run(BenchmarkState state) {
        state.lock.lock();
        state.lock.unlock();
    }
}

Do you have any ideas?
Does it happen with languages without a runtime, too?

spongebob
  • They change dramatically on _your_ framework, correct? If so, you already know where to look. – Eugene Jul 25 '20 at 10:13
  • @Eugene No. I measured with JMH and noticed the same difference in performance. I did not include an MCVE because, since I do not know how to reproduce it, I would have to include the code of a project I'm working on, which is too large and too specialized. Moreover, it happened many times with different pieces of code. – spongebob Jul 25 '20 at 14:22
  • https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – PeterMmm Jul 25 '20 at 14:35
  • `V` in MCVE stands for "verifiable", but your example does not compile. There is no `getAcquireAndSetPlain` method in AtomicBoolean, and the `@Contended` annotation is not available to user code. – apangin Jul 25 '20 at 21:58
  • @apangin I've clarified my example. For `@Contended` to be available to user code, both a compiler and a VM option need to be added. – spongebob Jul 25 '20 at 22:39

1 Answer


Your "trivial" changes appear not that trivial.

You added a busyWaitStrategy.tick() call in a hot loop, which results in an extra load, a comparison, and a [not-taken] conditional branch.

Even though the method does nothing, the JLS requires a NullPointerException to be thrown when a method is invoked on a null object. So the JVM needs to load the field and check whether it is null. Although the field is declared final, the HotSpot JVM does not treat it as a constant. And the field load is not hoisted out of the loop, because Thread.onSpinWait serves as a membar (see the discussion thread).
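
For illustration, here is a sketch of a variant that copies the field into a local variable, so the load and the null check can happen once before the hot loop rather than on every iteration (whether the check is actually eliminated should be verified with -prof perfasm):

    void lock() {
        // Hypothetical variant: read the final field once into a local.
        // The loop body then uses the local, so the field is not re-loaded
        // past the onSpinWait membar on each iteration.
        NoopBusyWaitStrategy strategy = busyWaitStrategy;
        while (state.getAcquireAndSetPlain(true)) {
            strategy.reset();
            while (state.getPlain()) {
                Thread.onSpinWait();
                strategy.tick();
            }
        }
    }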

With the help of -XX:+PrintAssembly, we can indeed see this null check in the compiled code (the two instructions marked with >>):

    pause                     ;*invokestatic onSpinWait {reexecute=0 rethrow=0 return_oop=0}
                              ; - bench.SpinLock::lock@28 (line 23)
                              ; - bench.SpinLockBenchmark::run@4 (line 17)
                              ; - bench.generated.SpinLockBenchmark_run_jmhTest::run_avgt_jmhStub@17
 >> cmp     r12d,dword ptr [r10+10h]
 >> je      1a01286f719h      ;*invokevirtual tick {reexecute=0 rethrow=0 return_oop=0}
                              ; - bench.SpinLock::lock@35 (line 24)
                              ; - bench.SpinLockBenchmark::run@4 (line 17)
                              ; - bench.generated.SpinLockBenchmark_run_jmhTest::run_avgt_jmhStub@17
    mov     r8d,dword ptr [r10+0ch]  ;*getfield state {reexecute=0 rethrow=0 return_oop=0}
                              ; - bench.SpinLock::lock@19 (line 22)
                              ; - bench.SpinLockBenchmark::run@4 (line 17)
                              ; - bench.generated.SpinLockBenchmark_run_jmhTest::run_avgt_jmhStub@17

Also, the @Contended annotation seems to be misused. As far as I understand the code, the intention was to protect the AtomicBoolean object from false sharing, not the reference to it. Therefore it makes more sense to mark the AtomicBoolean.value field, or the entire AtomicBoolean class, as @Contended.
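
A minimal sketch of that placement, using the custom AtomicBoolean class from the question:

import jdk.internal.vm.annotation.Contended;

// Padding the class keeps the 'value' field that the spinning threads
// hammer on from sharing a cache line with unrelated data; annotating
// only the 'value' field would achieve the same.
@Contended
class AtomicBoolean {
    private boolean value;
    // ... VarHandle accessors as in the question ...
}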

For investigating microbenchmark results, I recommend using JMH's built-in profilers: -prof perfasm and -prof perfnorm (by the way, that's another reason to prefer JMH over hand-rolled frameworks). perfasm shows the assembly code, i.e. the particular instructions that took the most CPU cycles. perfnorm outputs performance counter statistics such as instructions per cycle, cache misses, mispredicted branches, etc.
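
If you launch JMH programmatically rather than from the command line, the profilers can also be attached through the Runner API, roughly like this (the benchmark class name is taken from the question):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class SpinLockBenchmarkRunner {
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(SpinLockBenchmark.class.getSimpleName())
                .addProfiler("perfasm")   // hottest assembly instructions
                .addProfiler("perfnorm")  // normalized hardware counters
                .build();
        new Runner(opt).run();
    }
}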

apangin
  • Thank you, I've read the thread from start to end, but it makes limited sense to me. It surely looks like they are pointing to the fact that it acts as a load barrier, but I just can't make sense of why that is needed. – Eugene Jul 26 '20 at 03:10
  • Your explanation is interesting, and you seem to be right with regard to `@Contended`. Unfortunately, over multiple tests I keep getting different results. For instance, when I removed the calls to `reset()` and `tick()`, sometimes it was faster, sometimes it was slower (using JMH)... that is, trivial changes impact performance in unpredictable ways, and sometimes results change even between runs of the same test executed some time apart. Do you have any ideas why this happens? Should I instead base my judgments on hardware stats like cache misses, branch misses, etc.? – spongebob Jul 26 '20 at 16:35
  • @FrancescoMenzani Your benchmark is inherently non-deterministic, since it exploits inter-thread and inter-CPU races. Add the JVM activities that are also non-deterministic and highly dependent on timing (e.g. background JIT compilation), and also OS background tasks. Not to mention microarchitectural aspects like hyperthreading, CPU throttling, dynamic frequency scaling, etc. Despite all this variance, you configured JMH to run just one fork with a few iterations, which makes the scores almost random. – apangin Jul 26 '20 at 17:21
  • If you really want to get stable measurements of nanosecond latencies, use a dedicated server with no other tasks running, disable hyperthreading and CPU power management, and run the benchmark in at least 10 forks with a long enough warmup and enough iterations (see the sketch after these comments). – apangin Jul 26 '20 at 17:26
  • When I noticed that trivial changes impacted performance sometimes dramatically, I had run the benchmark multiple times and gotten the same results. – spongebob Jul 26 '20 at 19:44
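
For reference, a sketch of the more robust JMH configuration suggested in the comments above (the fork, warmup, and measurement counts are illustrative, not values from the original benchmark):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

public class SpinLockBenchmark {
    @State(Scope.Benchmark)
    public static class BenchmarkState {
        final SpinLock lock = new SpinLock();
    }

    @Benchmark
    // Many forks and iterations average out JIT, OS, and hardware noise.
    @Fork(value = 10, warmups = 1, jvmArgsAppend = {"-Xms8g", "-Xmx8g", "-XX:+AlwaysPreTouch", "-XX:+UnlockExperimentalVMOptions", "-XX:+UseEpsilonGC", "-XX:-RestrictContended"})
    @Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    @Threads(6)
    public void run(BenchmarkState state) {
        state.lock.lock();
        state.lock.unlock();
    }
}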