CAS vs synchronized performance

Question

I've had this question for quite a while now, trying to read lots of resources and understanding what is going on - but I've still failed to get a good understanding of why things are the way they are.

Simply put I'm trying to test how a CAS would perform vs synchronized in contended and not environments. I've put up this JMH test:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class SandBox {

    Object lock = new Object();

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(SandBox.class.getSimpleName())
                .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                .shouldFailOnError(true)
                .build();
        new Runner(opt).run();
    }

    @State(Scope.Thread)
    public static class Holder {

        private long number;

        private AtomicLong atomicLong;

        @Setup
        public void setUp() {
            number = ThreadLocalRandom.current().nextLong();
            atomicLong = new AtomicLong(number);
        }
    }

    @Fork(1)
    @Benchmark
    public long sync(Holder holder) {
        long n = holder.number;
        synchronized (lock) {
            n = n * 123;
        }

        return n;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong cas(Holder holder) {
        AtomicLong al = holder.atomicLong;
        al.updateAndGet(x -> x * 123);
        return al;
    }

    private Object anotherLock = new Object();

    private long anotherNumber = ThreadLocalRandom.current().nextLong();

    private AtomicLong anotherAl = new AtomicLong(anotherNumber);

    @Fork(1)
    @Benchmark
    public long syncShared() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }

        return anotherNumber;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong casShared() {
        anotherAl.updateAndGet(x -> x * 123);
        return anotherAl;
    }

    @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
    @Benchmark
    public long syncSharedNonBiased() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }

        return anotherNumber;
    }

}

And the results:

Benchmark                                           Mode  Cnt     Score      Error  Units
spinLockVsSynchronized.SandBox.cas                  avgt    5   212.922 ±   18.011  ns/op
spinLockVsSynchronized.SandBox.casShared            avgt    5  4106.764 ± 1233.108  ns/op
spinLockVsSynchronized.SandBox.sync                 avgt    5  2869.664 ±  231.482  ns/op
spinLockVsSynchronized.SandBox.syncShared           avgt    5  2414.177 ±   85.022  ns/op
spinLockVsSynchronized.SandBox.syncSharedNonBiased  avgt    5  2696.102 ±  279.734  ns/op

In the non-shared case CASis by far faster, which I would expect. But in shared case, things are the other way around - and this I can't understand. I don't think this is related to biased locking, as that would happen after a threads holds the lock for 5 seconds (AFAIK) and this does not happen and the test is just proof of that.

I honestly hope it's just my tests that are wrong, and someone having jmh expertise would come along and just point me to the wrong set-up here.

casShared is at least two volatile mem-ops (one for updateAndGet, one for get), syncShared is a single lock op. Also, why Holder param on shared benchmarks? — Oleg Estekhin, Sep 07 '17 at 09:56
I think you want not just contended vs uncontended, but also an in between case. Heavy contention should show synchronized doing better. — Nathan Hughes, Sep 07 '17 at 11:10
@OlegEstekhin thank you, you are correct indeed... I've up voted two of your other answers to show my appreciation for your time here. — Eugene, Sep 07 '17 at 11:26
@NathanHughes indeed... the results actually prove you right; at this point I don't entirely understand why all the internals of jdk's atomic/concurrent are done with `CAS`. they assume less contention? Or a finer-grained lock that could be taken for far less time? — Eugene, Sep 07 '17 at 11:28
Wouldn’t you have to create additional threads to test the *contended* case? Ok, I see in your answer that you were using `@Threads(50)` in at least one test, but this information is missing in the question. It’s not clear what number of threads were used for the results posted in the question. — Holger, Sep 07 '17 at 11:48
I’m not sure whether this is intentional or a fundamental error in your testing, but there is no actual difference between `sync` and `syncShared`as in both cases, you are synchronizing on a shared object. It only differs in the locality of the variable you’re updating, but, of course, it makes no sense to acquire a global lock to update an unshared variable. — Holger, Sep 07 '17 at 11:59
@Holger `sync`is taking a `Holder` as input; that is turn is annotated with `@State(Scope.Thread)` - each thread that runs the test will have it's own copy of that `Holder`. I think this explains it better : http://hg.openjdk.java.net/code-tools/jmh/file/b46a93657b82/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_03_States.java — Eugene, Sep 07 '17 at 12:30
But you are synchronizing on `lock`, which is an instance variable of `SandBox`, containing a shared object, rather than the local `Holder` object. — Holger, Sep 07 '17 at 12:31
@Holger crap! I've done so much editing that I've confused myself with these tests, will update... — Eugene, Sep 07 '17 at 12:36
The point of the `sync` test eludes me. Unlike `cas[Shared]`, it reads a shared value, but it does not publish a new one, and the read of the initial value isn't even guarded by the lock. Does `holder.number` even get updated between invocations during the same trial? — Mike Strobel, Sep 07 '17 at 16:22
@MikeStrobel this question is just plain awful, I admit. I could not stop my anxiety to ask this and get an explanation which is in general a flaw I have. I will update it accordingly with proper tests (I hope) and proper reasoning that hopefully will make more sense. I will tag everyone here interested when Im done. — Eugene, Sep 07 '17 at 19:18
@Eugene What is the reasoning behind long n = holder.number; and AtomicLong al = holder.atomicLong; prior to actual operations. Any particular reasons for this? Also the return value, is necessary? I am just saying that can affect the tests. What is the contention details? How many threads? Not familiar with this testing library. Are these all run as is? Fail to see the difference in setup between cas and casshared — mjs, Sep 07 '17 at 20:01
@Eugene create two separate tests, so it's clear to everyone thart one is contentended and the other not. — mjs, Sep 07 '17 at 20:04

score 38 · Accepted Answer · answered Sep 07 '17 at 15:07

The main misconception is the assumption that you are comparing “CAS vs. synchronized”. Given, how sophisticated JVMs implement synchronized, you are comparing the performance of a CAS-based algorithm using AtomicLong with the performance of the CAS-based algorithm used to implement synchronized.

Similar to Lock, the internal information for an object monitor basically consist of an int status telling whether it has been owned and how often it is nested, a reference to the current owner thread and a queue of threads waiting to be able to acquire it. The expensive aspect is the waiting queue. Putting a thread into the queue, removing it from thread scheduling, and eventually waking it up when the current owner releases the monitor, are operations that can take a significant time.

However, in the uncontended case, the waiting queue is, of course, not involved. Acquiring the monitor consist of a single CAS to change the status from “unowned” (usually zero) to “owned, acquired once” (guess the typical value). If successful, the thread can proceed with the critical action, followed by a release which implies just writing the “unowned” state with the necessary memory visibility and waking up another blocked thread, if there is one.

Since the wait queue is the significantly more expensive thing, implementations usually try to avoid it even in the contended case by performing some amount of spinning, making several repeated CAS attempts before falling back to enqueuing the thread. If the critical action of the owner is as simple as a single multiplication, chances are high that the monitor will be released during the spinning phase already. Note that synchronized is “unfair”, allowing a spinning thread to proceed immediately, even if there are already enqueued threads waiting far longer.

If you compare the fundamental operations performed by the synchronized(lock){ n = n * 123; } when no queuing is involved and by al.updateAndGet(x -> x * 123);, you’ll notice that they are roughly on par. The main difference is that the AtomicLong approach will repeat the multiplication on contention while for the synchronized approach, there is a risk of being put into the queue if no progress has been made during spinning.

But synchronized allows lock coarsening for code repeatedly synchronizing on the same object, which might be relevant for a benchmark loop calling the syncShared method. Unless there’s also a way to fuse multiple CAS updates of an AtomicLong, this can give synchronized a dramatic advantage. (See also this article covering several aspects discussed above)

Note that due to the “unfair” nature of synchronized, creating far more threads than CPU cores doesn’t have to be a problem. In the best case, “number of threads minus number of cores” threads end up on the queue, never waking up, while the remaining threads succeed in the spinning phase, one thread on each core. But likewise, threads not running on a CPU core can’t reduce the performance of the AtomicLong update as they can neither, invalidate the current value to other threads nor make a failed CAS attempt.

In either case, when CASing on the member variable of an unshared object or when performing synchronized on an unshared object, the JVM may detect the local nature of the operation and elide most of the associated costs. But this may depend on several subtle environmental aspects.

The bottom line is that there is no easy decision between atomic updates and synchronized blocks. Things get far more interesting with more expensive operations, which may raise the likelihood of threads getting enqueued in the contended case of synchronized, which may make it acceptable that the operation has to be repeated in the contended case of an atomic update.

I would have unwisely assumed that JMH would prevent inlining of benchmark methods by default, which would (I think?) prevent lock coarsening. It would appear that is not the case. Good catch, and excellent answer. — Mike Strobel, Sep 07 '17 at 17:57
@Holger I could just accept it now, but I really want to get it straight. Due to my anxiety, I keep making this mistake over and over again, writing things fast without checking. I will update the question and if you don't mind I *might* have so follow-up questions. — Eugene, Sep 07 '17 at 19:22
@MikeStrobel we *can* force the non-inlining in `jmh`, I just did not think about it. But I can't see why that would be needed at all - since every benchmark is run in separate process, via `@Fork`... — Eugene, Sep 07 '17 at 19:37
@Eugene Yes, but for each iteration, the individual _invocations_ are run in a tight loop. Your benchmark method may get inlined by its caller, or its caller's caller, etc. — Mike Strobel, Sep 07 '17 at 19:54
This viewpoint answers it: "Threads not running on a CPU core can’t reduce the performance of the AtomicLong update as they can neither, invalidate the current value to other threads nor make a failed CAS attempt." meaning major contention in an environment where this operation is not the only operation being performed making it unlikely to encounter repeated while loop iterations. In an environment like the benchmark here, where this is the only operation, things could "get out of hand". — mjs, Sep 08 '17 at 20:24

Mike Strobel · Answer 2 · 2017-09-07T20:17:44.990

You should read, re-read, and accept @Holger's excellent answer, as the insights it provides are far more valuable than a single set of benchmark numbers from one developer's workstation.

I tweaked your benchmarks to make them a bit more apples-to-apples, but if you read @Holger's answer, you'll see why this isn't a terribly useful test. I'm going to include my changes and my results simply to show how results can vary from one machine (or one JRE version) to another.

First, my version of benchmarks:

@State(Scope.Benchmark)
public class SandBox {
    public static void main(String[] args) throws RunnerException {
        new Runner(
            new OptionsBuilder().include(SandBox.class.getSimpleName())
                                .shouldFailOnError(true)
                                .mode(Mode.AverageTime)
                                .timeUnit(TimeUnit.NANOSECONDS)
                                .warmupIterations(5)
                                .warmupTime(TimeValue.seconds(5))
                                .measurementIterations(5)
                                .measurementTime(TimeValue.seconds(5))
                                .threads(-1)
                                .build()
        ).run();
    }

    private long number = 0xCAFEBABECAFED00DL;
    private final Object lock = new Object();
    private final AtomicLong atomicNumber = new AtomicLong(number);

    @Setup(Level.Iteration)
    public void setUp() {
        number = 0xCAFEBABECAFED00DL;
        atomicNumber.set(number);
    }

    @Fork(1)
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long casShared() {
        return atomicNumber.updateAndGet(x -> x * 123L);
    }

    @Fork(1)
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long syncShared() {
        synchronized (lock) {
            return number *= 123L;
        }
    }

    @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long syncSharedNonBiased() {
        synchronized (lock) {
            return number *= 123L;
        }
    }
}

And then my first batch of results:

# VM version: JDK 1.8.0_60, VM 25.60-b23

Benchmark                    Mode  Cnt     Score     Error  Units
SandBox.casShared            avgt    5   976.215 ± 167.865  ns/op
SandBox.syncShared           avgt    5  1820.554 ±  91.883  ns/op
SandBox.syncSharedNonBiased  avgt    5  1996.305 ± 124.681  ns/op

Recall that you saw synchronized coming out ahead under high contention. On my workstation, the atomic version fared better. If you use my version of your benchmarks, what results do you see? It won't surprise me in the least if they're substantially different.

Here's another set run under a months-old Java 9 EA release:

# VM version: JDK 9-ea, VM 9-ea+170

Benchmark                    Mode  Cnt     Score     Error  Units
SandBox.casShared            avgt    5   979.615 ± 135.495  ns/op
SandBox.syncShared           avgt    5  1426.042 ±  52.971  ns/op
SandBox.syncSharedNonBiased  avgt    5  1649.868 ±  48.410  ns/op

The difference is less dramatic here. It's not terribly unusual to see a difference across major JRE versions, but who's to say you won't see them across minor releases too?

At the end of the day, the results are close. Very close. The performance of synchronized has come a long way since the early Java versions. If you are not writing HFT algorithms or something else that's incredibly latency sensitive, you should prefer the solution that's most easily proven correct. It is generally easier to reason about synchronized than lock-free algorithms and data structures. If you cannot demonstrate a measurable difference in your application, then synchronized is what you should use.

let me put it another way... there is *not much* in Holger's answer that I did not already knew (it does not make it less excellent in any way), except some details in thread scheduling. My ultimate question was so different that you would not believe me know, I got carried away and made it terrible. My question was *why* would I ever not want to use `synchronized` over a spin-lock let's say? as such I hoped to prove that there would a diff between them so visible that I could simply conclude for myself that CAS *is* better — Eugene, Sep 07 '17 at 20:10
I added a final paragraph that directly addresses that question. — Mike Strobel, Sep 07 '17 at 20:18
thank you for the code, but i have a problem with it; that `DONT_INLINE` is probably doing nothing; a `MaxInlineLevel` of a default to 9 would be hit really fast. I could easily check, but may be tomorrow. Still thank you for taking the time to answer this. — Eugene, Sep 07 '17 at 20:22
You can't assume anything from `MaxInlineLevel`. You don't know what your relative inlining "root" is. There are plenty of reasons why the call sites above you might not be inlined. For example: (1) they aren't called enough times to break out of interpreted mode; (2) they're too large to fit within the 'regular' inlining threshold; (3) they aren't called often enough to hit the 'frequently called' inlining threshold; (4) they were already compiled into a medium- or large-sized method by the time they were eligible for inlining; (5) the max inlining depth was already hit; etc. — Mike Strobel, Sep 07 '17 at 20:46
@Eugene:if your application would do cas-only on a variable, it wouldn't be much useful. It becomes interesting when also reading the result occasionally and cas can be combined with other threads doing an atomic read on the same variable. And atomic reads may perform much better than `synchronized`, especially when you have lots of concurrent reads. Still, it is not easy. Java 8’s `ConcurrentHashMap` uses atomic reads, cas *and* `synchronized`, depending on the operation. — Holger, Sep 07 '17 at 23:15
@MikeStrobel exactly, I was just giving an example via `MaxInlineLevel` that `DONT_INLINE` might not do anything useful in the tests. — Eugene, Sep 08 '17 at 06:42
@Holger `Still, it is not easy`... I feel like a total idiot not understanding these things no matter how much I read or try. You've been very helpful as usual. thank you — Eugene, Sep 08 '17 at 07:12
@MikeStrobel don't get me wrong, I am not versatile enough to make absolute judgments about this; I am very thankful for you insight and answer. — Eugene, Sep 08 '17 at 07:18
@Eugene: no need to “feel like a total idiot”, as this *is* a complex specialized topic, and even worse, old wisdom gets outdated over time. While the Internet is still full of obsolete information or even tenacious myths… — Holger, Sep 08 '17 at 07:56
@Eugene I have nothing but admiration for your desire to understand these low-level details, and nothing I've posted was meant to discourage you. I hope it hasn't come across that way. By all means, chase knowledge, and chase it through experimentation. The more you do, the more you will appreciate just how difficult these things are to measure, and how useless some measurements are in the context of far more complex programs where high-level design decisions have a much greater impact. The phrase "premature optimization is the root of all evil" is well traveled for a reason ;). — Mike Strobel, Sep 08 '17 at 11:51
@MikeStrobel threads(-1) is "use all hardware threads"? How many on your box? — igaz, Sep 10 '17 at 18:07

score 3 · Answer 3 · answered Sep 07 '17 at 17:18

Note that CAS can give you more fine-grained ordering (non-)guarantees than a synchronized block, especially with java-9 varhandles which provide ordering options aligned with the C++11 memory model.

If all you want to do is some statistics-keeping from multiple threads then a read-compute-update loop with the most relaxed memory orderings available (plain read; plain and weak CAS) may perform better on weakly ordered platforms since it won't need any barriers and the cas won't have to do wasteful internal looping if it's implemented on top of LL/SC. Additionally it will also give the JITs more freedom to reorder the instructions around those atomics. compareAndExchange can eliminate an additional read on loop repetition.

Another complication is also how you measure performance. All of the implementations should have progress-guarantees, i.e. even under contention at least one can finish at a time. So in principle you could be wasting CPU cycles on multiple threads attempting to update your variable concurrently but still be better on the measure of 99th percentile latency because atomic operations won't resort to descheduling the thread and worse on worst-case latency because they're not fair. So just measuring averages might not tell the whole story here.

as usual, much appreciated your answers/comments. You have an excellent point, I wish I could upvote it twice! 1) Yes, those `statistics` are the only reason I would ever adventure to use relaxed atomics, it's a darn sharp tool. Herb Sutter says that there are 5 people on this planet that know how to use them correctly all the time - and I am definitely not one of them. — Eugene, Sep 07 '17 at 19:29
2) the second part about how i test this has two shortcomings indeed: first the one that you said that these `jmh` tests are hard to get right, especially when you want to measure something like in this example and second is that my knowledge of how a thread is *actually* scheduled or de-scheduled (and some other internal details) are coming mainly from *lots* of reading, but no practical examples. So this makes my arguments/questions more educated guesses unfortunately. But again thank you for your time answering this. — Eugene, Sep 07 '17 at 19:33
@Eugene If you want to understand the details of thread scheduling, there are two source I recommend. First, consult online slides and lectures from some leading CS departments' OS Design courses. Second, go explore an actual OS scheduler. For one of my college courses, we had to implement a Linux 2.6+ style preemptive scheduler in a vanilla Linux 2.4 kernel. It's a fun project, and you'd learn a lot. — Mike Strobel, Sep 08 '17 at 11:57

herm · Answer 4 · 2017-09-07T16:07:23.403

First of all the code you are writing is java which will create java byte code which translates to different atomic operations on different instruction sets (Arm vs powerpc vs X86...) which can behave different on implementations by different vendors and even between architectures from the same vendor (such as intel core 2 duo and skylake). So its really hard to answer your question!

This paper states that for the tested X86 architectures one execution of any atomic operation perform similarly (very small differnce between CAS, Fetch and add, swap) while CAS may fail and need to be executed multiple times. In case of one thread it will never fail however.

This stackoverflow post states:

Each object has a monitor associated with it. The thread that executes monitorenter gains ownership of the monitor associated with objectref. If another thread already owns the monitor associated with objectref, the current thread waits until the object is unlocked, then tries again to gain ownership. If the current thread already owns the monitor associated with objectref, it increments a counter in the monitor indicating the number of times this thread has entered the monitor. If the monitor associated with objectref is not owned by any thread, the current thread becomes the owner of the monitor, setting the entry count of this monitor to 1.

Lets look at the necessary operations in case of CAS:

public final int updateAndGet(IntUnaryOperator updateFunction) {
    int prev, next;
    do {
        prev = get();
        next = updateFunction.applyAsInt(prev);
    } while (!compareAndSet(prev, next));
    return next;
}

Fetch x, multiply x, Cas on x, check if cas succeeded

Now this is efficient in a non contented case because there is a minimal number of operations needed. But in case the cacheline is contended it is not very efficient because all threads are actively spinning while most of them fail. Furthermore I remember that spinning on a contended cacheline with an atomic operation is very expensive.

Now the important part in synchronize is:

If another thread already owns the monitor associated with objectref, the current thread waits until the object is unlocked

It depends on how this waiting is implemented.

the synchronized method could put the thread to sleep for a random time after failing to acquire the monitor, Also instead of using an atomic operation to check if the monitor is free it can do it with a simple read (this is quicker but I couldn't find a link to prove it).

My bet is that the waiting in synchronized is implemented in a smart way and optimized for situations with contention with one of the methods above or something similar and therefore it is quicker in the contended scenario.

The tradeof is that in non contended situations it is slower.

I still admit I have no proof.

This is really just a lot of assumptions. Can you back any of your "obvious" statements with hard data, or a maybe properly written benchamark? — Oleg Estekhin, Sep 07 '17 at 14:14
And how would these benchmarks differ from the ones of the author? CAS does a very specific thing. Read x, increment locally, atomic cas (oldx,x). Synchronize does more things (acquire monitor (with an atomic operation), Read x, increment locally, write x, release monitor) which explains why its slower in the sequential case. This is backed up by the link to a stack overflow post and if you google CAS you'll get a source for how that works. Why cas is slower in the contended case is speculation as I have not checked how the "waiting is implemented" exactly. prove me worng — herm, Sep 07 '17 at 14:44
Basically correct, besides that “CAS on monitor” can fail, like any other CAS, so you would need a subsequent “has CAS succeeded?”, but on the other hand, the entire “*read monitor, is monitor free?, CAS on monitor, has CAS succeeded?*” can be replaced by a single optimistic “CAS on monitor”, followed by “has CAS succeeded?”… — Holger, Sep 07 '17 at 15:28
You are right @Holger. Both cas and synchronize need to check if they succeded. And Oleg you are also right I make too many assumptions because I don't know how synchronize is implemented exactly. — herm, Sep 07 '17 at 15:56

CAS vs synchronized performance

4 Answers4

Linked