7

synchronized blocks let me make a group of statements atomic while ensuring there is a happens-before relationship between block exit and entry.

I read that the biggest cost of synchronization is the memory visibility guarantees, not the lock contention.
Let's say I am able to guarantee memory visibility via other means:

How do I make a group of statements atomic without creating a happens-before relationship, that is, without the memory visibility effects of synchronized/Lock?

I tried to implement a lock in user space via CAS, but the built-in one severely outperforms it, and memory barriers are still emitted for the CAS variable.


In this example, a mutex without memory visibility effects would suffice.

int x; // Variable with release/acquire semantics

synchronized (this) {
    // acquire fence (monitor enter)

    int y = x;
    // acquire fence (read of x)

    // release fence (write of x)
    x = 5;

    // release fence (monitor exit)
}

The same set of fences is emitted twice (once by the mutex and once by x). Does this cause unnecessary overhead?


Is a lock without memory effects theoretically possible?
Would a lock without memory effects actually be more performant?
Is there a built-in way to accomplish this in C++ and/or Java?
If there isn't, can it be implemented in C++ and/or Java?

spongebob
  • @FrancescoMenzani What is expensive in synchronization is the blocking of threads. Blocking threads forces the CPU to switch between them, which is expensive. Compared to those costs, visibility costs are very small. If accessing data from CPU registers costs 1 nanosecond, then accessing it from memory maybe costs 10 nanoseconds. Switching threads costs far more than that because it involves many more operations. – akuzminykh May 08 '20 at 01:29
  • I'm a bit lazy to research the specific JLS parts, but I think every kind of mechanism in Java that involves synchronization, which is what has to be done when you want to make something atomic, involves visibility effects. This has to do with [memory barriers](https://www.infoq.com/articles/memory_barriers_jvm_concurrency/). The "happens-before relationship" is nothing but an abstraction over a memory barrier. Accessing volatile variables is a memory barrier. Entrance to and exit from synchronized blocks are memory barriers. And many more things. – akuzminykh May 08 '20 at 01:35
  • @akuzminykh In certain applications context switches can be avoided by pinning threads to specific CPU cores. Can you provide the source for your estimate? – spongebob May 08 '20 at 13:27
  • @FrancescoMenzani *"In certain applications context switches can be avoided by pinning threads to specific CPU cores."*, ok, good luck with that. [Here](https://www.agner.org/optimize/) you can find many interesting things. [Here](https://stackoverflow.com/questions/4633866/is-volatile-expensive) is an easier starting point where you'll find details about why `volatile` is very cheap. It's obvious that `synchronized` is more expensive as it involves blocking/switching of threads. – akuzminykh May 08 '20 at 14:25
  • There is `AtomicReference.weakCompareAndSetPlain` (also on `VarHandle`), but I would expect a standard lock to outperform naive code. – Tom Hawtin - tackline May 09 '20 at 22:57
  • This seems very much like a [premature optimization](http://wiki.c2.com/?PrematureOptimization); do you have benchmarks or data to suggest that the effort you're expending on this issue is worth the trouble? – dimo414 May 10 '20 at 05:13
  • @dimo414 I wouldn't call it premature optimization, because performance may be a feature and I think this is broadly applicable to any case that fits. After reading a (possibly flawed) answer stating that the main cost of a mutex is its memory guarantees, and realizing some logic I've written does not need those guarantees, I came up with this question. I don't have benchmarks because I did not find out how to implement this in the first place, which is why I asked whether it would theoretically be an improvement. – spongebob May 10 '20 at 13:09
  • It's a question of *how much* performance you'll squeeze out of this. While I agree performance is a worthwhile goal it comes with tradeoffs, primarily around maintainability and compatibility. It may be an interesting academic question, but without data I'd assume that the JVM is already able to make many appropriate optimizations, and trying to outsmart the JVM is likely to be a painful and low-reward exercise. – dimo414 May 10 '20 at 18:06

3 Answers

6

The cost of guaranteeing memory visibility in a mutex is negligible; in fact, on x86 it is free.

Acquiring a mutex requires an atomic read-modify-write operation with acquire semantics. For releasing a mutex, it is sufficient to use a simple store with release semantics. Consider a simple spin-lock: the acquire operation consists of a loop that repeatedly tries to set a lock flag to 1 if it is currently 0. To release the lock, the owning thread simply writes 0 into the lock flag. In many regards such a simple spin-lock is far from optimal, and there are many lock designs that try to improve on it (e.g., fairness, spinning on local cache lines, etc.), but in all these designs releasing the lock is certainly cheaper than acquiring it.
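
To make this concrete, here is a minimal spin-lock sketch in C++ (not part of the original answer; a naive illustration of the design just described, with made-up names, not a production-quality lock):

#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Acquire: an atomic read-modify-write that repeatedly tries to set
        // the flag, until it observes that the flag was previously false.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Busy-wait; a smarter lock would spin on a plain load here first.
        }
    }
    void unlock() {
        // Release: a simple store with release semantics is sufficient.
        locked.store(false, std::memory_order_release);
    }
};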

The x86 memory model is pretty strong: all atomic read-modify-write operations are sequentially consistent, all store operations effectively have release semantics, and all load operations have acquire semantics. That's why on x86 releasing a mutex can be done with a normal store; no additional instructions are required to ensure visibility of memory effects. On architectures with weaker memory models, like ARM or Power, you do need additional instructions, but their cost is negligible compared to the cost of the acquire operation. x86 also has special barrier instructions, but those are usually only relevant in certain cases in lock-free programming, and the cost of these instructions is about the same as an atomic read-modify-write.
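
A rough sketch of why release is free on x86 (added for illustration; the codegen noted in the comments is the typical output of mainstream compilers, and the function names are made up):

#include <atomic>

std::atomic<int> flag{0};

void release_store() {
    // On x86 this compiles to a plain MOV store - release ordering is free.
    flag.store(1, std::memory_order_release);
}

void seq_cst_store() {
    // A sequentially consistent store is more expensive: typically an XCHG
    // (or MOV followed by MFENCE), comparable to an atomic read-modify-write.
    flag.store(1, std::memory_order_seq_cst);
}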

The real cost of a mutex is not the visibility of memory effects, but contention and the serialization of execution. If the number of threads competing for the mutex is low, and the duration for which a thread holds it is short, then the overall performance impact will also be low. But if many threads fight for the mutex, and each holds it for a long time, then the others will have to wait longer until they can finally acquire the mutex and continue execution. This reduces the work that can be performed within a given time frame.

I am not sure what you mean by "Is a lock without memory effects theoretically possible?". The whole purpose of a mutex is to allow some operations to be performed - and also observed - as if they were atomic. This implies that the effect of the operation becomes visible to the next owner of the mutex. This is actually what the happens-before relation guarantees. If a thread A acquires the mutex, and this acquire operation happens-after a release operation by some thread B, then due to the transitivity of the happens-before relation, the operations performed by B while holding the mutex must have happened before the operations A is about to perform - and that means all memory effects have to be visible. If this is not guaranteed, then your mutex is broken and you have a race condition.

Regarding the volatile variable in your example - the Java memory model requires that all operations on shared volatile variables are sequentially consistent. However, if x is only ever accessed inside a critical section (i.e., protected by some mutex), then it does not have to be volatile. Volatile is only needed if some threads access the variable without any other synchronization mechanisms like a mutex.

The release/acquire semantics of the mutex operations are necessary to order the operations inside the mutex. In C++ one could implement a mutex using relaxed operations. The lock/unlock operations on the mutex itself would still be totally ordered (due to the modification order of the mutex), but we would lose the happens-before relation, so the operations inside the mutex would be unordered. While this would be possible in C++, it would be rather absurd, because, as I tried to explain, making the memory effects visible is very cheap (on x86 it is free), but you would lose a property that is absolutely crucial in virtually all cases.

Note: the store operation to release the mutex is cheaper than a store to a volatile variable. Volatile variables are sequentially consistent, but releasing a mutex can be done with a release-store. (Of course the Java memory model is not as flexible as the C++ model, so you cannot really implement a hand-knitted lock using more relaxed acquire/release operations.)
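
Such a relaxed "lock" could be sketched in C++ like this (a hypothetical illustration of the construction described above; as explained, it is deliberately broken as a mutex because it provides no happens-before edge):

#include <atomic>

// The exchange still makes lock/unlock atomic and totally ordered (via the
// modification order of `flag`), but with relaxed ordering there is no
// happens-before edge between an unlock and the next lock, so writes made
// inside the critical section are NOT guaranteed visible to the next owner.
class RelaxedSpinLock {
    std::atomic<bool> flag{false};
public:
    void lock() {
        while (flag.exchange(true, std::memory_order_relaxed)) { /* spin */ }
    }
    void unlock() {
        flag.store(false, std::memory_order_relaxed);
    }
};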

mpoeter
  • I know that volatile semantics would not be needed for `x` in my example. I made it volatile to show a case where the memory guarantees of the mutex are not needed. I've edited the example for more clarity. By asking whether a lock without memory effects is theoretically possible, I mean whether a lock can be implemented without memory fences or not, for example because memory fences are needed to publish the lock state between threads. – spongebob May 10 '20 at 13:39
  • In the case where memory visibility in the critical section is guaranteed by other means, do you think a mutex without a happens-before relationship would not be enough? I think that such a mutex could be used just to guarantee atomicity. – spongebob May 10 '20 at 13:40
  • Why would "it be rather absurd, because you would lose a property that is absolutely crucial in virtually all cases"? Consider reading a variable in acquire mode just after locking the mutex and writing that same variable in release mode just before unlocking the mutex: wouldn't that suffice to create a happens-before relation and make the operations inside the mutex ordered, in the case where the mutex itself provided only the atomicity guarantee? – spongebob May 10 '20 at 17:27
  • Yes, it would be enough, but why would you want to do that? You would end up with two additional stores, only to achieve the same guarantee that the mutex would provide out of the box. – mpoeter May 10 '20 at 20:55
  • Because I need to read the variable without holding the lock. Right now I made the read inside the mutex plain, but realized I could also keep it in acquire mode if I could remove the memory guarantees of the mutex, hence this question. – spongebob May 10 '20 at 21:30
  • Yes, theoretically it is possible to implement a "kind of lock" using relaxed operations, but it gains you virtually nothing (if anything at all - again, on x86 you gain _actually nothing_). If you find yourself using a volatile variable inside a lock there might be a chance that either the variable does not actually have to be volatile (because it is never accessed outside the lock), or you might be able to get rid of the lock completely. Whether this is possible or not of course depends on the situation. – mpoeter May 11 '20 at 08:10
  • In your example above you could use an `AtomicInteger` and [`getAndSet`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/atomic/AtomicInteger.html#getAndSet(int)) to avoid the lock, but I suppose your actual use case is not that simple. However, if you describe your actual use case in more detail I might be able to help. – mpoeter May 11 '20 at 08:10
  • Of course that is just an example. Your answer is already good. I described my actual use case, and I am open to suggestions unrelated to this specific question. – spongebob May 11 '20 at 12:13
  • So you are basically trying to implement a concurrent queue. What are your requirements? Single producer/single consumer (SPSC), multiple producers/multiple consumers (MPMC), or a mix (MPSC/SPMC)? Should it be bounded or unbounded? Do you need blocking operations, or is a `tryPop` sufficient (i.e., returning false on an empty queue)? But this discussion is probably outside the scope of the original question. You might want to open a new one? – mpoeter May 11 '20 at 13:36
  • Sure, let's continue this discussion via mail. But if you are satisfied with my answer, it would be great if you could accept it. ;) – mpoeter May 12 '20 at 08:16
2

A busy-waiting lock without a happens-before relationship between exit and enter can be implemented like this in Java:

private static final VarHandle STATE = ...;

private boolean state;

void lock() {
    // Atomically swap in true; the write has release semantics only,
    // so acquiring the lock emits no acquire fence.
    while ((boolean) STATE.getAndSetRelease(this, true)) {
        // Lock is held by another thread: spin on a plain read until it looks free.
        while (state) {
            Thread.onSpinWait();
        }
    }
}

void unlock() {
    // Opaque store: atomic and eventually visible to spinning threads,
    // but with no release fence.
    STATE.setOpaque(this, false);
}

In the above code, Thread.onSpinWait() on x86 prevents state from being cached indefinitely. On architectures where that is not the case, the following could be used instead:

while ((boolean) STATE.getOpaque(this)) {}
spongebob
  • I believe that with `getAndSetRelease` (Release/Acquire mode) a happens-before relationship is exactly what you establish. Instead, you'd want to do an opaque set/get, which also takes care of that problem with `state` being cached indefinitely. At least, that is my understanding from reading [this page](http://gee.cs.oswego.edu/dl/html/j9mm.html). Here, Doug Lea in his own words associates the RA mode with "causal 'happens-before' consistency" needed in "producer consumer designs". Please correct me if I am wrong! Many thanks for your example. – Martin Andersson Apr 12 '21 at 06:33
  • @MartinAndersson There is no way to do get-and-set using opaque semantics in Java. Note that `getAndSetRelease()` reads the value using plain semantics, and writes it using release semantics. – spongebob Apr 12 '21 at 08:11
1

I was asking exactly the same question a while ago.

I solved it for my particular use case with a simple piece of code. Other use cases will have different optimal solutions.


Example Use Case 1: Hot loop that needs to check an atomic

Assuming your use case is something like this: a hot loop has to spin as fast as possible, and it can't afford to be continuously checking an atomic (or volatile) variable, as this involves synchronising state between two processor cores.

The solution is remarkably simple: only check every 1024 iterations. Why 1024? It's a power of 2, so any MOD operation gets optimised to a fast bitwise AND. Any other power of 2 will do. Tune to suit.

After that, the overhead of an atomic becomes negligible compared to the work the loop is performing.

It would be possible to implement other, more complicated solutions. But these lines are sufficient:

#include <cstdint>

// Within a hot loop running on a single core ...
int32_t local = 0;
for (;;)
{
    // ... perform the loop's real work here ...

    if (++local % 1024 == 0) // Optimised to a bitwise AND (checks if the lower 10 bits are 0).
    {
        // Check atomics, which may force the processor to synchronize state with another core.
    }
}

Example Use Cases 2...N: The Realtime Bible

There is an excellent talk on various levels of locking for other use cases, see: Real time 101 - David Rowland & Fabian Renn Giles - Meeting C++ 2019.


Q. Is a lock without memory effects theoretically possible?

  • A. Yes. If two threads were pinned to the same core, then both threads could share state using registers. If two threads are running on different cores that sit next to each other on the die, they could share state through the L1, L2 or L3 cache. Otherwise, threads on different cores usually communicate shared state through a flush to RAM.

Q. Would a lock without memory effects actually be more performant?

  • A. Yes. However, this is really only applicable to embedded operating systems or device drivers; see the update below.

Q. Is there a built-in way to accomplish this in C++ and/or Java?

  • A. No. But the answers above can get very close, often by avoiding locks altogether most of the time. Better to spend the time mastering a good profiler.

Q. If there isn't, can it be implemented in C++ and/or Java?

  • A. No. Not realistic in a high-level language; see the update below.

Update

Hardware interrupts are entirely distinct from software interrupts. A hardware interrupt causes the processor Instruction Pointer (IP) to switch to executing another service routine. So a lock without memory effects is theoretically possible if there are many "threads" running on a single core, and the "threads" (i.e. the Interrupt Service Routines triggered by hardware interrupts) communicate via registers in the CPU, or at the very least via an internal cache (L1, L2, L3), as this does not end up hitting RAM.

On a practical level, this is probably not relevant to any high-level language such as C++ or Java, and it is probably not relevant to user-mode processes in high-level operating systems such as Linux or Windows. It is probably only possible when using an embedded OS such as QNX, or perhaps when writing kernel-mode device drivers for Windows or Linux.

So in practice, a reasonable rule of thumb is to just assume that all locks have memory effects. If one is concerned about performance, run a profiler. If there are performance issues, choose from the selection of threading architectures in said Realtime Bible.

Contango
  • How exactly should two threads communicate via _registers_? That is simply not possible! – mpoeter May 14 '20 at 17:14
  • @mpoeter Remember - theoretically possible. Hand-coding threads in assembly would do it. I used to work on embedded systems; I wrote a simple OS that did just that, handling many tasks with shared state in registers. – Contango May 14 '20 at 20:21
  • I am curious about what architecture that was. I am not aware of any mainstream ISA that would support that - not even with hand written assembly. – mpoeter May 14 '20 at 20:52
  • I think what you did in your Example Use Case 1 can be better achieved using the PAUSE x86 instruction (`Thread.onSpinWait()` in Java and `_mm_pause()` in C++). I understand that locks can be avoided by making use of atomic instructions, however that is not the point of my question. You can find my particular use case in the revision history. Thanks for the link to the talk. – spongebob May 14 '20 at 21:13
  • @mpoeter Microchip PIC. Seriously low level: a microcontroller with a few tens of kilobytes of flash and RAM. Hardware interrupts triggered ISRs (Interrupt Service Routines) to execute multiple tasks in parallel. It is theoretically possible to do the same thing on the x86 architecture, but it would probably require some sort of embedded OS, perhaps something other than Windows or Linux. See https://www.embeddedarm.com/software/solutions-x86, e.g. QNX. – Contango May 15 '20 at 09:38
  • That sounds more like _concurrent_ execution instead of _parallel_ execution. If you have two threads running on two separate cores, I see no way these could communicate via _registers_, as this would mean that one core can modify the register content of some other core. – mpoeter May 15 '20 at 10:01
  • @mpoeter Realistically, two threads cannot communicate via registers unless they are running on the same core. That's why I specified "If two threads were pinned to the same core". And practically, probably not realistic - theoretical only, limited to an embedded OS. – Contango May 15 '20 at 10:37
  • Yes, but then we are talking about _concurrent_ execution, not _parallel_. And in this scenario spinning is the worst possible solution, since there is no way the lock can be released by the other thread as long as the spinning thread is occupying the core. – mpoeter May 15 '20 at 10:50
  • @mpoeter Sigh. You're forgetting about hardware interrupts. They are entirely distinct from software interrupts. And you're forgetting about hardware sleep commands - I never said anything about spinning, just communication. See: https://en.wikipedia.org/wiki/HLT_(x86_instruction) – Contango May 16 '20 at 11:06