
I've been reading about volatile (https://www.ibm.com/developerworks/java/library/j-jtp06197/) and came across a passage saying that a volatile write is much more expensive than a nonvolatile write.

I can understand that there would be an increased cost associated with a volatile write, given that volatile is a means of synchronization, but I want to know exactly how a volatile write is so much more expensive than a nonvolatile write. Does it perhaps have to do with visibility across different thread stacks at the time the volatile write is made?

tshepang
Zain Tofie

3 Answers


Here's why, according to the article you have indicated:

Volatile writes are considerably more expensive than nonvolatile writes because of the memory fencing required to guarantee visibility but still generally cheaper than lock acquisition.

[...] volatile reads are cheap -- nearly as cheap as nonvolatile reads

And that is, of course, true: the memory-fence cost is bound to the write side, and a read executes the same way regardless of whether the underlying variable is volatile or not.

However, volatile in Java is about much more than just a volatile vs. nonvolatile memory read. In fact, in its essence it has nothing to do with that distinction: the difference lies in the concurrency semantics.

Consider this notorious example:

volatile boolean runningFlag = true;

void run() {
    while (runningFlag) { /* do work */ }
}

If runningFlag wasn't volatile, the JIT compiler could essentially rewrite that code to

void run() {
    if (runningFlag) while (true) { /* do work */ }
}

The overhead of reading runningFlag on each iteration, measured against not reading it at all, is, needless to say, enormous.
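To make that concrete, here is a runnable sketch of the same pattern (class and method names are made up for illustration): a worker thread spins on `runningFlag` while the main thread clears it. Because the flag is volatile, the JIT cannot hoist the read out of the loop, so the worker is guaranteed to observe the write and exit.

```java
public class StopFlagDemo {
    static volatile boolean runningFlag = true;

    // Starts a worker that spins on runningFlag, clears the flag from the
    // calling thread, and reports whether the worker exited within the timeout.
    static boolean stopsPromptly() {
        Thread worker = new Thread(() -> {
            while (runningFlag) { /* do work */ }
        });
        worker.start();
        try {
            Thread.sleep(50);
            runningFlag = false;  // volatile write: guaranteed visible to the worker
            worker.join(2000);    // returns well before the timeout in practice
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + stopsPromptly());
    }
}
```

Drop the `volatile` modifier and the JIT is free to compile the loop into the `if (runningFlag) while (true)` form shown above, in which case the worker may never terminate.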

Marko Topolnik
  • So loosely (taking into account the bits about distributed caching mentioned by @Peter and @DjDexter5GHz): `runningFlag` is on the heap in primary memory; a change to `runningFlag` means a write to primary memory, then secondary and tertiary memory if required and then a load of `runningFlag` again from (the most) tertiary or primary memory? Will a memory fence be constructed for each write to the different memory locations or would it be one memory fence in primary memory under which updates to all values of `runningFlag` are coordinated? – Zain Tofie Mar 07 '14 at 08:39
  • Also every volatile read of `runningFlag` implies at least one or more writes to tertiary/secondary memory stores even if `runningFlag` remains unchanged in primary memory? So it seems that the cost of volatile is based on the fact that so many more instructions are needed to exact guarantees but the underlying memory and processor architectures - specifically modern multi-core processor architectures also contribute non-trivially to the cost of volatile? – Zain Tofie Mar 07 '14 at 08:39
  • Your understanding is still not sufficient. On Intel CPU architecture, for example, a memory fence is not needed at all because the CPU itself guarantees proper visibility and ordering of writes. So it's just a single `mov` instruction, and on a different architecture it will at most take one additional memory fence instruction. – Marko Topolnik Mar 07 '14 at 08:52
  • I don't follow your reasoning where you say that a "volatile read implies at least one write". No writing is implied, just a plain RAM read (again a single `mov` instruction on x86). – Marko Topolnik Mar 07 '14 at 08:53
  • Ah ok thanks (can you point me to reading material?); about the implied writes - my assumption there spoke to the need for ensuring consistency but if it's a plain RAM read... – Zain Tofie Mar 07 '14 at 09:14
  • This, too: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf – Marko Topolnik Mar 07 '14 at 09:27

It is about caching. Since modern processors use caches, if you don't mark the data volatile it can stay in a core-local cache, and the write operation is fast (because the cache is close to the processor). If the variable is marked volatile, the system needs to write it all the way through to main memory, and that is a slower operation.

And yes, you are thinking along the right lines: it does have to do with the different threads, since each thread is separate and reads from the SAME main memory, but not necessarily from the same cache. Today's processors use many levels of caching, so this can be a real problem when multiple threads/processes use the same data.

EDIT: If the data stays in a local cache, other threads/processes won't see the change until the data is written back to memory.
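As a rough illustration of that cost difference, here is a naive timing sketch (names are made up, and for trustworthy numbers you would use a proper harness such as JMH, since the JIT may optimize the non-volatile loop heavily or even eliminate it, which is itself a demonstration of how cheap non-volatile access can become):

```java
public class WriteCostSketch {
    static volatile long volatileCounter = 0;
    static long plainCounter = 0;

    // Times n plain increments and n volatile increments.
    // Volatile writes must be made visible beyond the core-local
    // store buffer, so they are typically the slower of the two.
    static long[] timeWrites(int n) {
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) plainCounter++;
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) volatileCounter++;
        long t2 = System.nanoTime();
        return new long[] { t1 - t0, t2 - t1 };
    }

    public static void main(String[] args) {
        long[] ns = timeWrites(10_000_000);
        System.out.println("plain: " + ns[0] + " ns, volatile: " + ns[1] + " ns");
    }
}
```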

Dejan

Most likely it has to do with the fact that a volatile write has to stall the pipeline.

All writes are queued to be written to the caches. You don't see this cost with non-volatile writes/reads, as the code can just use the value you just wrote without involving the cache.

When you use a volatile read, it has to go back to the cache, and this means a volatile write (as implemented) cannot complete until the value has been written to the cache (in case you do a write followed by a read).

One way around this is to use a lazy write, e.g. AtomicInteger.lazySet(), which can be 10x faster than a volatile write as it doesn't wait.

Peter Lawrey
  • According to my understanding, `lazySet` actually satisfies the `volatile` requirements in full. The difference is only in the *timeliness* of the visibility of the write, which is unspecified for `volatile`. – Marko Topolnik Mar 07 '14 at 08:23
  • AFAIK the write is visible to other threads at the same time. What is different about lazySet is that it might not be visible to the same thread yet. – Peter Lawrey Mar 07 '14 at 21:30
  • Yes, that's the funny thing about `volatile`: everybody *assumes* the write will be visible immediately (or *synchronously*, to give this concept more substance), and it is indeed visible immediately on all known implementations; yet that is not what the semantics of `volatile` include. – Marko Topolnik Mar 07 '14 at 21:32
  • @MarkoTopolnik for most of the processors I have tested it takes between 20 and 50 ns to be usable to another core, on average depending on how you do it. In truth, everything takes some time, so "immediate" can only mean *really* fast. When you look out a window, you are actually looking into the past as light doesn't travel immediately. – Peter Lawrey Mar 07 '14 at 21:53
  • Yes, and `lazySet` gives you exactly the same thing, only with a potentially longer delay. So the semantics of `lazySet` are fully compliant with the semantics of a volatile write, IMO. That's what's strange about `lazySet`: it implies that `volatile` is about something more than that, but actually isn't. – Marko Topolnik Mar 07 '14 at 21:57
  • @MarkoTopolnik Good point, I was talking about the behaviour on x86 based systems, on other processors you could see different behaviour. – Peter Lawrey Mar 07 '14 at 21:59
  • I am talking about the specified semantics of `volatile` and `lazySet`, so the argument is universal. – Marko Topolnik Mar 07 '14 at 21:59
  • If you'd like, check out [my first question on this site](http://stackoverflow.com/questions/11761552/guarantees-given-by-the-java-memory-model), it deals with exactly this issue in some depth. – Marko Topolnik Mar 07 '14 at 22:04