
From what I've read, the "volatile" keyword in Java ensures that a thread always fetches the most up-to-date value of a particular pointer, usually by reading/writing directly from/to memory to avoid cache inconsistencies.

But why is this needed? To my knowledge, this is already done on a hardware level. If I remember correctly from my system architecture class, a processor core that updates a memory location sends an invalidation signal to the other processors' caches, forcing them to fetch those lines from memory when the time comes. Or, the other way around: if a processor fetches memory, it forces cached (but not yet written) lines in other caches to be flushed to memory first.

My only theory is that this actually has nothing to do with caches at all, despite all the explanations I've read. It has to do with the fact that data in the JVM can reside in two places: a thread's local stack and the heap. A Java thread may use its stack as a kind of cache. I'll buy that, but that also means that using volatile on data that resides on the heap is useless, since the heap is shared by all threads and abides by hardware-implemented coherence?

Eg:

public final int[] is = new int[10];

accessing `is`'s data will always result in getting the most up-to-date data, since the data resides on the heap. The pointer, however, is a primitive and might fall victim to the stack problem, but since it's final we don't have this problem.
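For contrast, here is a sketch of the non-final case I have in mind (the field and method names are just illustrative):

public volatile int[] is = new int[10];

public void replace() {
    // volatile applies to the reference itself, not to the array
    // elements it points to; without it, another thread might keep
    // reading through a stale reference.
    is = new int[10];
}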

Are my assumptions correct?

Edit: This is not a duplicate as far as I can tell. The alleged duplicate is one of those misleading answers that says it has to do with cache coherence. My question is not what volatile is used for, nor how to use it. It's testing a theory, and goes more in depth.

Jake

3 Answers


But why is this needed? To my knowledge, this is already done on a hardware level.

This is incorrect. In a modern multi-core system, a simple memory write instruction does not necessarily write through to main memory (certainly not immediately), and a memory read instruction is not guaranteed to read the latest value from main memory / other caches. If memory read/write instructions always did that, the memory caches would be a waste of time.

In order to guarantee those things, the (native code) compiler needs to emit instructions at key points that cause a cache write through or a cache invalidation.

My only theory is that this actually has nothing to do with caches at all ...

That is incorrect. It has everything to do with caches. The problem is that you are misunderstanding how a typical instruction on a typical modern multi-core processor deals with the caches.

ISAs are designed so that the caches make single-threaded code run fast ... by avoiding trips to main memory. If only one thread is reading and writing the value at a given address, the fact that the copy of the value in the processor's cache is newer than the copy in main memory doesn't matter.

But when there are multiple threads, you can have two threads running on different cores, with different memory caches. If there are N cores, there could be N+1 different "versions" of a given address->value association. That is chaotic. There are two ways to deal with this in Java:

  • Declare the variable to be volatile, which tells the compiler to use (expensive) cache-flush and/or cache-invalidation instructions (or instructions which flush or invalidate as a side effect) to implement reads and writes.

  • Use proper synchronization, and rely on the happens-before relations to inform the compiler where to place the memory barriers. A typical synchronizing operation will also involve instructions that flush and/or invalidate the caches as required. That is typically what provides the memory barrier.

However, this is all ISA specific, and the particular machine instructions used will depend on the JIT compiler.
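To make the two approaches concrete, here is a minimal sketch (class and member names are illustrative):

class SharedFlag {
    // Option 1: a volatile field. A write by one thread is visible
    // to subsequent reads by other threads.
    private volatile boolean done;

    void finish() { done = true; }
    boolean isDone() { return done; }

    // Option 2: synchronized methods. The writer's unlock happens-before
    // the reader's lock on the same monitor, giving the same visibility
    // guarantee for 'doneLocked'.
    private boolean doneLocked;

    synchronized void finishLocked() { doneLocked = true; }
    synchronized boolean isDoneLocked() { return doneLocked; }
}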


Reference:


The other thing to note is that there is another kind of caching that happens in a typical compiled program: caching of temporary variables in registers.
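A classic illustration of register caching (a sketch; whether the read is actually hoisted depends on the JIT):

class Worker implements Runnable {
    // Without volatile, the JIT may hoist the read of 'running' into a
    // register, so this loop might never observe another thread's call
    // to stop().
    private boolean running = true;  // try declaring this volatile

    public void run() {
        while (running) {
            // busy work
        }
    }

    public void stop() {
        running = false;
    }
}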

The Java Memory Model doesn't directly talk about the behavior of the hardware: memory caches and registers. Rather, it specifies the cases where a multi-threaded application's use of shared memory is guaranteed to work. But the underlying issue is far broader than "word tearing".

And finally, the "expression stack" in the JVM abstract machine is really just an artifice for specifying the operational semantics. When the bytecodes have been compiled to native code, values are stored in either hardware machine registers or hardware memory locations. The expression stack no longer exists. The call stack / local frames exist, of course, but they are implemented as ordinary memory.

Stephen C

Your assumptions are correct for int, and in fact for any data type which fits inside a single 32-bit slot of a JVM stack frame.

The same guarantee does not hold for longs, doubles, etc., which spill over into two slots. volatile tells the JVM that load/store operations for a given value must not be reordered by any bytecode optimization, to ensure that you don't get into a state where a large type consists of 32 bits from one write operation and 32 bits from a later one, even if that means using lock primitives at the CPU cache level. As long as you avoid a cache miss, reading a volatile value should behave pretty much the same as reading a regular one on x86 (although this is heavily architecture-dependent).
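For illustration, a sketch of the tearing risk described above (a hypothetical demo; on most modern 64-bit JVMs and hardware a torn read will not actually show up):

class TearingDemo {
    // Per JLS 17.7, writes to a non-volatile long may be performed as
    // two separate 32-bit writes. Declaring the field volatile makes
    // each read and write atomic.
    static long shared;  // try adding 'volatile'

    public static void main(String[] args) {
        new Thread(() -> { while (true) shared = 0L; }).start();
        new Thread(() -> { while (true) shared = -1L; }).start();
        while (true) {
            long seen = shared;
            // A torn read mixes 32 bits of 0L with 32 bits of -1L.
            if (seen != 0L && seen != -1L) {
                System.out.println("Torn read: " + Long.toHexString(seen));
            }
        }
    }
}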

Chris Mowforth
  • So if I understand you correctly, I need the volatile keyword for primitives (which include pointers), but not for what the pointer points to. So there is no need for an array with volatile content? – Jake Apr 19 '17 at 10:17
  • And what about 64-bit versions of JVM and machines? Are 64-bit primitives still pointers, or is that platform dependent? – Jake Apr 19 '17 at 10:18
  • *no*- not platform dependent and nothing to do with pointers- sizes are [here](https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html). If you have an array of `long`s you still have to load/store each element at the bytecode level, in which case you need volatile if you know multiple threads will be reading. – Chris Mowforth Apr 19 '17 at 10:20
  • OK, but are stack frame slots still 32 bits on a 64-bit machine and Java version? – Jake Apr 19 '17 at 10:35

I've done some research and come up with the following:

A volatile variable is affected in two ways.

Take this Java example:

public int i = 0;
public void increment(){
   i++;
}

Without volatile, the JIT will issue the following pseudo-instructions for the increment method:

LOAD  R1,i-address
... arbitrary number of instructions, not involving R1
ADDI  R1,1
... arbitrary number of instructions, not involving R1
... this is not guaranteed to happen, but probably will:
STORE R1, i-address

Why the arbitrary instructions? Because of optimization: the pipeline will be stuffed with instructions not involving R1 to avoid pipeline stalls. In other words, you get out-of-order execution. Writing i back to memory will also be avoided if possible. If the optimizer can figure out that the store is unnecessary, it won't emit it; it might miss the fact that i is accessed from another thread, though, in which case that thread will still see i as 0.

When we change i to volatile, we get:

STEP 1

LOAD  R1,i-address
ADDI  R1,1
STORE R1, i-address

Volatile prevents this out-of-order execution: the compiler will not stuff the pipeline to resolve hazards, and it will never keep i in a local place (a register or a stack frame). It guarantees that any operation on i involves a LOAD and a STORE, in other words a fetch from and a write to memory. Memory, however, does not mean main memory or RAM; it means the memory hierarchy. LOADs and STOREs are used for all variables, volatile or not, just not to the same extent. How they are handled is up to the chip architects.

STEP 2

LOAD  R1,i-address
ADDI  R1,1
LOCK STORE R1, i-address

The LOCK instruction issues a memory barrier, meaning that any other thread trying to read or write i's address will have to wait until the store operation has been completed. This ensures that the actual write-back of i is atomic.

Note, though, that the Java line "i++" is not atomic: things can still happen between the LOAD and the STORE instruction. That's why you typically need explicit locks (which are themselves implemented with volatiles) to make operations on i truly atomic. Take this example:

volatile int i = 0;

// Thread A
for (int j = 0; j < 1000; j++)
    i++;

// Thread B
for (int j = 0; j < 1000; j++)
    i++;

will produce unpredictable results on a multi-core processor, and needs to be solved like this:

private volatile int i = 0;

public synchronized void incrementI() {
    i++;
}
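Alternatively (a sketch based on the atomic-variables tutorial cited in the sources below), java.util.concurrent.atomic gives the same result without an explicit lock:

import java.util.concurrent.atomic.AtomicInteger;

class Counter {
    private final AtomicInteger i = new AtomicInteger(0);

    // incrementAndGet is a single atomic read-modify-write, typically
    // compiled down to a LOCK-prefixed instruction on x86.
    public void increment() {
        i.incrementAndGet();
    }

    public int get() {
        return i.get();
    }
}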

Sources:

https://docs.oracle.com/javase/tutorial/essential/concurrency/atomic.html
https://docs.oracle.com/cd/E19683-01/806-5222/codingpractices-1/index.html

Conclusion: According to both Intel and AMD, cache consistency is managed by the hardware, so volatile has nothing to do with caches, and "volatiles are forced to live in main memory" is a myth. Volatile does, however, probably cause additional cache invalidations indirectly, since STOREs are issued more frequently.

I am open to the idea that volatile will cause a write-through on obscure architectures, though.

Jake
  • *"I suspect it has more to do with processor register size and primitive size. A 32-bit primitive will fit in a 32-bit register, but a 64 bit primitive must be loaded from memory in two sequential machine instructions."* - This is wrong. The memory data buses on a 64bit processor can fetch 8 bytes in a single memory operation: http://stackoverflow.com/questions/39182060/why-isnt-there-a-data-bus-which-is-as-wide-as-the-cache-line-size – Stephen C Apr 21 '17 at 07:31
  • *"I have not been able to find anything supporting Stephen C's claim ..."* - apart from all of the sources that you dismissed as "misleading answers" in your original question :-) – Stephen C Apr 21 '17 at 07:35
  • Feel free to back up your claim. I'd like proof that cache invalidation is implicitly done by a running program, and not by the hardware. And it's great that you can fetch 64 bits from memory in a single machine instruction on a 64-bit processor; I don't contradict that. But what about a 32-bit machine with 32-BIT REGISTERS? And just because you can read it, that doesn't mean you can apply an atomic instruction to it. – Jake Apr 21 '17 at 08:09
  • I have. See my updated answer. Particularly read the reference – Stephen C Apr 21 '17 at 08:09