
I am trying to dive deep into the volatile keyword in Java and have set up two test environments. I believe both are x86_64 and use HotSpot.

Environment 1:
  Java version: 1.8.0_232
  CPU: AMD Ryzen 7 (8 cores)

Environment 2:
  Java version: 1.8.0_231
  CPU: Intel i7

Code is here:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class Test {

  private boolean flag = true; //left non-volatile intentionally
  private volatile int dummyVolatile = 1;

  public static void main(String[] args) throws Exception {
    Test t = new Test();
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe unsafe = (Unsafe) f.get(null);

    Thread t1 = new Thread(() -> {
        while (t.flag) {
          //int b = t.dummyVolatile; // volatile read
          //unsafe.loadFence();
          //unsafe.storeFence();
          //unsafe.fullFence();
        }
        System.out.println("Finished!");
      });

    Thread t2 = new Thread(() -> {
        t.flag = false;
        unsafe.fullFence();
      });

    t1.start();
    Thread.sleep(1000);
    t2.start();
    t1.join();
  }
}

"Finished!" is never printed, which does not make sense to me. I expected the fullFence() in thread 2 to make the write flag = false globally visible.

From my research, HotSpot uses lock/mfence to implement fullFence on x86. And according to Intel's instruction-set reference manual entry for mfence:

This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.

Even "worse": if I comment out fullFence in thread 2 and un-comment any one of the xxxFence calls in thread 1, the code prints "Finished!". This makes even less sense, because lfence, at least, is "useless"/a no-op for memory ordering on x86.

Maybe my source of information contains inaccuracies, or I am misunderstanding something. Please help, thanks!

DavisZ

2 Answers


It's not the runtime effect of the fence that matters, it's the compile-time effect of forcing the compiler to reload stuff.

Your t1 loop contains no volatile reads or anything else that could synchronize-with another thread, so there's no guarantee it will ever notice any changes to any variables. i.e. when JITing into asm, the compiler can make a loop that loads the value into a register once, instead of reloading it from memory every time. This is the kind of optimization you always want the compiler to be able to do for non-shared data, which is why the language has rules that let it do this when there's no possible synchronization.

And then of course the condition can get hoisted out of the loop. So with no barriers or anything, your reader loop can JIT into asm that implements this logic:

if(t.flag) {
   for(;;){}  // infinite loop
}

Besides ordering, the other part of Java volatile is the assumption that other threads may change it asynchronously, so multiple reads can't be assumed to give the same value.
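For contrast, here is a minimal sketch (class and field names are mine, not from the question) of the straightforward fix: declare flag itself volatile. The JIT is then forbidden from hoisting the read out of the loop, so the reader is required to eventually observe the store and the program terminates.

```java
public class VolatileFlag {
  // volatile: every iteration must re-read the field; the JIT may not
  // cache it in a register and hoist the check out of the loop.
  private volatile boolean flag = true;

  public static void main(String[] args) throws InterruptedException {
    VolatileFlag t = new VolatileFlag();
    Thread reader = new Thread(() -> {
        while (t.flag) {
          // busy-wait; each pass performs a fresh volatile load
        }
        System.out.println("Finished!");
      });
    reader.start();
    Thread.sleep(100);
    t.flag = false; // volatile store: guaranteed to become visible
    reader.join();
  }
}
```

Without volatile (as in the question's code), none of this is guaranteed.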

But unsafe.loadFence(); makes the JVM reload t.flag from (cache-coherent) memory every iteration. I don't know if this is required by the Java spec or merely an implementation detail that makes it happen to work.

If this was C++ with a non-atomic variable (which would be undefined behaviour in C++), you'd see exactly the same effect in a compiler like GCC. _mm_lfence would also be a compile-time full-barrier as well as emitting a useless lfence instruction, effectively telling the compiler that all memory might have changed and thus needs to be reloaded. So it can't reorder loads across it, or hoist them out of loops.

BTW, I wouldn't be so sure that unsafe.loadFence() even JITs to an lfence instruction on x86. It is useless for memory ordering (except for very obscure stuff like fencing NT loads from WC memory, e.g. copying from video RAM, which the JVM can assume isn't happening), so a JVM JITing for x86 could just treat it as a compile-time barrier. Just like what C++ compilers do for std::atomic_thread_fence(std::memory_order_acquire); - block compile time reordering of loads across the barrier, but emit no asm instructions because the asm memory of the host running the JVM is already strong enough.


In thread 2, unsafe.fullFence(); is, I think, useless. It just makes that thread wait until earlier stores become globally visible, before any later loads/stores can happen. t.flag = false; is a visible side effect that can't be optimized away, so it definitely happens in the JITed asm whether there's a barrier following it or not, even though it's not volatile. And it can't be delayed or merged with something else because there's nothing else in the same thread.

Asm stores always become visible to other threads, the only question is whether the current thread waits for its store buffer to drain or not before doing more stuff (especially loads) in this thread. i.e. prevent all reordering, including StoreLoad. Java volatile does that, like C++ memory_order_seq_cst (by using a full barrier after every store), but without a barrier it's still a store like C++ memory_order_relaxed. (Or when JITing x86 asm, loads/stores are actually as strong as acquire/release.)

Caches are coherent, and the store buffer always drains itself (committing to L1d cache) as fast as it can to make room for more stores to execute.


Caveat: I don't know a lot of Java, and I don't know exactly how unsafe / undefined it is to assign a non-volatile variable in one thread and read it in another with no synchronization. Based on the behaviour you're seeing, it sounds exactly like what you'd see in C++ for the same thing with non-atomic variables (with optimization enabled, as HotSpot always does).

(Based on @Margaret's comment, I updated with some guesswork about how I assume Java synchronization works. If I mis-stated anything, please edit or comment.)

In C++ data races on non-atomic vars are always Undefined Behaviour, but of course when compiling for real ISAs (which don't do hardware race-prevention) the results are sometimes what people wanted.


PS: just using barriers to force a compiler to re-read a value isn't safe in general: it could choose to re-read the value multiple times even if the source copies it to a local variable. So the same tmp var might seem to be both true and false in one execution. At least that's true in C and C++ because data races are undefined behaviour in those languages; see Who's afraid of a big bad optimizing compiler? on LWN about this and other problems you'd run into if you just use barriers and plain (non-volatile) variables. Again, I don't know if that's a possible problem in Java or if the language spec would forbid a JVM from inventing loads after int tmp = shared_plain_int; if tmp is used multiple times across function calls.
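To make the shape of that hazard concrete, here is a hypothetical Java rendering of the copy-to-local pattern the LWN article discusses. The names are mine, and the code itself runs uneventfully; the point is only to mark where an invented reload would bite (whether a JVM is allowed to do that is exactly the open question above).

```java
public class LocalCopy {
  // Plain (non-volatile) field, imagined to be written by another thread.
  static boolean shared = true;

  static String consume() {
    boolean tmp = shared; // intended: exactly one read of 'shared'
    // The LWN concern: a compiler that re-reads 'shared' here instead of
    // keeping 'tmp' in a register could make two uses of 'tmp' disagree
    // within a single call, if another thread stores in between.
    if (tmp) {
      return "saw true";
    }
    return "saw false";
  }

  public static void main(String[] args) {
    System.out.println(consume());
  }
}
```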

Peter Cordes
    I believe your answer is 100% correct. Java "works" in terms of sync-with relationship and the two threads don't sync on anything in the first place. The OP may just have thought of a barrier as "push data to other CPUs", which seems to be a common pitfall. – Margaret Bloom Jan 11 '20 at 13:26
  • @MargaretBloom: thanks, updated my answer with an explanation in those terms of why a read loop that has no possible synchronization points is a problem. If you could give it a look to check that I didn't make any wrong assumption or misleading statements, that would be great. – Peter Cordes Jan 11 '20 at 19:28
  • 1
    It seems fine to me. One could probably pin down the exact line in the Java spec, but I don't see the point as it's a lot of work. You correctly spotted the main issue and there's no need for more technical details, in my opinion. – Margaret Bloom Jan 11 '20 at 20:50
  • 2
    Addendum and corrections: of course, the effect of `unsafe.[…]Fence()` is entirely unspecified, as even the existence of this implementation specific class `Unsafe` is not mentioned in the spec. With Java 9’s introduction of `VarHandle`, we have an official API for fences in Java for the first time, however, it has been forgotten to hand out an actual specification (especially regarding the interaction with other Java constructs), the doc vaguely points to similarities to C and C++. – Holger Jan 13 '20 at 13:39
  • 1
    The statement `t.flag = false;` is not a mandatory side effect (there is no such thing in Java), but since the thread terminates after that, it has the effect of a barrier already. In terms of Java’s specification, there is a *happens-before* relationship between the thread’s end and every other thread detecting the thread’s end, which would guarantee the visibility of what the thread did. While not guaranteed, a thread looping with the right fence will eventually notice the terminated thread’s effects, even when not reading the thread’s running state. – Holger Jan 13 '20 at 13:42

Not attempting to dispute the in-depth, low-level, ASM- and JIT-geared accepted answer given by @Peter Cordes, I'd like to offer a Java-only explanation of the behavior the OP described. If the OP wants to emulate volatile semantics for his private boolean flag field, then according to Alexey Shipilev's blog post On The Fence With Dependencies, the code is missing the LoadLoad and LoadStore barriers on the load of the volatile-hopeful field flag:

The second part of that is the volatile load:

int t = x; // volatile load 
[LoadLoad] 
[LoadStore] 
<other ops>

(The x variable in the blog plays the role of the flag field in the question's code.)

So, according to the blog, it is no wonder that the flag field in the question's code does not exhibit volatile behavior.
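Tying that back to the question's code: a fence placed after the plain read plays the role of those trailing [LoadLoad][LoadStore] barriers. A runnable sketch of that variant follows (same sun.misc.Unsafe reflection trick as the question; that the loop actually exits is what HotSpot delivers in practice, not something the spec guarantees):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class FenceLoop {
  private boolean flag = true; // plain, non-volatile, as in the question

  public static void main(String[] args) throws Exception {
    FenceLoop t = new FenceLoop();
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe unsafe = (Unsafe) f.get(null);

    Thread reader = new Thread(() -> {
        while (t.flag) {
          // Stands in for the [LoadLoad][LoadStore] barriers that would
          // follow a real volatile load; on HotSpot it forces the plain
          // load of 'flag' to be redone each iteration.
          unsafe.loadFence();
        }
        System.out.println("Finished!");
      });
    reader.start();
    Thread.sleep(100);
    t.flag = false; // plain store; cache coherence makes it visible
    reader.join();  // in practice (HotSpot, x86) the loop exits
  }
}
```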

The blog then goes on to explain the implementation of the barrier instructions on various hardware, but I think @Peter Cordes has already discussed that side of the story in great detail.

Obviously, this answer is slightly "formal" compared to the spirit of the question, the accepted answer, and the comments, but hopefully it can shed some light on the issue.

igor.zh