
This question came to me after reading this answer.

Code example:

class Obj1 {
  int f1 = 0;
}

volatile Obj1 v1;
Obj1 v2;

Thread 1            | Thread 2 | Thread 3
-------------------------------------------------
var o = new Obj1(); |          |
o.f1 = 1;           |          |
v1 = o;             |          |
                    | v2 = v1; |
                    |          | var r1 = v2.f1;

Is (r1 == 0) possible?
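For experimenting, the litmus test above can be turned into a plain Java stress harness. This is only a sketch: the class name RepublicationHarness and the helpers runOnce, await and join are hypothetical. Thread 3 reads v2 into a local variable and checks it for null instead of dereferencing it directly. Raw threads like these rarely interleave tightly enough to expose weak-memory effects, so never observing r1 == 0 proves nothing. A dedicated tool such as jcstress is better suited for that.

```java
import java.util.concurrent.CountDownLatch;

final class RepublicationHarness {
    static final class Obj1 {
        int f1 = 0;
    }

    static volatile Obj1 v1;   // safe publication: Thread 1 -> Thread 2
    static Obj1 v2;            // unsafe republication: Thread 2 -> Thread 3

    /** Runs the three threads once; returns r1, or -1 if Thread 3 saw v2 == null. */
    static int runOnce() {
        v1 = null;
        v2 = null;
        int[] r1 = {-1};
        CountDownLatch go = new CountDownLatch(1);
        Thread t1 = new Thread(() -> {
            await(go);
            var o = new Obj1();
            o.f1 = 1;
            v1 = o;                      // volatile write
        });
        Thread t2 = new Thread(() -> {
            await(go);
            v2 = v1;                     // volatile read, then plain write
        });
        Thread t3 = new Thread(() -> {
            await(go);
            Obj1 p = v2;                 // plain read, done once
            if (p != null) r1[0] = p.f1;
        });
        t1.start(); t2.start(); t3.start();
        go.countDown();                  // release all three threads at roughly the same time
        join(t1); join(t2); join(t3);
        return r1[0];
    }

    static void await(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    static void join(Thread t) {
        try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        int zero = 0, one = 0, notSeen = 0;
        for (int i = 0; i < 2_000; i++) {
            int r = runOnce();
            if (r == 0) zero++;
            else if (r == 1) one++;
            else notSeen++;
        }
        System.out.println("r1 == 0: " + zero + ", r1 == 1: " + one + ", v2 == null: " + notSeen);
    }
}
```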

Here object o:

  • first published safely: from Thread 1 to Thread 2 via the volatile field v1
  • then published unsafely: from Thread 2 to Thread 3 via v2

The question is: Can Thread 3 see o as partially constructed (i.e. o.f1 == 0)?

Tom Hawtin - tackline says it can: Thread 3 can see o as partially constructed, because there is no happens-before relation between o.f1 = 1 in Thread 1 and r1 = v2.f1 in Thread 3 due to unsafe publication.
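Tom's argument can be checked mechanically. The sketch below uses a hypothetical class HappensBefore and collapses the three initial default writes into a single "init" node. It encodes the program-order and synchronizes-with edges of the example and computes happens-before as their transitive closure with Warshall's algorithm. W[o.f1]=1 happens-before Thread 2's volatile read, but nothing connects it to Thread 3's read of o.f1, because the plain write and read of v2 create no synchronizes-with edge. So happens-before consistency alone lets R[o.f1] see either 0 or 1.

```java
final class HappensBefore {
    // Actions of the example; the three default writes are collapsed into one "init" node.
    static final String[] ACTIONS = {
        "init: W[v1]=null, W[v2]=null, W[o.f1]=0", // 0
        "T1: W[o.f1]=1",                           // 1
        "T1: Wv[v1]=o",                            // 2
        "T2: Rv[v1]",                              // 3
        "T2: W[v2]=o",                             // 4
        "T3: R[v2]",                               // 5
        "T3: R[o.f1]"                              // 6
    };

    // Returns hb, where hb[a][b] means that action a happens-before action b.
    static boolean[][] computeHb() {
        int n = ACTIONS.length;
        boolean[][] hb = new boolean[n][n];
        int[][] edges = {
            {0, 1}, {0, 3}, {0, 5}, // default writes synchronize-with each thread's first action
            {1, 2}, {3, 4}, {5, 6}, // program order
            {2, 3}                  // volatile write of v1 synchronizes-with the volatile read that sees it
            // no {4, 5}: a plain write and a plain read of v2 create no synchronizes-with edge
        };
        for (int[] e : edges) hb[e[0]][e[1]] = true;
        for (int k = 0; k < n; k++)          // Warshall's transitive closure
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (hb[i][k] && hb[k][j]) hb[i][j] = true;
        return hb;
    }

    public static void main(String[] args) {
        boolean[][] hb = computeHb();
        System.out.println(ACTIONS[1] + " hb " + ACTIONS[3] + ": " + hb[1][3]); // true
        System.out.println(ACTIONS[1] + " hb " + ACTIONS[6] + ": " + hb[1][6]); // false
    }
}
```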

To be fair, this surprised me: until that moment I thought the first safe publication was enough.
As I understand it, effectively immutable objects (described in such popular books as Effective Java and Java Concurrency in Practice) are also affected by this problem.

Tom's explanation seems perfectly valid to me according to happens-before consistency in the JMM.
But there is also the causality part of the JMM, which adds constraints on top of happens-before. So maybe the causality part somehow guarantees that the first safe publication is enough.
(I cannot say that I fully understand the causality part, but I think I would understand an example with commit sets and executions.)

So I have 2 related questions:

  1. Does the causality part of the JMM allow or forbid Thread 3 to see o as partially constructed?
  2. Are there any other reasons why Thread 3 is allowed or prohibited to see o as partially constructed?
  • Interesting question! I would say that since there's a hb relation between t1 and t2, then there cannot be an implicit non-hb relation between t1 and t3 since they communicate through t2. That would mean that another thread could subvert the hb relation between 2 other threads. – Erik Feb 16 '21 at 08:56
  • If I am not mistaken, this is at least the second time you do this: open a fantastic question, answer yourself with a fantastic answer, then disappear. Pity, great pity. – Eugene Mar 01 '21 at 04:58
  • @Eugene I was thinking the same – dreamcrash Mar 01 '21 at 06:18

2 Answers


Partial answer: how "unsafe republication" works on OpenJDK today.
(This is not the ultimate general answer I would like to get, but at least it shows what to expect on the most popular Java implementation)

In short, it depends on how the object was published initially:

  1. if initial publication is done through a volatile variable, then "unsafe republication" is most probably safe, i.e. you will most probably never see the object as partially constructed
  2. if initial publication is done through a synchronized block, then "unsafe republication" is most probably unsafe, i.e. you will most probably be able to see the object as partially constructed

I say "most probably" because I base my answer on the assembly generated by the JIT for my test program, and, since I am not a JIT expert, it would not surprise me if the JIT generated totally different machine code on someone else's computer.


For tests I used OpenJDK 64-Bit Server VM (build 11.0.9+11-alpine-r1, mixed mode) on ARMv8.
ARMv8 was chosen because it has a very relaxed memory model, which requires memory barrier instructions in both publisher and reader threads (unlike x86).

1. Initial publication through a volatile variable: most probably safe

The test Java program is the same as in the question (I only added one more thread, runVolT3, to see what assembly code is generated for a volatile read):

package org.sample;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(value = 1,
    jvmArgsAppend = {"-Xmx512m", "-server", "-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintAssembly",
        "-XX:+PrintInterpreter", "-XX:+PrintNMethods", "-XX:+PrintNativeNMethods",
        "-XX:+PrintSignatureHandlers", "-XX:+PrintAdapterHandlers", "-XX:+PrintStubCode",
        "-XX:+PrintCompilation", "-XX:+PrintInlining", "-XX:+TraceClassLoading",})
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Threads(4)
public class VolTest {

  static class Obj1 {
    int f1 = 0;
  }

  @State(Scope.Group)
  public static class State1 {
    volatile Obj1 v1 = new Obj1();
    Obj1 v2 = new Obj1();
  }

  @Group @Benchmark @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public void runVolT1(State1 s) {
    Obj1 o = new Obj1();  /* 43 */
    o.f1 = 1;             /* 44 */
    s.v1 = o;             /* 45 */
  }

  @Group @Benchmark @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public void runVolT2(State1 s) {
    s.v2 = s.v1;          /* 52 */
  }

  @Group @Benchmark @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public int runVolT3(State1 s) {
    return s.v1.f1;       /* 59 */
  }

  @Group @Benchmark @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public int runVolT4(State1 s) {
    return s.v2.f1;       /* 66 */
  }
}

Here is the assembly generated by JIT for runVolT3 and runVolT4:

Compiled method (c1)   26806  529       2       org.sample.VolTest::runVolT3 (8 bytes)
  ...
[Constants]
  # {method} {0x0000fff77cbc4f10} 'runVolT3' '(Lorg/sample/VolTest$State1;)I' in 'org/sample/VolTest'
  # this:     c_rarg1:c_rarg1
                        = 'org/sample/VolTest'
  # parm0:    c_rarg2:c_rarg2
                        = 'org/sample/VolTest$State1'
  ...
[Verified Entry Point]
  ...
                                                ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT3@0 (line 59)

  0x0000fff781a60938: dmb       ish
  0x0000fff781a6093c: ldr       w0, [x2, #12]   ; implicit exception: dispatches to 0x0000fff781a60984
  0x0000fff781a60940: dmb       ishld           ;*getfield v1 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT3@1 (line 59)

  0x0000fff781a60944: ldr       w0, [x0, #12]   ;*getfield f1 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT3@4 (line 59)
                                                ; implicit exception: dispatches to 0x0000fff781a60990
  0x0000fff781a60948: ldp       x29, x30, [sp, #48]
  0x0000fff781a6094c: add       sp, sp, #0x40
  0x0000fff781a60950: ldr       x8, [x28, #264]
  0x0000fff781a60954: ldr       wzr, [x8]       ;   {poll_return}
  0x0000fff781a60958: ret

...

Compiled method (c2)   27005  536       4       org.sample.VolTest::runVolT3 (8 bytes)
  ...
[Constants]
  # {method} {0x0000fff77cbc4f10} 'runVolT3' '(Lorg/sample/VolTest$State1;)I' in 'org/sample/VolTest'
  # this:     c_rarg1:c_rarg1
                        = 'org/sample/VolTest'
  # parm0:    c_rarg2:c_rarg2
                        = 'org/sample/VolTest$State1'
  ...
[Verified Entry Point]
  ...
                                                ; - org.sample.VolTest::runVolT3@-1 (line 59)
  0x0000fff788f692f4: cbz       x2, 0x0000fff788f69318
  0x0000fff788f692f8: add       x10, x2, #0xc
  0x0000fff788f692fc: ldar      w11, [x10]      ;*getfield v1 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT3@1 (line 59)

  0x0000fff788f69300: ldr       w0, [x11, #12]  ;*getfield f1 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT3@4 (line 59)
                                                ; implicit exception: dispatches to 0x0000fff788f69320
  0x0000fff788f69304: ldp       x29, x30, [sp, #16]
  0x0000fff788f69308: add       sp, sp, #0x20
  0x0000fff788f6930c: ldr       x8, [x28, #264]
  0x0000fff788f69310: ldr       wzr, [x8]       ;   {poll_return}
  0x0000fff788f69314: ret

...

Compiled method (c1)   26670  527       2       org.sample.VolTest::runVolT4 (8 bytes)
  ...
[Constants]
  # {method} {0x0000fff77cbc4ff0} 'runVolT4' '(Lorg/sample/VolTest$State1;)I' in 'org/sample/VolTest'
  # this:     c_rarg1:c_rarg1
                        = 'org/sample/VolTest'
  # parm0:    c_rarg2:c_rarg2
                        = 'org/sample/VolTest$State1'
  ...
[Verified Entry Point]
  ...
                                                ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT4@0 (line 66)

  0x0000fff781a604b8: ldr       w0, [x2, #16]   ;*getfield v2 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT4@1 (line 66)
                                                ; implicit exception: dispatches to 0x0000fff781a604fc
  0x0000fff781a604bc: ldr       w0, [x0, #12]   ;*getfield f1 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT4@4 (line 66)
                                                ; implicit exception: dispatches to 0x0000fff781a60508
  0x0000fff781a604c0: ldp       x29, x30, [sp, #48]
  0x0000fff781a604c4: add       sp, sp, #0x40
  0x0000fff781a604c8: ldr       x8, [x28, #264]
  0x0000fff781a604cc: ldr       wzr, [x8]       ;   {poll_return}
  0x0000fff781a604d0: ret

...

Compiled method (c2)   27497  535       4       org.sample.VolTest::runVolT4 (8 bytes)
  ...
[Constants]
  # {method} {0x0000fff77cbc4ff0} 'runVolT4' '(Lorg/sample/VolTest$State1;)I' in 'org/sample/VolTest'
  # this:     c_rarg1:c_rarg1
                        = 'org/sample/VolTest'
  # parm0:    c_rarg2:c_rarg2
                        = 'org/sample/VolTest$State1'
  ...
[Verified Entry Point]
  ...
                                                ; - org.sample.VolTest::runVolT4@-1 (line 66)
  0x0000fff788f69674: ldr       w11, [x2, #16]  ;*getfield v2 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT4@1 (line 66)
                                                ; implicit exception: dispatches to 0x0000fff788f69690
  0x0000fff788f69678: ldr       w0, [x11, #12]  ;*getfield f1 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.VolTest::runVolT4@4 (line 66)
                                                ; implicit exception: dispatches to 0x0000fff788f69698
  0x0000fff788f6967c: ldp       x29, x30, [sp, #16]
  0x0000fff788f69680: add       sp, sp, #0x20
  0x0000fff788f69684: ldr       x8, [x28, #264]
  0x0000fff788f69688: ldr       wzr, [x8]       ;   {poll_return}
  0x0000fff788f6968c: ret

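Let's note which barrier instructions the generated assembly contains (stlr and ldar are the store-release and load-acquire instructions, which act as one-way barriers):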

  • runVolT1 (the assembly isn't shown above because it's too long):
    • c1 version contains 1x dmb ishst, 2x dmb ish
    • c2 version contains 1x dmb ishst, 1x dmb ish, 1x stlr
  • runVolT3 (which reads volatile v1):
    • c1 version contains 1x dmb ish, 1x dmb ishld
    • c2 version contains 1x ldar
  • runVolT4 (which reads nonvolatile v2): no memory barriers

As you see, runVolT4 (which reads the object after unsafe republication) doesn't contain memory barriers.

Does this mean that the thread can see the object in a semi-initialized state?
It turns out it cannot: on ARMv8 it is safe nonetheless.

Why?
Look at return s.v2.f1; in the code. Here the CPU performs two memory reads:

  • first it reads s.v2, which contains the memory address of object o
  • then it reads the value of o.f1 from (memory address of o) + (offset of field f1 within Obj1)

The memory address for the o.f1 read is computed from the value returned by the s.v2 read; this is a so-called "address dependency".

On ARMv8 such an address dependency prevents reordering of these two reads (see the MP+dmb.sy+addr example in "Modelling the ARMv8 architecture, operationally: concurrency and ISA"; you can try it yourself in ARM's Memory Model Tool), so we are guaranteed to see the object referenced by v2 as fully initialized.

Memory barrier instructions in runVolT3 serve a different purpose: they prevent reordering of the volatile read of s.v1 with other actions within the thread (in Java, a volatile read is one of the synchronization actions, which must be totally ordered).

Moreover, it turns out that today, on all architectures supported by OpenJDK, an address dependency prevents reordering of reads (see "Dependent loads can be reordered" in this table on Wikipedia, or "Data dependency orders loads?" in the table in The JSR-133 Cookbook for Compiler Writers).

As a result, today on OpenJDK if an object is initially published through a volatile field, then it will most likely be visible as fully initialized even after unsafe republication.

2. Initial publication through a synchronized block: most probably unsafe

The situation is different when the initial publication is done through a synchronized block:

class Obj1 {
  int f1 = 0;
}

final Object lock = new Object();
Obj1 v1;
Obj1 v2;

Thread 1              | Thread 2              | Thread 3
----------------------------------------------------------------
synchronized (lock) { |                       |
  var o = new Obj1(); |                       |
  o.f1 = 1;           |                       |
  v1 = o;             |                       |
}                     |                       |
                      | Obj1 r1;              |
                      | synchronized (lock) { |
                      |   r1 = v1;            |
                      | }                     |
                      | v2 = r1;              |
                      |                       | var r2 = v2.f1;

Is (r2 == 0) possible?

Here the generated assembly for Thread 3 is the same as for runVolT4 above: it contains no memory barrier instructions. As a result, Thread 3 can easily see writes from Thread 1 out of order.

And generally, unsafe republication in such cases is most probably unsafe today on OpenJDK.
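If portable behavior is needed rather than whatever OpenJDK happens to emit today, one option is to make the republication safe as well. This is a minimal sketch that assumes it is acceptable to declare v2 volatile; the class SafeRepublication and its methods are hypothetical. With v2 volatile, o.f1 = 1 happens-before the volatile write of v1, which synchronizes-with Thread 2's read of v1. That read is followed in program order by the volatile write of v2, which synchronizes-with Thread 3's read of v2. Any thread that sees o through v2 is therefore guaranteed by the JMM itself, not by a particular JIT or CPU, to see o.f1 == 1.

```java
final class SafeRepublication {
    static final class Obj1 {
        int f1 = 0;
    }

    static volatile Obj1 v1;
    static volatile Obj1 v2;   // volatile here adds the missing edge between Thread 2 and Thread 3

    static void thread1() {
        var o = new Obj1();
        o.f1 = 1;   // happens-before the volatile write below (program order)
        v1 = o;     // volatile write: synchronizes-with any volatile read that sees it
    }

    static void thread2() {
        v2 = v1;    // volatile read of v1, then volatile write of v2
    }

    // Returns -1 if v2 was not published yet, otherwise the value of f1 that was seen.
    static int thread3() {
        Obj1 p = v2;                    // volatile read of v2
        return p == null ? -1 : p.f1;
    }

    static int runOnce() {
        v1 = null;
        v2 = null;
        int[] r = {-1};
        Thread t1 = new Thread(SafeRepublication::thread1);
        Thread t2 = new Thread(SafeRepublication::thread2);
        Thread t3 = new Thread(() -> r[0] = thread3());
        t1.start(); t2.start(); t3.start();
        try {
            t1.join(); t2.join(); t3.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return r[0];
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000; i++) {
            if (runOnce() == 0) throw new AssertionError("partially constructed object observed");
        }
        System.out.println("r1 == 0 never observed");
    }
}
```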

  • Impressive work! Of course I like it since it supports my original suspicion but great work regardless! – Erik Feb 22 '21 at 10:21
  • @Erik this question has bothered me for days. since `volatile` offers sequential consistency + `jls` does not allow out of thin air values; to me, when using `volatile` you can never see `r1 == 0`. With `synchronized`, this lock could be elided, if JIT can prove it is not needed, so I guess that is why "no barriers" in the latter case. – Eugene Feb 22 '21 at 16:57
  • your analysis is fantastic. I've tried to reproduce this with `jcstress` for a few days, and could not (but I am on `x86`). I am still struggling to build the proper model in my head according to `JLS` though. – Eugene Feb 22 '21 at 20:46
  • Can I ask you on what hardware did you run this? Where did u get an ARM laptop/computer? – Eugene Feb 24 '21 at 04:18
  • @Eugene I used an x86 machine: I used kvm+[qemu](https://qemu.readthedocs.io/en/latest/system/targets.html) to create a virtual ARMv8 machine for the tests. I'm not sure that qemu emulates all the relaxations of the ARM memory model, but the assembly generated by the JIT should be the same as on real ARMv8. –  Feb 24 '21 at 04:39
  • @wkdtbqmw oh! That makes it even more awesome, the amount of time you dedicated to this... – Eugene Feb 24 '21 at 04:45

Answer: Causality part of the JMM allows Thread 3 to see o as partially constructed.

I finally managed to apply 17.4.8. Executions and Causality Requirements (aka the causality part of the JMM) to this example.

So this is our Java program:

class Obj1 {
  int f1;
}

volatile Obj1 v1;
Obj1 v2;

Thread 1            | Thread 2 | Thread 3
--------------------|----------|-----------------
var o = new Obj1(); |          |
o.f1 = 1;           |          |
v1 = o;             |          |
                    | v2 = v1; |
                    |          | var r1 = v2.f1;

And we want to find out if the result (r1 == 0) is allowed.

It turns out that to prove that (r1 == 0) is allowed, we need to find a well-formed execution that gives that result and can be validated with the algorithm given in 17.4.8. Executions and Causality Requirements.

First let's rewrite our Java program in terms of variables and actions as defined in the algorithm.
Let's also show the values for our read and write actions to get the execution E we want to validate:

Initially: W[v1]=null, W[v2]=null, W[o.f1]=0

Thread 1  | Thread 2 | Thread 3
----------|----------|-----------
W[o.f1]=1 |          |
Wv[v1]=o  |          |
          | Rv[v1]=o |
          | W[v2]=o  |
          |          | R[v2]=o
          |          | R[o.f1]=0

Notes:

  • o represents the instance created by new Obj1(); in the Java code
  • W and R represent normal writes and reads; Wv and Rv represent volatile writes and reads
  • read/written value for the action is shown after =
  • W[o.f1]=0 is in the initial actions because according to the JLS:

    The write of the default value (zero, false, or null) to each variable synchronizes-with the first action in every thread.
    Although it may seem a little strange to write a default value to a variable before the object containing the variable is allocated, conceptually every object is created at the start of the program with its default initialized values.

Here is a more compact form of E:

W[v1]=null, W[v2]=null, W[o.f1]=0
---------------------------------
W[o.f1]=1 |          |
Wv[v1]=o  |          |
          | Rv[v1]=o |
          | W[v2]=o  |
          |          | R[v2]=o
          |          | R[o.f1]=0

Validation of E

According to 17.4.8. Executions and Causality Requirements:

A well-formed execution E = < P, A, po, so, W, V, sw, hb > is validated by committing actions from A. If all of the actions in A can be committed, then the execution satisfies the causality requirements of the Java programming language memory model.

So we need to build, step by step, the set of committed actions (we get a sequence C₀, C₁, ..., where Cₖ is the set of actions committed at the k-th iteration, and Cₖ ⊆ Cₖ₊₁) until we commit all actions A of our execution E.
The JLS section also contains 9 rules that define when an action can be committed.

  • Step 0: the algorithm always starts with an empty set.

    C₀ = ∅
    
  • Step 1: we commit only writes.
    The reason is that, according to rule 7, a read newly committed in Cₖ must see a write from Cₖ₋₁, but C₀ is empty.

    E₁:
    
    W[v1]=null, W[v2]=null, W[o.f1]=0
    ----------------------------------
    W[o.f1]=1 |          |
    Wv[v1]=o  |          |
    
    C₁ = { W[v1]=null, W[v2]=null, W[o.f1]=0, W[o.f1]=1, Wv[v1]=o }
    
  • Step 2: now we can commit the read and the write of o in Thread 2.
    Since v1 is volatile, Wv[v1]=o synchronizes-with (and therefore happens-before) Rv[v1], and the read returns o.

    E₂:
    
    W[v1]=null, W[v2]=null, W[o.f1]=0
    ---------------------------------
    W[o.f1]=1 |          |
    Wv[v1]=o  |          |
              | Rv[v1]=o |
              | W[v2]=o  |
    
    C₂ = C₁∪{ Rv[v1]=o, W[v2]=o }
    
  • Step 3: now that we have W[v2]=o committed, we can commit the read R[v2] in Thread 3.
    According to rule 6, a read that is not yet committed (not in Cₖ₋₁) must see a write that happens-before it in Eₖ (once the read is committed, its value can be changed to a racy write on the next step).
    R[v2] and W[v2]=o are not ordered by happens-before, so in E₃ R[v2] reads null.

    E₃:
    
    W[v1]=null, W[v2]=null, W[o.f1]=0
    ---------------------------------
    W[o.f1]=1 |          |
    Wv[v1]=o  |          |
              | Rv[v1]=o |
              | W[v2]=o  |
              |          | R[v2]=null
    
    C₃ = C₂∪{ R[v2]=null }
    
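  • Step 4: now R[v2] can read W[v2]=o through a data race (by rule 5, a read committed in an earlier step must see its final write W(r)), and this makes R[o.f1] possible.
    R[o.f1] reads the default value 0 (allowed by rule 6, since W[o.f1]=0 happens-before R[o.f1]), and the algorithm finishes because all the actions of our execution are committed.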

    E = E₄:
    
    W[v1]=null, W[v2]=null, W[o.f1]=0
    ---------------------------------
    W[o.f1]=1 |          |
    Wv[v1]=o  |          |
              | Rv[v1]=o |
              | W[v2]=o  |
              |          | R[v2]=o
              |          | R[o.f1]=0
    
    A = C₄ = C₂∪{ R[v2]=o, R[o.f1]=0 }
    

As a result, we have validated an execution that produces (r1 == 0); therefore, this result is allowed.
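The validation above can also be replayed in code. The following sketch uses a hypothetical class CausalityCheck, with actions numbered 0 to 8 as listed in its first comment. It checks only rules 5, 6 and 7 of 17.4.8 (the ones about reads), plus Cₖ₋₁ ⊆ Cₖ. It assumes, as in the steps above, that each Eₖ contains exactly the actions of Cₖ and that hb is the same in every Eₖ, which holds here because Rv[v1] always sees Wv[v1]=o. Rules 1 to 4, 8 and 9 are not checked. The last line shows why step 3 first commits R[v2] as reading null: letting it see W[v2]=o right away breaks rule 6.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CausalityCheck {
    // Hypothetical numbering of the actions of E:
    //   0 W[v1]=null   1 W[v2]=null   2 W[o.f1]=0   (initial writes)
    //   3 W[o.f1]=1    4 Wv[v1]=o                   (Thread 1)
    //   5 Rv[v1]       6 W[v2]=o                    (Thread 2)
    //   7 R[v2]        8 R[o.f1]                    (Thread 3)
    static final int N = 9;
    static final Set<Integer> READS = Set.of(5, 7, 8);

    // W: the write that each read sees in the final execution E
    static final Map<Integer, Integer> W = Map.of(5, 4, 7, 6, 8, 2);

    // hb of E; assumed identical in every Ek because Rv[v1] always sees Wv[v1]=o
    static final boolean[][] HB = closure(new int[][] {
        {0, 3}, {1, 3}, {2, 3}, {0, 5}, {1, 5}, {2, 5}, {0, 7}, {1, 7}, {2, 7}, // default writes sw first actions
        {3, 4}, {5, 6}, {7, 8},                                                  // program order
        {4, 5}                                                                   // volatile write sw volatile read
    });

    // Transitive closure (Warshall's algorithm)
    static boolean[][] closure(int[][] edges) {
        boolean[][] hb = new boolean[N][N];
        for (int[] e : edges) hb[e[0]][e[1]] = true;
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    if (hb[i][k] && hb[k][j]) hb[i][j] = true;
        return hb;
    }

    // Checks that C(k-1) is a subset of C(k) and rules 5-7 for one step;
    // wk maps each read of Ek to the write it sees.
    static List<String> checkStep(Set<Integer> prev, Set<Integer> cur, Map<Integer, Integer> wk) {
        List<String> violations = new ArrayList<>();
        if (!cur.containsAll(prev)) violations.add("C(k-1) is not a subset of C(k)");
        for (int r : READS) {
            if (prev.contains(r) && !wk.get(r).equals(W.get(r)))
                violations.add("rule 5: read " + r + " no longer sees W(r)");
            if (cur.contains(r) && !prev.contains(r)) {   // Ak == Ck here, so Ak - C(k-1) == Ck - C(k-1)
                if (!HB[wk.get(r)][r])
                    violations.add("rule 6: write " + wk.get(r) + " does not hb read " + r);
                if (!prev.contains(wk.get(r)) || !prev.contains(W.get(r)))
                    violations.add("rule 7: read " + r + " sees an uncommitted write");
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<Set<Integer>> c = List.of(
            Set.of(),
            Set.of(0, 1, 2, 3, 4),
            Set.of(0, 1, 2, 3, 4, 5, 6),
            Set.of(0, 1, 2, 3, 4, 5, 6, 7),
            Set.of(0, 1, 2, 3, 4, 5, 6, 7, 8));
        List<Map<Integer, Integer>> w = List.of(
            Map.of(),
            Map.of(),                       // E1: no reads yet
            Map.of(5, 4),                   // E2: Rv[v1] sees Wv[v1]=o
            Map.of(5, 4, 7, 1),             // E3: R[v2] sees W[v2]=null
            Map.of(5, 4, 7, 6, 8, 2));      // E4 = E: R[v2] sees o, R[o.f1] sees 0
        for (int k = 1; k < c.size(); k++)
            System.out.println("step " + k + ": " + checkStep(c.get(k - 1), c.get(k), w.get(k))); // []
        System.out.println("all actions committed: " + (c.get(4).size() == N));
        // Letting R[v2] see W[v2]=o already in step 3 violates rule 6
        System.out.println("shortcut: " + checkStep(c.get(2), c.get(3), Map.of(5, 4, 7, 6)));
    }
}
```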


It is also worth noting that this causality validation algorithm adds almost no restrictions on top of happens-before.
Jeremy Manson (one of the JMM authors) explains that the algorithm exists to prevent a rather bizarre behavior, the so-called "causality loops", where there is a circular chain of actions that cause each other (i.e., an action ends up causing itself).
In every case other than these causality loops, we reason with happens-before, as in Tom's comment.
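For contrast, here is the kind of program the causality rules are there for. It is a sketch of the example the JLS uses in 17.4.8 to show that happens-before consistency is not sufficient; the harness around it (class CausalityLoop and runOnce) is hypothetical. The program is correctly synchronized, because in every sequentially consistent execution neither write happens. Happens-before consistency alone would still allow r1 == r2 == 1, with each write justifying the read that causes it. That is exactly a causality loop, and the causality rules forbid it, so every run must end with r1 == r2 == 0.

```java
final class CausalityLoop {
    static int x, y;   // plain (non-volatile) fields, initially 0

    // Returns {r1, r2} for one run of the two-thread program.
    static int[] runOnce() {
        x = 0;
        y = 0;
        int[] r = new int[2];
        Thread t1 = new Thread(() -> {
            int r1 = x;
            if (r1 != 0) y = 1;
            r[0] = r1;
        });
        Thread t2 = new Thread(() -> {
            int r2 = y;
            if (r2 != 0) x = 1;
            r[1] = r2;
        });
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return r;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000; i++) {
            int[] r = runOnce();
            // r1 == r2 == 1 would need each write to justify the read that causes it: a causality loop
            if (r[0] == 1 && r[1] == 1) throw new AssertionError("out-of-thin-air result");
        }
        System.out.println("r1 == r2 == 1 never observed");
    }
}
```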

  • that causality in the `JLS` is supposed to explain (and prohibit) a simple thing: `OoTA`. Unfortunately, every time I try to read it and explain it, my head hurts. You have done a great job here (I don't have a more powerful adjective) and I've read it 3 times, slowly, seems to make perfect sense. Overall, fantastic dedication. – Eugene Feb 23 '21 at 17:39
  • @Eugene Thank you for kind words. And an even bigger "thank you" for reading the text 3 times and checking it for errors — I really appreciate it because this is the 1st time I applied the causality algorithm to a program, so errors are very probable. Regarding "every time I try to read it and explain it, my head hurts" — I also don't understand a lot in it (i.e. I hope I understand how to apply the causality algorithm, but I still don't understand why various steps and rules in the algorithm are the way they are). –  Feb 23 '21 at 18:40
  • @Eugene "that causality in the `JLS` is supposed to explain (and prohibit) a simple thing: `OoTA`" I am not sure it is simple: in [Jeremy Manson's thesis](https://drum.lib.umd.edu/bitstream/handle/1903/1949/umi-umd-1898.pdf) (which is the most detailed explanation of the JMM) he says: "Determining what constitutes an out-of-thin-air read is complicated", and then spends a huge chunk of the thesis explaining it. –  Feb 23 '21 at 18:58
  • yeah, it was somehow wrong for me to say "simple". The `OoTA` is not simple, but, afaik, can only be achieved when there are speculative reads. This is the absolute great simplification I have built in my head around it, and it has spilled into the comment; sorry about that. There is a reason `C/C++` has not even tried to specify this, they just say: "welcome to undefined behavior". – Eugene Feb 23 '21 at 19:04
  • regarding the OoTA: yesterday I found an [answer on SO](https://stackoverflow.com/a/26500593) which mentions [Outlawing Ghosts: Avoiding Out-of-Thin-Air Results](https://dl.acm.org/doi/pdf/10.1145/2618128.2618134). The explanations of OoTA in this paper are very clear, and reading them really helped me to understand the OoTA parts of [Jeremy Manson's thesis](https://drum.lib.umd.edu/bitstream/handle/1903/1949/umi-umd-1898.pdf) much better (in the thesis OoTA is explained mostly via examples of acceptable/unacceptable behavior). So I would recommend the paper to anyone interested in OoTA. –  Feb 25 '21 at 07:02
  • indeed an excellent paper, what a gem. My weekend reading is going to be awesome, thank u so much for this – Eugene Feb 26 '21 at 12:58
  • I did not like the paper after reading it. The examples with "speculative reads" are far too complicated (under-explained for the average reader at least). And their proposal is, as far as I understand, a `LoadStore` before every store. Has this been measured? Yeah, on x86 not so bad, but on weaker? A `lsync` before almost every store? I am not the right person to judge this, but I have to scratch my head twice. – Eugene Mar 01 '21 at 05:02