210

A sensitive operation in my lab today went completely wrong. An actuator on an electron microscope went over its boundary, and after a chain of events I lost $12 million of equipment. I've narrowed down over 40K lines in the faulty module to this:

import java.util.*;

class A {
    static Point currentPos = new Point(1, 2);
    static class Point {
        int x;
        int y;
        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }
    public static void main(String[] args) {
        new Thread() {
            void f(Point p) {
                synchronized(this) {}
                if (p.x + 1 != p.y) {
                    System.out.println(p.x + " " + p.y);
                    System.exit(1);
                }
            }
            @Override
            public void run() {
                while (currentPos == null);
                while (true)
                    f(currentPos);
            }
        }.start();
        while (true)
            currentPos = new Point(currentPos.x + 1, currentPos.y + 1);
    }
}

Some samples of the output I'm getting:

$ java A
145281 145282
$ java A
141373 141374
$ java A
49251 49252
$ java A
47007 47008
$ java A
47427 47428
$ java A
154800 154801
$ java A
34822 34823
$ java A
127271 127272
$ java A
63650 63651

Since there isn't any floating point arithmetic here, and we all know signed integers behave well on overflow in Java, I'd think there's nothing wrong with this code. However, despite the output indicating that the program didn't reach the exit condition, it reached the exit condition (it was both reached and not reached?). Why?


I've noticed this doesn't happen in some environments. I'm on OpenJDK 6 on 64-bit Linux.

Lii
  • 11,553
  • 8
  • 64
  • 88
Dog
  • 7,707
  • 8
  • 40
  • 74
  • 41
    12 milion of equipment ? i am really curious how that could happen... why you are using empty synchronization block : synchronized(this) {} ? – Martin V. Apr 23 '13 at 01:07
  • 86
    This isn't even remotely thread-safe. – Matt Ball Apr 23 '13 at 01:14
  • 1
    @MattBall, and this is reduced code. Obviously the write to `currentPos` doesn't `happen-before` the read of it, but I don't see how that can be the issue. – Dog Apr 23 '13 at 01:23
  • 1
    @MartinV. because that's the only way I could reproduce the behavior. Locking on about anything does the same thing. – Dog Apr 23 '13 at 01:24
  • 9
    Interesting to note: adding the `final` qualifier (which has no effect on produced bytecode) to the fields `x` and `y` "solves" the bug. Although it does not affect bytecode, the fields are flagged with it, which leads me to think this is a side-effect of a JVM optimization. – Niv Steingarten Apr 23 '13 at 02:26
  • 3
    @Dog Also the output, is very much 'expected'. Two independent Threads acting on shared data without any type of synchronization, there is no way to tell the result of this. It might even never end at all – Eugene Apr 23 '13 at 08:03
  • 9
    @Eugene: It should **not** end. The question is "why does it end?". A `Point` `p` is constructed which satisfies `p.x+1 == p.y`, then a **reference** is passed to the polling thread. Eventually the polling thread decides to exit because it thinks the condition isn't satisfied for one of the `Point`s it receives, but then the console output shows that it should have been satisfied. The lack of `volatile` here simply means that the polling thread may get stuck, but that clearly isn't the problem here. – Erma K. Pizarro Apr 23 '13 at 12:52
  • 2
    @ErmaK.Pizarro if these two lines : if (p.x+1 != p.y) System.out.println(p.x+" "+p.y); come one after another, it does not mean that they(together) are atomic. Anything can happen between them. Unless we know what he wants, it is impossible to answer – Eugene Apr 23 '13 at 13:06
  • 2
    @Eugene, I don't think you're reading the code. `p.x` and `p.y` can't be modified, they are only set in the constructor of `Point`. The only way to get a `Point` is to call the constructor which doesn't return until `p.x` and `p.y` are set. Clearly no code here modifies an already existing `Point`. – Dog Apr 23 '13 at 14:50
  • 2
    I tried very hard to simplify this. See http://stackoverflow.com/questions/16178020/uninitialized-object-leaked-to-another-thread-despite-no-code-explicitly-leaking – Dog Apr 23 '13 at 19:57
  • 3
    @MattBall: You appear not to know very much about Java concurrency. This program can be made thread safe simply by making `currentPos` `volatile`. That would ensure the writes within the constructor `happen-before` the write of `currentPos`. Another way is to simply make the fields in `Point` `final`, which would work because of final field semantics section of JLS. The synchronized block is simply a noop which OP put to reproduce the problem. Even without knowledge of the Java Memory Model, most people would assume this program is thread safe. – L̲̳o̲̳̳n̲̳̳g̲̳̳p̲̳o̲̳̳k̲̳̳e̲̳̳ Jun 29 '13 at 19:06
  • 2
    I guess this is an example of why testing is considered more than desirable. – John Nicholas Aug 15 '13 at 16:03
  • 23
    @JohnNicholas: The real code (which is obviously not this) had 100% test coverage and thousands of tests, lots of which tested things in thousands of various orders and permutations... Testing doesn't magically find every edge case caused by nondeterministic JIT/cache/scheduler. The real problem is that the developer who wrote this code didn't know that construction doesn't happen before using the object. Notice how removing the empty `synchronized` makes the bug not happen? That's because I had to randomly write code until I found one that would reproduce this behaviour deterministically. – Dog Aug 15 '13 at 16:15
  • 1
    cool, you should update your question with that info about synchronized btw :D Am personally finding this one fascinating as it unfolds. – John Nicholas Aug 15 '13 at 16:20
  • In cases where software failure can cause events with Bad Consequences you should *always* have some kind of independent supervisory feature. Medical/aerospace/automotive/rail/nuclear applications know this. – Jason S Aug 16 '13 at 02:20
  • 4
    Why didn't it use a hardware interlock! have we not learnt anything since 1985? (Therac-25) : http://www.codingninja.co.uk/trust-me-im-a-programmer/ – Darknight Aug 16 '13 at 08:45
  • 6
    @Dog Is this http://news.nationalpost.com/2013/08/16/lion-outed-as-dog-at-chinese-zoo-after-barking-outfit-had-dog-in-wolf-cage-fox-in-leopard-enclosure/ you? – Eric des Courtis Aug 16 '13 at 18:01
  • +1 to Darknight regarding Therac-25. At least this incident only resulted in equipment damage instead of killing people. – Stuart Marks Aug 17 '13 at 05:51

5 Answers5

141

Obviously the write to currentPos doesn't happen-before the read of it, but I don't see how that can be the issue.

currentPos = new Point(currentPos.x+1, currentPos.y+1); does a few things, including writing default values to x and y (0) and then writing their initial values in the constructor. Since your object is not safely published those 4 write operations can be freely reordered by the compiler / JVM.

So from the perspective of the reading thread, it is a legal execution to read x with its new value but y with its default value of 0 for example. By the time you reach the println statement (which by the way is synchronized and therefore does influence the read operations), the variables have their initial values and the program prints the expected values.

Marking currentPos as volatile will ensure safe publication since your object is effectively immutable - if in your real use case the object is mutated after construction, volatile guarantees won't be enough and you could see an inconsistent object again.

Alternatively, you can make the Point immutable which will also ensure safe publication, even without using volatile. To achieve immutability, you simply need to mark x and y final.

As a side note and as already mentioned, synchronized(this) {} can be treated as a no-op by the JVM (I understand you included it to reproduce the behaviour).

assylias
  • 321,522
  • 82
  • 660
  • 783
  • 4
    I'm not sure, but wouldn't making x and y final have the same effect, avoiding the memory barrier? – Michael Böckling Aug 15 '13 at 15:19
  • 3
    A simpler design is an immutable point object that tests invariants on construction. So you never risk publishing a dangerous configuration. – Ron Aug 15 '13 at 15:26
  • @BuddyCasino Yes indeed - I have added that. To be honest I don't remember the whole discussion 3 months ago (using final was proposed in the comments so not sure why I did not include it as an option). – assylias Aug 15 '13 at 15:30
  • 2
    Immutability itself doesn't guarantee safe publication (if x an y were private but exposed with getters only, the same publication problem would still exist). final or volatile does guarantee it. I would prefer final over volatile. – Steve Kuo Aug 15 '13 at 20:29
  • @SteveKuo Immutability requires final - without final, the best you can get is effective immutability which does not have the same semantics. – assylias Aug 15 '13 at 21:48
  • I don't know any java. Is the problem the fact `static class Point` makes all `new` share the same variable/memory address so in the middle of the constructor it can write to one variable with the other thread reads another? (w/o locking/sync). Would this suffer from the bug if `static class` wasn't used? My gut tells me based on numbers the problem is p.x is being read, ctor is locking itself (is that a feature?) and assigning x,y then +p.y happens. The if statment he uses is weird why `.x+1==.y` it seems like `141373 141373 ` should happen –  Aug 16 '13 at 03:54
  • In the `while` block, when I extract `currentPos.x+1` and `currentPos.y+1` as two local variable, the error disappears...why is that? – grape_mao Feb 13 '15 at 14:00
29

Since currentPos is being changed outside of the thread it should be marked as volatile:

static volatile Point currentPos = new Point(1,2);

Without volatile the thread is not guaranteed to read in updates to currentPos that are being made in the main thread. So new values continue to be written for currentPos but thread is continuing to use the previous cached versions for performance reasons. Since only one thread modifies currentPos you can get away without locks which will improve performance.

The results look much different if you read the values only a single time within the thread for use in the comparison and subsequent displaying of them. When I do the following x always displays as 1 and y varies between 0 and some large integer. I think the behavior of it at this point is somewhat undefined without the volatile keyword and it's possible that the JIT compilation of the code is contributing to it acting like this. Also if I comment out the empty synchronized(this) {} block then the code works as well and I suspect it is because the locking causes sufficient delay that currentPos and its fields are reread rather than used from the cache.

int x = p.x + 1;
int y = p.y;

if (x != y) {
    System.out.println(x+" "+y);
    System.exit(1);
}
Ed Plese
  • 1,568
  • 10
  • 14
19

You have ordinary memory, the 'currentpos' reference and the Point object and its fields behind it, shared between 2 threads, without synchronisation. Thus, there is no defined ordering between the writes that happen to this memory in the main thread and the reads in the created thread (call it T).

Main thread is doing the following writes (ignoring the initial setup of point, will result in p.x and p.y having default values):

  • to p.x
  • to p.y
  • to currentpos

Because there is nothing special about these writes in terms of synchronisation/barriers, the runtime is free to allow the T thread see them occur in any order (the main thread of course always sees writes and reads ordered according to programme order), and occur at any point between the reads in T.

So T is doing:

  1. reads currentpos to p
  2. read p.x and p.y (in either order)
  3. compare, and take the branch
  4. read p.x and p.y (either order) and call System.out.println

Given there's no ordering relationships between the writes in main, and the reads in T, there are clearly several ways this can produce your result, as T may see main's write to currentpos before the writes to currentpos.y or currentpos.x:

  1. It reads currentpos.x first, before the x write has occurred - gets 0, then reads currentpos.y before the y write has occurred - gets 0. Compare evals to true. The writes become visible to T. System.out.println is called.
  2. It reads currentpos.x first, after the x write has occurred, then reads currentpos.y before the y write has occurred - gets 0. Compare evals to true. Writes become visible to T... etc.
  3. It reads currentpos.y first, before the y write has occurred (0), then reads currentpos.x after the x write, evals to true. etc.

and so on... There are a number of data races here.

I suspect the flawed assumption here is thinking that the writes that result from this line are made visible across all the threads in the programme order of the thread executing it:

currentPos = new Point(currentPos.x+1, currentPos.y+1);

Java makes no such guarantee (it'd be terrible for performance). Something more must be added if your programme needs a guaranteed ordering of the writes relative to reads in other threads. Others have suggested making the x,y fields final, or alternatively making currentpos volatile.

  • If you make the x,y fields final, then Java guarantees that the writes of their values will be seen to occur before the constructor returns, in all threads. Thus, as the assignment to currentpos is after the constructor, the T thread is guaranteed to see the writes in the correct order.
  • If you make currentpos volatile, then Java guarantees that that this is a synchronisation point which will be total-ordered wrt other synchronisation points. As in main the writes to x and y must happen before the write to currentpos, then any read of currentpos in another thread must see also the writes of x, y that happened before.

Using final has the advantage that it makes the fields immutable, and thus allows the values to be cached. Using volatile leads to synchronisation on every write and read of currentpos, which might hurt performance.

See chapter 17 of the Java Language Spec for the gory details:http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html

(Initial answer assumed a weaker memory model, as I was not sure the JLS guaranteed volatile was sufficient. Answer edited to reflect comment from assylias, pointing out the Java model is stronger - happens-before is transitive - and so volatile on currentpos also suffices).

paulj
  • 191
  • 1
  • 5
  • 2
    This is the best explanation in my opinion. Thanks a lot! – skyde Aug 16 '13 at 03:37
  • 1
    @skyde but wrong on the semantics of volatile. volatile guarantees that reads of a volatile variable will see the latest available write of a volatile variable *as well as any preceding write*. In this case, if `currentPos` is made volatile, the assignement ensures safe publication of the `currentPos` object as well as its members, even if they are not volatile themselves. – assylias Aug 16 '13 at 07:49
  • Well, I was saying that I couldn't, for myself, see exactly how the the JLS guaranteed that volatile formed a barrier with other, normal reads and writes. Technically, I can't be wrong on that ;). When it comes to memory models, it is prudent to assume an ordering is not guaranteed and be wrong (you're still safe) than the other way around and be wrong and unsafe. It's great if volatile provides that guarantee. Can you explain how ch 17 of the JLS provides it? – paulj Aug 16 '13 at 08:05
  • 2
    In short, in `Point currentPos = new Point(x, y)`, you have 3 writes: (w1) `this.x = x`, (w2) `this.y = y` and (w3) `currentPos = the new point`. The program order guarantees that hb(w1, w3) and hb(w2, w3). Later in the program you read (r1) `currentPos`. If `currentPos` is not volatile, there is no hb between r1 and the w1, w2, w3, so r1 could observe any (or none) of them. With volatile, you introduce hb(w3, r1). And the hb relationship is transitive so you also introduce hb(w1, r1) and hb(w2, r1). This is summarised in Java Concurrency in Practice (3.5.3. Safe Publication Idioms). – assylias Aug 16 '13 at 08:21
  • 2
    Ah, if hb is transitive in that way, then that's a strong enough 'barrier', yes. I have to say, it's not easy to determine that 17.4.5 of the JLS defines hb to have that property. It's certainly not in the list of properties given near the beginning of 17.4.5. Transitive closure is only mentioned further down after some explanatory notes! Anyway, good to know, thanks for the answer! :). Note: I will update my answer to reflect assylias' comment. – paulj Aug 16 '13 at 09:11
  • @paulj in [17.4.5](http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.4.5): *If hb(x, y) and hb(y, z), then hb(x, z).* – assylias Aug 16 '13 at 09:56
  • @assylias Groan... Obvious now you point it out to me. :). Thanks. – paulj Aug 16 '13 at 12:16
-2

You could use an object to synchronize the writes and reads. Otherwise, as others said before, a write to currentPos will occur in the middle of the two reads p.x+1 and p.y.

new Thread() {
    void f(Point p) {
        if (p.x+1 != p.y) {
            System.out.println(p.x+" "+p.y);
            System.exit(1);
        }
    }
    @Override
    public void run() {
        while (currentPos == null);
        while (true)
            f(currentPos);
    }
}.start();
Object sem = new Object();
while (true) {
    synchronized(sem) {
        currentPos = new Point(currentPos.x+1, currentPos.y+1);
    }
}
  • Actually this does the job. In my first attempt I put the read inside the synchronized block, but later I realized that wasn't really necessary. – Germano Fronza Aug 15 '13 at 21:55
  • 1
    -1 The JVM can prove that `sem` is not shared and treat the synchronized statement as a no-op... The fact that it solves the issue is pure luck. – assylias Aug 16 '13 at 09:55
  • 4
    I hate multi-threaded programming, far too many things work because of luck. – Jonathan Allen Aug 17 '13 at 21:37
-3

You are accessing currentPos twice, and providing no guarantee that it is not updated in between those two accesses.

For example:

  1. x = 10, y = 11
  2. worker thread evaluates p.x as 10
  3. main thread executes the update, now x = 11 and y = 12
  4. worker thread evaluates p.y as 12
  5. worker thread notices that 10+1 != 12, so prints and exits.

You are essentially comparing two different Points.

Note that even making currentPos volatile won't protect you from this, since it's two separate reads by the worker thread.

Add an

boolean IsValid() { return x+1 == y; }

method to your points class. This will ensure that only one value of currentPos is used when checking x+1 == y.

  • currentPos is only read once, its value copied into p. p is read twice, but it is always going to be pointing the same location. – Jonathan Allen Aug 15 '13 at 21:50