
I've run across a rather strange problem when running Java 8. It presents itself as if some sort of timing error is happening within the JVM itself. It is intermittent, but easily reproducible (at least within my test environments). The problem is that an array value which was explicitly set is being destroyed and replaced with 0.0 under certain circumstances. Specifically, in the code below, array[0] evaluates to 0.0 after the line new Double(r.nextDouble());. Then, if you immediately look at the contents of array[0] again, it shows the correct value of 1.0. Sample output from running this test case is:

claims array[0] != 1.0....array[0] = 1.0
claims array[0] now == 1.0...array[0] = 1.0

I'm running 64-bit Windows 7 and am able to reproduce this problem, both from within Eclipse and when compiling from the command line, with JDKs 1.8_45, 1.8_51, and 1.8_60. I am unable to reproduce the problem with 1.7_51. The same results have been demonstrated on another 64-bit Windows 7 box.

This problem appeared in a large, non-trivial piece of software, but I've managed to condense it down to a few lines of code. Below is a small test case that demonstrates the issue. It is a rather odd-looking test case, but all of it appears to be necessary to cause the error. The use of Random is not required; I can replace all r.nextDouble() calls with any double value and still demonstrate the issue. Interestingly enough, if someArray[0] = .45; is replaced with someArray[0] = r.nextDouble();, I could not replicate the issue (although there is nothing special about .45). Eclipse debugging is also of no help; it changes the timing enough that the problem no longer happens. Even a well-placed System.err.println() statement will cause the issue to disappear.

Again, the issue is intermittent, so to reproduce it you might have to run this test case several times. I think the most I've had to run it is around 10 times before I get the output shown above. In Eclipse, I give it a second or two after running and then kill it if it hasn't happened. From the command line it's the same: run it, and if the problem doesn't happen, Ctrl+C to quit and try again. It appears that if it's going to happen, it happens pretty quickly.

I've come across issues like this in the past, but they were all threading issues. I can't figure out what's going on here - I've even looked at bytecode (which was identical between 1.7_51 and 1.8_45, by the way).

Any ideas on what is happening here?

import java.util.Random;

public class Test { 
    Test(){
        double array[] = new double[1];     
        Random r = new Random();

        while(true){
            double someArray[] = new double[1];         
            double someArray2[] = new double[2];

            for(int i = 0; i < someArray2.length; i++) {
                someArray2[i] = r.nextDouble();
            }

            // for whatever reason, using r.nextDouble() here doesn't seem
            // to show the problem, but the # you use doesn't seem to matter either...

            someArray[0] = .45;

            array[0] = 1.0;

            // commented out lines also demonstrate problem
            new Double(r.nextDouble());
            // new Float(r.nextDouble());
            // double d = new Double(.1) * new Double(.3);
            // double d = new Double(.1) / new Double(.3);
            // double d = new Double(.1) + new Double(.3);
            // double d = new Double(.1) - new Double(.3);

            if(array[0] != 1.0){
                System.err.println("claims array[0] != 1.0....array[0] = " + array[0]);

                if(array[0] != 1.0){
                    System.err.println("claims array[0] still != 1.0...array[0] = " + array[0]);
                }else {
                    System.err.println("claims array[0] now == 1.0...array[0] = " + array[0]);
                }

                System.exit(0);
            }else if(r.nextBoolean()){
                array = new double[1];
            }
        }
    }

    public static void main(String[] args) {
        new Test();
    }
}
bcothren
  • did you check for bad memory? – wero Oct 07 '15 at 14:19
  • I can't reproduce this. Works as expected here. – marstran Oct 07 '15 at 14:20
  • `double` is inherently not precise. Are you sure that this is not your issue? – PM 77-1 Oct 07 '15 at 14:20
  • @wero it's not likely that multiple machines all happen to have bad memory (I can reproduce on 3 different 64-bit Win7 boxes). @marstran is your setup as I described? @PM 77-1 yes, but that shouldn't affect the `if` evaluation. – bcothren Oct 07 '15 at 14:25
  • @bcothren not likely but still possible – wero Oct 07 '15 at 14:27
  • @wero true, I'll run some memchecks – bcothren Oct 07 '15 at 14:28
  • I suggest you file a bug against the JDK if you can reproduce this. I can only guess it has something to do with JIT kicking in. Out of interest: is the `new Double(...)` needed? This is something I wouldn't expect to find in real code. @PM77-1: Storing the integral value of `1.0` in a `double` should never lead to such problems since it can be represented without loss of precision. If `==` gives false results, I'd rather expect that for some reason a boxed value is used (due to an error in the JVM). – Axel Oct 07 '15 at 14:37
  • @Axel in real code it was actually doing some arithmetic, I just boiled it down to the creation of a new `Double` in an attempt to come up with the simplest test case. Are you able to reproduce the problem? It could be premature to file a bug report? – bcothren Oct 07 '15 at 14:40
  • I can reproduce this (oracle java 1.8.0_60 on 64-bit linux machine). – Roman Oct 07 '15 at 14:47
  • @bcothren I'm stuck at JDK 7 at work... – Axel Oct 07 '15 at 14:48
  • @Axel your comment about JIT makes sense. I'm going to start trying to replicate the issue after disabling JIT. – bcothren Oct 07 '15 at 15:03
  • Looks similar to http://stackoverflow.com/a/32986956/5032339 (also a "should not happen" issue). After I excluded your code from JIT compiling, it worked fine. – Roman Oct 07 '15 at 15:04
  • @PM77-1 doubles may be imprecise, but `1.0d == 1.0d` should still be true. Always. – assylias Oct 07 '15 at 15:09
  • It must have been a lot of work to track down the issue from production code to this small example… – Holger Oct 07 '15 at 15:13
  • @PM77-1 There is a considerable number of integers that can be represented without rounding error in a double so that == works without a problem. – laune Oct 07 '15 at 15:18
  • @Axel @Roman Disabling JIT (I went big and just used `-Xint`) does indeed remove the issue. I do not like this as a solution, but at least it is a reason for the issue. – bcothren Oct 07 '15 at 15:23
  • The solution would be, as Axel said, to file a bug. – Roman Oct 07 '15 at 15:29
  • @Roman I will do that. I really meant a solution for our software... – bcothren Oct 07 '15 at 16:27
  • This is definitely a JIT optimization bug. `-XX:-TieredCompilation` or `-XX:-EliminateAllocations` is likely to be an acceptable workaround that does not significantly degrade the performance. – apangin Oct 07 '15 at 18:20

2 Answers


Update: it seems that my original answer was incorrect and that OnStackReplacement merely revealed the problem in this particular case; the actual bug is in the escape analysis code. Escape analysis is a compiler subsystem which determines whether an object escapes the given method. Non-escaping objects can be scalarized (replaced by their fields instead of being allocated on the heap) or optimized out entirely. Escape analysis does matter in our test, as several of the created objects clearly don't escape the method.
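
To illustrate what that means, here is a hypothetical sketch of my own (not code from the question; the class and method names are made up):

public class EscapeDemo {

    // The temporary array never leaves this method, so escape analysis can
    // prove it does not escape and scalar-replace it: in the compiled code
    // no heap allocation needs to happen at all.
    static double lengthSquared(double x, double y) {
        double[] p = {x, y};              // non-escaping allocation
        return p[0] * p[0] + p[1] * p[1];
    }

    // This array is handed back to the caller, so it escapes and has to be
    // allocated on the heap as usual.
    static double[] makePoint(double x, double y) {
        return new double[] {x, y};       // escaping allocation
    }

    public static void main(String[] args) {
        double sum = 0;
        // enough iterations for the JIT to compile and optimize lengthSquared
        for (int i = 0; i < 1_000_000; i++) {
            sum += lengthSquared(i, i + 1);
        }
        System.out.println(sum + " / " + makePoint(1, 2).length);
    }
}

In the question's test, someArray, someArray2 and the new Double(...) box are exactly this kind of non-escaping allocation, which is why an escape analysis bug can affect it.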

I downloaded and installed JDK 9 early access build 83 and noticed that the bug disappears there. However, in JDK 9 early access build 82 it still exists. The changelog between b82 and b83 shows only one relevant bug fix (correct me if I'm wrong): JDK-8134031, "Incorrect JIT compilation of complex code with inlining and escape analysis". The committed test case is somewhat similar: a big loop and several boxes (similar to the one-element arrays in our test) whose contents suddenly change, so the result becomes silently incorrect (no crash, no exception, just a wrong value). As in our case, the problem is reported not to appear prior to 8u40. The committed fix is very short: just a one-line change in the escape analysis source.

According to the OpenJDK bug tracker, the fix has already been backported to the JDK 8u72 branch, which is scheduled to be released in January 2016. It seems it was too late to backport the fix to the upcoming 8u66.

The suggested workaround is to disable escape analysis (-XX:-DoEscapeAnalysis) or to disable the eliminate-allocations optimization (-XX:-EliminateAllocations). Thus @apangin was actually closer to the answer than I was.
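
For example, assuming the Test class from the question is already compiled, either of these invocations should sidestep the miscompilation:

java -XX:-DoEscapeAnalysis Test
java -XX:-EliminateAllocations Test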

Below is the original answer


First, I cannot reproduce the problem on JDK 8u25, but I can on JDK 8u40 and 8u60: sometimes the program runs correctly (stuck in the infinite loop), sometimes it prints the output and exits. So if a JDK downgrade to 8u25 is acceptable for you, you may consider doing that. Note that if you need later javac fixes (many things, especially involving lambdas, were fixed in 1.8u40), you can compile with the newer javac but run on the older JVM.
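
For instance (the install paths below are hypothetical; adjust them to your machine), you could compile once with the newer JDK and run on the older one:

"C:\Program Files\Java\jdk1.8.0_60\bin\javac" Test.java
"C:\Program Files\Java\jdk1.8.0_25\bin\java" Test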

To me it seems that this particular problem is likely a bug in the OnStackReplacement mechanism (when OSR occurs at tier 4). If you're unfamiliar with OSR, you may read this answer. OSR surely occurs in your case, but in a somewhat unusual way. Here's the -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls output for a failed run (% means an OSR compilation, @ 28 is the OSR bytecode position, and (3) and (4) are the tier levels):

...
     91   37 %     3       Test::<init> @ 28 (194 bytes)
Installing osr method (3) Test.<init>()V @ 28
     93   38       3       Test::<init> (194 bytes)
Installing method (3) Test.<init>()V 
     94   39 %     4       Test::<init> @ 16 (194 bytes)
Installing osr method (4) Test.<init>()V @ 16
    102   40 %     4       Test::<init> @ 28 (194 bytes)
    103   39 %     4       Test::<init> @ -2 (194 bytes)   made not entrant
...
Installing osr method (4) Test.<init>()V @ 28
    113   37 %     3       Test::<init> @ -2 (194 bytes)   made not entrant
claims array[0] != 1.0....array[0] = 1.0
claims array[0] now == 1.0...array[0] = 1.0
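
A log like this can be obtained with roughly the following invocation (Test is the class from the question):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls Test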

Thus the tier 4 OSR compilation occurs for two different bytecode offsets: offset 16 (the while loop entry point) and offset 28 (the nested for loop entry point). It seems that some race condition occurs during the context transfer between the two OSR-compiled versions of your method, which results in a broken context. When execution is handed over to an OSR method, the current context, including the values of local variables like array and r, must be transferred into the OSR'ed method. Something bad happens here: probably the <init>@16 OSR version runs for a short time, then it's replaced with <init>@28, but the context is updated with a slight delay. It's likely that the OSR context transfer interferes with the "eliminate allocations" optimization (as noted by @apangin, switching off this optimization helps in your case). My expertise is not enough to dig further here; probably @apangin can comment.

In contrast, in a normal run only one copy of the tier 4 OSR method is created and installed:

...
Installing method (3) Test.<init>()V 
     88   43 %     4       Test::<init> @ 28 (194 bytes)
Installing osr method (4) Test.<init>()V @ 28
    100   40 %     3       Test::<init> @ -2 (194 bytes)   made not entrant
   4592   44       3       java.lang.StringBuilder::append (8 bytes)
...

So it seems that in this case there is no race between two OSR versions and everything works correctly.

The problem also disappears if you move the outer loop body into a separate method:

import java.util.Random;

public class Test2 {
    private static void doTest(double[] array, Random r) {
        double someArray[] = new double[1];
        double someArray2[] = new double[2];

        for (int i = 0; i < someArray2.length; i++) {
            someArray2[i] = r.nextDouble();
        }

        ... // rest of your code
    }

    Test2() {
        double array[] = new double[1];
        Random r = new Random();

        while (true) {
            doTest(array, r);
        }
    }

    public static void main(String[] args) {
        new Test2();
    }
}

Manually unrolling the nested for loop also removes the bug:

int i=0;
someArray2[i++] = r.nextDouble();
someArray2[i++] = r.nextDouble();

To hit this bug it seems that you need at least two nested loops in the same method, so that OSR can occur at different bytecode positions. So to work around the problem in your particular piece of code you can do just that: extract the loop body into a separate method.

An alternative solution is to disable OSR completely with -XX:-UseOnStackReplacement. OSR rarely helps in production code. Loop counters still work, and if a method containing a long loop is called at least twice, the second call will be JIT-compiled anyway. Also, even if a method with a long loop is not JIT-compiled because OSR is disabled, any methods it calls can still be JIT-compiled.
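
A minimal way to try that (again assuming the Test class from the question):

java -XX:-UseOnStackReplacement Test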

Tagir Valeev
  • Great work. Please include this in the bug report as it might help the JDK developers fixing the issue. I'd give +2 if I could... :-) – Axel Oct 08 '15 at 05:38
  • Yeah, on-stack replacement helps if you have big long running methods whose code is performance relevant, which is a pattern that does not match typical application code, but typical artificial benchmark code. – Holger Oct 08 '15 at 09:09
  • Wow, great work! I had already submitted the bug report and it is still under review. Assuming I can add information to it when/if it gets accepted, I will certainly include this. Thanks again! – bcothren Oct 12 '15 at 17:05
  • @bcothren, I edited the answer. Seems that the problem was somewhat different and it's already fixed. – Tagir Valeev Oct 13 '15 at 10:25

I can reproduce this error in Zulu (a certified build of OpenJDK) with the code posted at http://www.javaspecialists.eu/archive/Issue234.html

With the Oracle VM, I can only reproduce this error after I have run the code in Zulu. It seems that Zulu contaminates the shared lookup cache. The solution in this case is to run the code with -XX:-EnableSharedLookupCache.

  • Azul has 2 JVMs, Zulu and Zing. From the link you provide (which is broken) it seems that you are referring to Zulu, not Zing. Zulu is an OpenJDK build which is entirely OpenJDK code, but tested and supported. It should exhibit the same behaviour for the comparable version. Zing is a different beast altogether. – Nitsan Wakart Dec 08 '16 at 07:39