4

UPDATE: The problem is at least cursorily solved with a combination of the -XX:+UseCountedLoopSafepoints and a JVM version equal to or greater than JDK 1.8.0_92 (Tested with _161).

ORIG: JDK 1.8.0 example run included, but it also happens on JDK 9.0.4.

I've been working on a weird bug that seems to manifest itself as the JVM stopping all threads, in a manner very consistent with safepoint pauses. The VM operation is sometimes listed as "no VM operation". The GC is not being engaged at the slowdown, as confirmed by the -XX options output AND by jvisualVM.

This is all in a single 100 line class below. I've paired down the code dramatically to show the problem based on a simple boolean setting. Look for the MAKE_THE_JVM_PAUSE_UNEXPECTEDLY boolean.

I'm creating a 5000x5000 array of booleans, and accessing the first 10 rows randomly at column 66. The real code caused the same issue.

The algorithm of this example itself is nonsensical. It's just a highly paired down example.

Options used for running this are in the header comments.

package conway;

/**
 * Run this with the following java VM options:
 *
 *      -verbosegc
 *      -XX:+PrintGCApplicationStoppedTime
 *      -XX:+PrintSafepointStatistics
 *      -XX:PrintSafepointStatisticsCount=1
 *      -XX:+PrintGCDetails
 *      -XX:+PerfDisableSharedMem
 */
public class Conway
{
    public static void main(String[] args)
    {
        //-----------------------[HEARTBEAT]-----------------------

        new Thread(() ->
        {
            long startTime = System.currentTimeMillis();
            long lastTime = startTime;

            for (int count=0;; count++)
            {
                // Time stamp....
                long now = System.currentTimeMillis();
                System.out.printf("%d-->Delta time from Start[%dms], from last[%dms]\n",
                                  count, now-startTime, now-lastTime);
                System.out.flush();
                lastTime = now;

                // Waste time without sleeping...
                for (long start = System.currentTimeMillis();
                     System.currentTimeMillis() - start < 200;);
            }
        }).start();

        //-----------------------[COMPUTATION]-----------------------

        one = new boolean[5000][5000];

        boolean MAKE_THE_JVM_PAUSE_UNEXPECTEDLY = true;

        for (;;)
        {
            if (MAKE_THE_JVM_PAUSE_UNEXPECTEDLY)
            {
                /**
                 * CALLING sweep() will cause all threads to periodically pause...
                 */
                sweep();
            }
            else
            {
                /**
                 * THIS ONE RUNS FINE.  Exact same contents as in sweep() below...
                 */
                for (int y=0; y<5000; y++)
                {
                    for (int x=0; x<5000; x++)
                    {
                        int living=0;
                        for (int yy=y-1; yy <= y+1; yy++)
                            for (int xx=x-1; xx <= x+1; xx++)
                            {
                                if (one[randomInt(0, 10)][66])
                                    living++;
                            }
                    } // for(x...)

                } // for(y...)
            }
        } // for(;;)
    }

    public static void sweep()
    {
        for (int y=0; y<5000; y++)
        {
            for (int x=0; x<5000; x++)
            {
                int living=0;
                for (int yy=y-1; yy <= y+1; yy++)
                    for (int xx=x-1; xx <= x+1; xx++)
                    {
                        if (one[randomInt(0,10)][66])
                            living++;
                    }
            } // for(x...)

        } // for(y...)
    }

    public static int randomInt(int first, int last)
    {
        return (int)(((last-first+1)*Math.random())+first);
    }

    public static boolean[][] one;
}

Now when I run this thing with the MAKE_THE_JVM_PAUSE_UNEXPECTEDLY boolean set to true, it causes the weird slowdowns.

Note that the problem first exhibits itself between heartbeat 49 and 50. The Heartbeats are from the first (anonymously launched) thread in the code and the algorithm itself is using the main class thread and are identified by the lines that look similar to this:

11-->Delta time from Start[2210ms], from last[200ms]

I terminated the run myself.

[GC (Allocation Failure) [PSYoungGen: 512K->384K(1024K)] 512K->384K(523776K), 0.0004807 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.053: ParallelGCFailedAllocation       [       5          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0005755 seconds
[GC (Allocation Failure) [PSYoungGen: 896K->480K(1024K)] 896K->488K(523776K), 0.0003818 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.063: ParallelGCFailedAllocation       [       5          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0004564 seconds
[GC (Allocation Failure) [PSYoungGen: 992K->496K(1024K)] 1000K->536K(523776K), 0.0004566 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.072: ParallelGCFailedAllocation       [      10          1              1    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0007704 seconds
[GC (Allocation Failure) [PSYoungGen: 1008K->480K(1536K)] 1048K->660K(524288K), 0.0009025 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.082: ParallelGCFailedAllocation       [      10          0              0    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0009905 seconds
[GC (Allocation Failure) [PSYoungGen: 1499K->499K(1536K)] 1679K->1006K(524288K), 0.0006638 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.102: ParallelGCFailedAllocation       [      11          1              1    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0009292 seconds
[GC (Allocation Failure) [PSYoungGen: 1518K->992K(2560K)] 2025K->2018K(525312K), 0.0007172 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.103: ParallelGCFailedAllocation       [      11          0              1    ]      [     0     0     0     0     0    ]  0   
Total time for which application threads were stopped: 0.0007980 seconds
[GC (Allocation Failure) [PSYoungGen: 2526K->1507K(3072K)] 3552K->3515K(525824K), 0.0009654 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.107: ParallelGCFailedAllocation       [      11          0              1    ]      [     0     0     0     0     1    ]  0   
Total time for which application threads were stopped: 0.0010672 seconds
[GC (Allocation Failure) [PSYoungGen: 3041K->2547K(5120K)] 5049K->5128K(527872K), 0.0010321 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.108: ParallelGCFailedAllocation       [      11          0              2    ]      [     0     0     0     0     1    ]  0   
Total time for which application threads were stopped: 0.0011153 seconds
[GC (Allocation Failure) [PSYoungGen: 5102K->3072K(5632K)] 7684K->7718K(528384K), 0.0012587 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.109: ParallelGCFailedAllocation       [      11          0              1    ]      [     0     0     0     0     1    ]  0   
Total time for which application threads were stopped: 0.0013348 seconds
[GC (Allocation Failure) [PSYoungGen: 5630K->4608K(7680K)] 10276K->10382K(530432K), 0.0016113 secs] [Times: user=0.03 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.111: ParallelGCFailedAllocation       [      11          0              1    ]      [     0     0     0     0     1    ]  0   
Total time for which application threads were stopped: 0.0016921 seconds
0-->Delta time from Start[0ms], from last[0ms]
[GC (Allocation Failure) [PSYoungGen: 7678K->3328K(8704K)] 13452K->13341K(531456K), 0.0013899 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.113: ParallelGCFailedAllocation       [      11          0              1    ]      [     0     0     1     0     1    ]  0   
Total time for which application threads were stopped: 0.0024590 seconds
[GC (Allocation Failure) [PSYoungGen: 6399K->3360K(11264K)] 16412K->16419K(534016K), 0.0011903 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.115: ParallelGCFailedAllocation       [      11          1              2    ]      [     0     0     0     0     1    ]  0   
Total time for which application threads were stopped: 0.0013046 seconds
[GC (Allocation Failure) [PSYoungGen: 8477K->5536K(11264K)] 21537K->21642K(534016K), 0.0020710 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.117: ParallelGCFailedAllocation       [      11          0              1    ]      [     0     0     0     0     2    ]  0   
Total time for which application threads were stopped: 0.0021520 seconds
1-->Delta time from Start[210ms], from last[210ms]
2-->Delta time from Start[410ms], from last[200ms]
3-->Delta time from Start[610ms], from last[200ms]
4-->Delta time from Start[810ms], from last[200ms]
5-->Delta time from Start[1010ms], from last[200ms]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
1.119: no vm operation                  [      11          2              2    ]      [     0     0     0     0     0    ]  2   
Total time for which application threads were stopped: 0.0001547 seconds
6-->Delta time from Start[1210ms], from last[200ms]
7-->Delta time from Start[1410ms], from last[200ms]
8-->Delta time from Start[1610ms], from last[200ms]
9-->Delta time from Start[1810ms], from last[200ms]
10-->Delta time from Start[2010ms], from last[200ms]
11-->Delta time from Start[2210ms], from last[200ms]
12-->Delta time from Start[2410ms], from last[200ms]
13-->Delta time from Start[2610ms], from last[200ms]
14-->Delta time from Start[2810ms], from last[200ms]
15-->Delta time from Start[3010ms], from last[200ms]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
3.120: no vm operation                  [      11          2              2    ]      [     0     0     0     0     0    ]  2   
Total time for which application threads were stopped: 0.0000670 seconds
16-->Delta time from Start[3210ms], from last[200ms]
17-->Delta time from Start[3410ms], from last[200ms]
18-->Delta time from Start[3610ms], from last[200ms]
19-->Delta time from Start[3810ms], from last[200ms]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
4.084: EnableBiasedLocking              [      11          2              2    ]      [     0     0     0     0     0    ]  2   
Total time for which application threads were stopped: 0.0000737 seconds
20-->Delta time from Start[4010ms], from last[200ms]
21-->Delta time from Start[4210ms], from last[200ms]
22-->Delta time from Start[4410ms], from last[200ms]
23-->Delta time from Start[4610ms], from last[200ms]
24-->Delta time from Start[4810ms], from last[200ms]
25-->Delta time from Start[5010ms], from last[200ms]
26-->Delta time from Start[5210ms], from last[200ms]
27-->Delta time from Start[5410ms], from last[200ms]
28-->Delta time from Start[5610ms], from last[200ms]
29-->Delta time from Start[5810ms], from last[200ms]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
6.084: no vm operation                  [      11          2              2    ]      [     0     0     0     0     0    ]  2   
Total time for which application threads were stopped: 0.0000621 seconds
30-->Delta time from Start[6010ms], from last[200ms]
31-->Delta time from Start[6210ms], from last[200ms]
32-->Delta time from Start[6410ms], from last[200ms]
33-->Delta time from Start[6610ms], from last[200ms]
34-->Delta time from Start[6810ms], from last[200ms]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
7.085: no vm operation                  [      11          2              2    ]      [     0     0     0     0     0    ]  2   
Total time for which application threads were stopped: 0.0000605 seconds
35-->Delta time from Start[7010ms], from last[200ms]
36-->Delta time from Start[7210ms], from last[200ms]
37-->Delta time from Start[7410ms], from last[200ms]
38-->Delta time from Start[7610ms], from last[200ms]
39-->Delta time from Start[7810ms], from last[200ms]
40-->Delta time from Start[8010ms], from last[200ms]
41-->Delta time from Start[8210ms], from last[200ms]
42-->Delta time from Start[8410ms], from last[200ms]
43-->Delta time from Start[8610ms], from last[200ms]
44-->Delta time from Start[8810ms], from last[200ms]
45-->Delta time from Start[9010ms], from last[200ms]
46-->Delta time from Start[9210ms], from last[200ms]
47-->Delta time from Start[9410ms], from last[200ms]
48-->Delta time from Start[9610ms], from last[200ms]
49-->Delta time from Start[9810ms], from last[200ms]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
10.087: no vm operation                  [      11          2              2    ]      [  3009     0  3009     0     0    ]  2   
Total time for which application threads were stopped: 3.0091028 seconds
50-->Delta time from Start[12994ms], from last[3184ms]
51-->Delta time from Start[13194ms], from last[200ms]
52-->Delta time from Start[13394ms], from last[200ms]
53-->Delta time from Start[13594ms], from last[200ms]
54-->Delta time from Start[13794ms], from last[200ms]
55-->Delta time from Start[13994ms], from last[200ms]
56-->Delta time from Start[14194ms], from last[200ms]
57-->Delta time from Start[14394ms], from last[200ms]
58-->Delta time from Start[14594ms], from last[200ms]
59-->Delta time from Start[14794ms], from last[200ms]
60-->Delta time from Start[14994ms], from last[200ms]
61-->Delta time from Start[15194ms], from last[200ms]
62-->Delta time from Start[15394ms], from last[200ms]
63-->Delta time from Start[15594ms], from last[200ms]
64-->Delta time from Start[15794ms], from last[200ms]
65-->Delta time from Start[15994ms], from last[200ms]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
16.098: no vm operation                  [      11          2              2    ]      [  1221     0  1221     0     0    ]  2   
Total time for which application threads were stopped: 1.2213130 seconds
66-->Delta time from Start[17217ms], from last[1223ms]
67-->Delta time from Start[17417ms], from last[200ms]
68-->Delta time from Start[17617ms], from last[200ms]
69-->Delta time from Start[17817ms], from last[200ms]
70-->Delta time from Start[18017ms], from last[200ms]
71-->Delta time from Start[18217ms], from last[200ms]
72-->Delta time from Start[18417ms], from last[200ms]
73-->Delta time from Start[18617ms], from last[200ms]
74-->Delta time from Start[18817ms], from last[200ms]
75-->Delta time from Start[19017ms], from last[200ms]
76-->Delta time from Start[19217ms], from last[200ms]
77-->Delta time from Start[19417ms], from last[200ms]
78-->Delta time from Start[19617ms], from last[200ms]
79-->Delta time from Start[19817ms], from last[200ms]
80-->Delta time from Start[20017ms], from last[200ms]

Process finished with exit code 1

Now when I run this thing with the MAKE_THE_JVM_PAUSE_UNEXPECTEDLY boolean set to false, it causes no slowdowns at any of the safepoints.

Why does placing this code within a method cause this?

alife
  • 411
  • 4
  • 14
  • 1
    Have you tried running with `-XX:+UseCountedLoopSafepoints`? – the8472 Feb 05 '18 at 17:05
  • Interesting. I'll try it on the Java8 run example I gave, but what would the Java9 version of that be? They scrapped the entire `-XX` mechanism and went with an option format that doesn't make sense to me, hence the reason I have no JVM output for the Java9 version of my code. – alife Feb 05 '18 at 17:25
  • Not sure what you mean, most `XX` options continue to work in java 9, although some [logging options changed](https://stackoverflow.com/questions/41678557/how-to-list-syntax-and-available-tags-for-xlog) among others. Note that XX options are subject to change between any released version. You can use them but you have to expect to adjust them whenever you change versions. – the8472 Feb 05 '18 at 17:29
  • The same problem as [here](https://stackoverflow.com/a/38427546/3448419). Why there is no pause in a single method - because the method is compiled differently, using OSR (on-stack replacement) strategy. – apangin Feb 05 '18 at 17:43
  • @apangin, it's the other way around in my case. The method is the one with the delays. – alife Feb 05 '18 at 18:10
  • @alife I mean when `MAKE_THE_JVM_PAUSE_UNEXPECTEDLY=true`, sweep method is compiled separately, and this is a normal compilation with safepoint polls removed (hence the delay). When all the code is in one big method, that method is OSR-compiled. I.e. it is not fully optimized, and the safepoint polls are not removed. – apangin Feb 05 '18 at 19:04

1 Answers1

2

Since the safepoint happens roughly every second it is probably the GuaranteedSafepointInterval in action, i.e. it is triggered by a timer.

I have not tested your code, but it most likely takes long to reach the safepoint once it has been triggered because there is an optimization that elides safepoint polling in loops that are known by the JIT to terminate in a finite amount of time or call other methods which will have safepoint polls in their bodies. The issue is that "finite amount of time" can still be large for CPU-heavy pure computation loops and thus lead to safepoint stalls. In java 8 and 9 this can be disabled at a slight performance penalty with -XX:+UseCountedLoopSafepoints. Java 10 will introduce an optimization to balance between safepoint latency in finite loops and per-loop polling (JDK-8186027).

the8472
  • 40,999
  • 5
  • 70
  • 122
  • The JVM (1.8.0) is punting immediately with an error saying that that option doesn't exist. Similar to this bug report [RFR(XS): 8162603 - Unrecognized VM option 'UseCountedLoopSafepoints'](http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2016-July/023916.html). – alife Feb 05 '18 at 18:04
  • Further, the error message is obtuse. It's asking _"Did you mean '(+/-)UseLoopSafepoints'?"_ So I tried THAT option as well and then it replied something humorous: _"Unrecognized VM option 'UseLoopSafepoints' Did you mean '(+/-)UseLoopSafepoints'?"_ My option used in that 2nd case was this: `-XX:+UseLoopSafepoints`. **(????)** – alife Feb 05 '18 at 18:07
  • I'm using an Oracle JVM though (from within Intellij IDEA), not OpenJDK's, so the bug report I pointed to is just for comparison. When I run using the JDK 1.8.0 directly from a command line, I get the same results. – alife Feb 05 '18 at 18:12
  • Perhaps I need Java8u92 and not .0. I'll download that and report back. – alife Feb 05 '18 at 18:16
  • The option seems to be conditional on server VMs, but afaik all 64bit hotspot JVMs should be classified as server JVMs these days, odd. Maybe you should try with java 9. "Oracle JVM though, not OpenJDK's" they use largely the same codebase. – the8472 Feb 05 '18 at 18:17
  • you should always use the latest version for the respective releases, for 8 that would be u162. – the8472 Feb 05 '18 at 18:18
  • Actually, yeah, _161 is what I had downloaded. Your suggested option `-XX:+UseCountedLoopSafepoints` works perfectly! Thanks everyone. – alife Feb 05 '18 at 20:47
  • @the8472 not only “these days”; there never was a 64bit client JVM… – Holger Feb 06 '18 at 12:33
  • Can you explain something further, (and @apangin and others too if you have a second)? When I run *without* the `-XX:+UseCountedLoopSafepoints` options on, and hence cause the delay within `sweep()`: 1. If I change the array access to a calculation it doesn't pause. 2. If I change the array access to a calculation that involves another method call (in this case to `randomInt()`), it *does* pause. See the image [here](https://drive.google.com/open?id=12-J4h-dKiXMPlAZ-ojimC20MfLmIyNBt). – alife Feb 06 '18 at 17:30
  • @alife I suggest you look at the generated assembly code and optimization event logging if you're curious. These things are highly sensitive to decisions made by the optimizing passes and multiple JIT stages. Minor changes can result in different outcomes. It's better to not tweak the java code if your goal is to avoid this problem because the tweaks could easily become ineffective if circumstances change (java versions, calling methods, inlining decisions...). That's why the flag exists, to have a direct switch to flip if it becomes an issue instead of indirect tweaks. – the8472 Feb 06 '18 at 17:54
  • Also, the last variant is probably much faster, so can reach the safepoint faster. If you have additional questions, please file new ones instead of using the comments. – the8472 Feb 06 '18 at 17:57