
We tune our GC for minimal "stop-the-world" pauses. The Perm and Tenured generations behave well. Young collections are also fine most of the time, and the pauses usually don't exceed 500 ms (note [Times: user=0.35 sys=0.02, real=0.06 secs]):

{Heap before GC invocations=11603 (full 60):
 par new generation   total 3640320K, used 3325226K [0x0000000600000000, 0x00000006f6e00000, 0x00000006f6e00000)
  eden space 3235840K, 100% used [0x0000000600000000, 0x00000006c5800000, 0x00000006c5800000)
  from space 404480K,  22% used [0x00000006de300000, 0x00000006e3a4a898, 0x00000006f6e00000)
  to   space 404480K,   0% used [0x00000006c5800000, 0x00000006c5800000, 0x00000006de300000)
 concurrent mark-sweep generation total 4147200K, used 1000363K [0x00000006f6e00000, 0x00000007f4000000, 0x00000007f4000000)
 concurrent-mark-sweep perm gen total 196608K, used 133030K [0x00000007f4000000, 0x0000000800000000, 0x0000000800000000)
2012-07-16T14:36:05.641+0200: 427048.412: [GC 427048.412: [ParNew
Desired survivor size 207093760 bytes, new threshold 4 (max 4)
- age   1:    8688880 bytes,    8688880 total
- age   2:   12432832 bytes,   21121712 total
- age   3:   18200488 bytes,   39322200 total
- age   4:   20473336 bytes,   59795536 total
: 3325226K->75635K(3640320K), 0.0559610 secs] 4325590K->1092271K(7787520K), 0.0562630 secs] [Times: user=0.35 sys=0.02, real=0.06 secs] 
Heap after GC invocations=11604 (full 60):
 par new generation   total 3640320K, used 75635K [0x0000000600000000, 0x00000006f6e00000, 0x00000006f6e00000)
  eden space 3235840K,   0% used [0x0000000600000000, 0x0000000600000000, 0x00000006c5800000)
  from space 404480K,  18% used [0x00000006c5800000, 0x00000006ca1dcf40, 0x00000006de300000)
  to   space 404480K,   0% used [0x00000006de300000, 0x00000006de300000, 0x00000006f6e00000)
 concurrent mark-sweep generation total 4147200K, used 1016635K [0x00000006f6e00000, 0x00000007f4000000, 0x00000007f4000000)
 concurrent-mark-sweep perm gen total 196608K, used 133030K [0x00000007f4000000, 0x0000000800000000, 0x0000000800000000)
}

However, sometimes, typically about once a day, we get an exceptionally long Young generation collection (note [Times: user=0.41 sys=0.01, real=5.51 secs]):

{Heap before GC invocations=7884 (full 37):
 par new generation   total 3640320K, used 3304448K [0x0000000600000000, 0x00000006f6e00000, 0x00000006f6e00000)
  eden space 3235840K, 100% used [0x0000000600000000, 0x00000006c5800000, 0x00000006c5800000)
  from space 404480K,  16% used [0x00000006c5800000, 0x00000006c9b00370, 0x00000006de300000)
  to   space 404480K,   0% used [0x00000006de300000, 0x00000006de300000, 0x00000006f6e00000)
 concurrent mark-sweep generation total 4147200K, used 1967225K [0x00000006f6e00000, 0x00000007f4000000, 0x00000007f4000000)
 concurrent-mark-sweep perm gen total 189100K, used 112825K [0x00000007f4000000, 0x00000007ff8ab000, 0x0000000800000000)
2012-07-15T01:23:25.474+0200: 293088.245: [GC 293093.636: [ParNew
Desired survivor size 207093760 bytes, new threshold 4 (max 4)
- age   1:   30210472 bytes,   30210472 total
- age   2:   11614600 bytes,   41825072 total
- age   3:    8591680 bytes,   50416752 total
- age   4:    7779600 bytes,   58196352 total
: 3304448K->73854K(3640320K), 0.1158280 secs] 5271674K->2046454K(7787520K), 0.1181990 secs] [Times: user=0.41 sys=0.01, real=5.51 secs] 
Heap after GC invocations=7885 (full 37):
 par new generation   total 3640320K, used 73854K [0x0000000600000000, 0x00000006f6e00000, 0x00000006f6e00000)
  eden space 3235840K,   0% used [0x0000000600000000, 0x0000000600000000, 0x00000006c5800000)
  from space 404480K,  18% used [0x00000006de300000, 0x00000006e2b1fb40, 0x00000006f6e00000)
  to   space 404480K,   0% used [0x00000006c5800000, 0x00000006c5800000, 0x00000006de300000)
 concurrent mark-sweep generation total 4147200K, used 1972599K [0x00000006f6e00000, 0x00000007f4000000, 0x00000007f4000000)
 concurrent-mark-sweep perm gen total 189100K, used 112825K [0x00000007f4000000, 0x00000007ff8ab000, 0x0000000800000000)
}

Below is the relevant output from jstat -gccause:

Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC                 
   293083.2  16.96   0.00  96.26  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      No GC               
   293083.2  16.96   0.00  96.26  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      No GC               
   293084.2  16.96   0.00  97.69  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      No GC               
   293085.3  16.96   0.00  98.32  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      No GC               
   293086.3  16.96   0.00  99.32  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      No GC               
Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC                 
   293087.3  16.96   0.00  99.55  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      No GC               
   293088.3  16.96   0.00 100.00  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      Allocation Failure  
   293089.3  16.96   0.00 100.00  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      Allocation Failure  
   293090.3  16.96   0.00 100.00  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      Allocation Failure  
   293091.3  16.96   0.00 100.00  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      Allocation Failure  
   293092.3  16.96   0.00 100.00  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      Allocation Failure  
   293093.3  16.96   0.00 100.00  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      Allocation Failure  
   293084.2  16.96   0.00 100.00  47.44  59.66   7884  772.326    74   10.893  783.219 unknown GCCause      Allocation Failure  
   293094.3   0.00  18.26   6.23  47.56  59.66   7885  772.442    74   10.893  783.334 unknown GCCause      No GC               
   293094.6   0.00  18.26   6.71  47.56  59.66   7885  772.442    74   10.893  783.334 unknown GCCause      No GC               
   293095.3   0.00  18.26   6.85  47.56  59.66   7885  772.442    74   10.893  783.334 unknown GCCause      No GC               
   293095.6   0.00  18.26   6.92  47.56  59.66   7885  772.442    74   10.893  783.334 unknown GCCause      No GC               
   293096.2   0.00  18.26   9.84  47.56  59.66   7885  772.442    74   10.893  783.334 unknown GCCause      No GC               
   293096.6   0.00  18.26  10.11  47.56  59.66   7885  772.442    74   10.893  783.334 unknown GCCause      No GC 
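
The table was captured by sampling jstat periodically; an invocation along these lines produces this kind of output (the PID is a placeholder and the one-second interval is only an assumption, since the exact sampling setup is not shown; -t adds the Timestamp column):

jstat -t -gccause <pid> 1000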

"Allocation Failure" appears as the GC cause in other places as well, but always as a single entry. When it comes in the sequence like this, it is associated with a long GC pause. I looked into Oracle JVM sources, and "Allocation Failure" looks like a pretty natural situation: Young is out of space for a new data and it's time to clean up. I checked for any memory intensive, unexpected actions in the system before the pause happened, but found nothing suspicious.

Please note that the Young Garbage Collection Time (YGCT) barely rises during the pause: in the slow collection above, the ParNew phase itself took about 0.12 s, while the real pause was 5.51 s. My memory and GC settings are as follows (logging settings omitted):

-Xms6000m 
-Xmn2950m 
-Xmx6000m 
-XX:MaxPermSize=192m 
-XX:+UseConcMarkSweepGC 
-XX:+UseParNewGC 
-Dsun.rmi.dgc.client.gcInterval=9223372036854775807 
-Dsun.rmi.dgc.server.gcInterval=9223372036854775807
-XX:CMSInitiatingOccupancyFraction=50 
-XX:+CMSScavengeBeforeRemark 
-XX:+CMSClassUnloadingEnabled

We also tested with 8000m and 12000m heaps, on the following machines:

  • 8-core with 16GB of memory
  • 24-core with 50GB of memory

So my basic question is: why does ParNew occasionally behave this way?

Wacław Borowiec

1 Answer


Before a GC can be performed, the JVM has to bring every thread to a safepoint (it cannot just stop every thread immediately). If you have long-running JNI calls or system calls, it can take a long time to reach a safepoint. What you see in that situation is a long pause, even though the collection itself took a normal amount of time.
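
One way to confirm this, assuming a HotSpot JVM of this vintage (none of these flags appear in the question's configuration), is to log how long the JVM spends bringing threads to a safepoint versus doing the actual work:

-XX:+PrintGCApplicationStoppedTime
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1

If the time-to-safepoint ("sync") column dominates while the VM operation ("vmop") time stays small, the delay comes from reaching the safepoint rather than from the collection itself.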

Peter Lawrey
  • Peter, does that mean that a new memory allocation request will block until the safe point is reached and memory reclaimed? Or will that request get fulfilled immediately from other partitions? – Marko Topolnik Jul 17 '12 at 08:51
  • The new memory allocation can trigger a GC, which results in ALL threads reaching a safepoint before the GC starts. This can mean waking threads that were stopped by the OS so they can reach a safepoint. – Peter Lawrey Jul 17 '12 at 08:55
  • So all threads will block awaiting that one lagging thread to reach the safe point? – Marko Topolnik Jul 17 '12 at 08:56
  • This morning we hit a 1000 sec pause: Times: user=1,27 sys=1115,44, real=1117,82 secs. Server down for 15 minutes... with a 10g heap, maybe G1 can reduce this time? – nomoa Jul 17 '12 at 09:25
  • A very high sys time suggests the OS is struggling with the application. I would make sure there is no chance the application has been partially swapped to disk, as it must be completely in main memory for the GC to run in a reasonable time. Every 4 KB swapped to disk can add 10 ms to the GC time. (In your case, just 400 MB swapped to disk could cause this much delay.) – Peter Lawrey Jul 17 '12 at 09:30
  • In terms of tuning your GC, reducing the maximum heap size by 1 GB is likely to make much more difference than using G1. – Peter Lawrey Jul 17 '12 at 09:31
  • Many thanks (and sorry, my intention was not to hijack your question), I'll look into that (it's a bit difficult to tune because we make heavy use of VIRTUAL memory with Lucene; I'm afraid the OS swaps some HEAP in favor of Lucene's mmapped pages). – nomoa Jul 17 '12 at 09:56
  • JNI doesn't qualify: "As a special case, threads running JNI code can continue to run, because they use only handles. During a safepoint they must block instead of loading the contents of the handle." from http://openjdk.java.net/groups/hotspot/docs/HotSpotGlossary.html (safepoint definition) – Wacław Borowiec Jul 17 '12 at 10:02