2

We're using ZGC on JDK 17.0.1. The application successfully runs without unexpected GC stalls for a period of time, after which it gets into a state where the ZGC does not free much memory when the GC is being done. This results in multiple periodic Allocation Stalls within the application from then on.

The application itself does a lot of allocation at a periodic interval of around 35 milliseconds. This is roughly 2GB/s allocation when it’s at peak load. On a typical start in the morning, there is about 30 minutes in which the load is low, till a point at which external data starts arriving, and load is high from then on. The load profile does change as the application is running as a result. Due to the predictable allocation during load, we've set the collection at an interval using the ZCollectionInterval.

We cannot consistently reproduce the Allocation Stalls, even between different machines where the load is very similar. Oddly, if the application is restarted while the load is already high, then it does not typically run into issues.

We did play around with the UseDynamicNumberOfGCThreads as well previously, but it seems to come up with a number of threads that is 2. The chart attached uses SoftMaxHeapSize=1G. This is a bit aggressive, but we did see the same issue without setting the SofMaxHeapSize. We did also notice that the Concurrent Mark phase of the GC seems to increase leading up to this issue. We also didn’t see anything of note about ZGC within the Release notes of JDK versions up till 17.0.5.

Has anyone run into similar issues before? We're unsure as to why this happens or if this is expected behavior when setting the flags as such. As much hasn't been set and the rarity at which it occurs, we were hoping some one can give some pointers as to if this is actually a known issue with an existing fix or we're just setting it up wrong.

These are the relevant flags.

-XX:-OmitStackTraceInFastThrow
-XX:CICompilerCount=2
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=4
-XX:ConcGCThreads=4
-Xmx8192m
-Xms8192m
-Xss8m
-Xlog:gc*=info:file=${LT_APP_LOG_PATH_FORMAT}.gc:time,level,tags:filecount=32,filesize=64M
-XX:+UseZGC
-XX:+AlwaysPreTouch
-XX:+UseLargePages
-XX:+UseTransparentHugePages
-XX:ZCollectionInterval=4
-Djdk.nio.maxCachedBufferSize=262144
-XX:+HeapDumpOnOutOfMemoryError
-XX:SoftMaxHeapSize=1G
[2022-11-24T09:21:12.378+0530][info][gc,phases   ] GC(669) Concurrent Select Relocation Set 8.713ms
[2022-11-24T09:21:12.378+0530][info][gc,phases   ] GC(669) Pause Relocate Start 0.008ms
[2022-11-24T09:21:12.420+0530][info][gc,phases   ] GC(669) Concurrent Relocate 41.564ms
[2022-11-24T09:21:12.420+0530][info][gc,load     ] GC(669) Load: 36.38/31.86/25.10
[2022-11-24T09:21:12.420+0530][info][gc,mmu      ] GC(669) MMU: 2ms/99.1%, 5ms/99.6%, 10ms/99.8%, 20ms/99.9%, 50ms/99.9%, 100ms/100.0%
[2022-11-24T09:21:12.420+0530][info][gc,marking  ] GC(669) Mark: 4 stripe(s), 2 proactive flush(es), 1 terminate flush(es), 0 completion(s), 0 continuation(s) 
[2022-11-24T09:21:12.420+0530][info][gc,marking  ] GC(669) Mark Stack Usage: 32M
[2022-11-24T09:21:12.420+0530][info][gc,nmethod  ] GC(669) NMethods: 10353 registered, 328 unregistered
[2022-11-24T09:21:12.420+0530][info][gc,metaspace] GC(669) Metaspace: 44M used, 46M committed, 1072M reserved
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Soft: 2311 encountered, 98 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Weak: 2559 encountered, 652 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Final: 56 encountered, 7 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Phantom: 27425 encountered, 27313 discovered, 27030 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Small Pages: 5764 / 11528M, Empty: 4752M, Relocated: 58M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Medium Pages: 3 / 96M, Empty: 64M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Large Pages: 0 / 0M, Empty: 0M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Forwarding Usage: 29M
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Min Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Max Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Soft Max Capacity: 1024M(6%)
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)                Mark Start          Mark End        Relocate Start      Relocate End           High               Low         
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)  Capacity:    16384M (100%)      16384M (100%)      16384M (100%)      16384M (100%)      16384M (100%)      16384M (100%)   
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)      Free:     4760M (29%)           0M (0%)         4788M (29%)        5392M (33%)        5394M (33%)           0M (0%)     
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)      Used:    11624M (71%)       16384M (100%)      11596M (71%)       10992M (67%)       16384M (100%)      10990M (67%)    
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)      Live:         -              5860M (36%)        5860M (36%)        5860M (36%)            -                  -          
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Allocated:         -              4760M (29%)        4788M (29%)        4858M (30%)            -                  -          
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)   Garbage:         -              5763M (35%)         947M (6%)          273M (2%)             -                  -          
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Reclaimed:         -                  -              4816M (29%)        5490M (34%)            -                  -          
[2022-11-24T09:21:12.420+0530][info][gc          ] GC(669) Garbage Collection (Allocation Stall) 11624M(71%)->10992M(67%)
[2022-11-24T09:21:12.479+0530][info][gc,start    ] GC(670) Garbage Collection (Allocation Stall)
[2022-11-24T09:21:12.479+0530][info][gc,ref      ] GC(670) Clearing All SoftReferences
[2022-11-24T09:21:12.479+0530][info][gc,task     ] GC(670) Using 4 workers
[2022-11-24T09:21:12.479+0530][info][gc,phases   ] GC(670) Pause Mark Start 0.009ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ] === Garbage Collection Statistics =======================================================================================================================
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]                                                              Last 10s              Last 10m              Last 10h                Total
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]                                                              Avg / Max             Avg / Max             Avg / Max             Avg / Max
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]   Collector: Garbage Collection Cycle                   3410.686 / 3486.475    844.670 / 3486.475    464.952 / 3486.475    464.952 / 3486.475    ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]  Contention: Mark Segment Reset Contention                     7 / 15                3 / 30                1 / 30                1 / 30          ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]  Contention: Mark SeqNum Reset Contention                      0 / 1                 0 / 3                 0 / 3                 0 / 3           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Allocation Stall                                  1 / 5                 0 / 7                 0 / 7                 0 / 7           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Allocation Stall                            677.602 / 900.858     303.367 / 900.858     303.367 / 900.858     303.367 / 900.858     ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: GC Locker Stall                                   0 / 0                 0 / 0                 0 / 1                 0 / 1           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: GC Locker Stall                               0.000 / 0.000         0.000 / 0.000         0.033 / 0.058         0.033 / 0.058       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Relocation Stall                                  0 / 0                 0 / 0                 0 / 0                 0 / 0           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Relocation Stall                              0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Allocation Rate                                1479 / 1860           1003 / 2264            329 / 2264            329 / 2264        MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Out Of Memory                                     0 / 0                 0 / 0                 0 / 0                 0 / 0           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Flush                                  0 / 0                 0 / 32                0 / 32                0 / 32          MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Hit L1                               198 / 827             177 / 901              59 / 901              59 / 901         ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Hit L2                               541 / 930             316 / 1052            103 / 1052            103 / 1052        ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Hit L3                                 1 / 15               12 / 657               4 / 657               4 / 657         ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Miss                                   0 / 0                 0 / 1                 0 / 1                 0 / 1           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Uncommit                                          0 / 0                 0 / 0                 0 / 0                 0 / 0           MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Undo Object Allocation Failed                     0 / 0                 0 / 1                 0 / 1                 0 / 1           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Undo Object Allocation Succeeded                  0 / 3                 4 / 855               1 / 855               1 / 855         ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Undo Page Allocation                              0 / 3                 0 / 4                 0 / 4                 0 / 4           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Mark                            3356.120 / 3437.684    812.574 / 3437.684    445.012 / 3437.684    445.012 / 3437.684    ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Mark Continue                      0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Mark Free                          0.001 / 0.001         0.001 / 0.002         0.001 / 0.002         0.001 / 0.002       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Process Non-Strong References      8.650 / 8.848        12.746 / 47.124        8.067 / 47.124        8.067 / 47.124      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Relocate                          36.550 / 41.564       14.436 / 55.995        8.357 / 55.995        8.357 / 55.995      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Reset Relocation Set               0.067 / 0.080         0.023 / 0.080         0.013 / 0.080         0.013 / 0.080       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Select Relocation Set              7.214 / 8.713         3.382 / 29.365        2.579 / 29.365        2.579 / 29.365      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Pause Mark End                                0.011 / 0.011         0.011 / 0.018         0.009 / 0.018         0.009 / 0.018       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Pause Mark Start                              0.008 / 0.009         0.008 / 0.016         0.008 / 0.016         0.008 / 0.016       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Pause Relocate Start                          0.008 / 0.008         0.007 / 0.013         0.006 / 0.013         0.006 / 0.013       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Classes Purge                      0.266 / 0.497         0.411 / 5.255         0.251 / 5.255         0.251 / 5.255       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Classes Unlink                     5.018 / 5.358         9.216 / 28.725        5.991 / 28.725        5.991 / 28.725      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Mark                            3352.212 / 3437.123    855.661 / 3437.123    454.341 / 3437.123    454.341 / 3437.123    ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Mark Try Flush                     0.030 / 0.055         0.760 / 42.368        0.442 / 42.368        0.442 / 42.368      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Mark Try Terminate                 0.703 / 1.059         1.393 / 27.549        1.126 / 27.549        1.126 / 27.549      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent References Enqueue                 0.003 / 0.004         0.003 / 0.013         0.002 / 0.013         0.002 / 0.013       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent References Process                 3.055 / 3.524         1.183 / 9.454         0.655 / 9.454         0.655 / 9.454       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots ClassLoaderDataGraph         0.090 / 0.223         0.135 / 12.400        0.105 / 12.400        0.105 / 12.400      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots CodeCache                    0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots JavaThreads                  0.411 / 1.162         0.520 / 12.391        0.407 / 12.391        0.407 / 12.391      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots OopStorageSet                0.018 / 0.038         0.051 / 12.043        0.030 / 12.043        0.030 / 12.043      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Weak Roots OopStorageSet           0.259 / 0.263         0.449 / 4.890         0.331 / 4.890         0.331 / 4.890       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Pause Mark Try Complete                       0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      System: Java Threads                                     28 / 28               26 / 29               21 / 29               21 / 29          threads
[2022-11-24T09:21:13.180+0530][info][gc,stats    ] =========================================================================================================================================================
[2022-11-24T09:21:15.944+0530][info][gc,phases   ] GC(670) Concurrent Mark 3465.071ms
[2022-11-24T09:21:15.944+0530][info][gc,phases   ] GC(670) Pause Mark End 0.011ms
[2022-11-24T09:21:15.944+0530][info][gc,phases   ] GC(670) Concurrent Mark Free 0.001ms
[2022-11-24T09:21:15.953+0530][info][gc,phases   ] GC(670) Concurrent Process Non-Strong References 8.537ms
[2022-11-24T09:21:15.953+0530][info][gc,phases   ] GC(670) Concurrent Reset Relocation Set 0.062ms
[2022-11-24T09:21:15.953+0530][info][gc          ] Allocation Stall (lt-primary-pool-2) 704.524ms
[2022-11-24T09:21:15.953+0530][info][gc          ] Allocation Stall (lt-primary-pool-0) 704.168ms
[2022-11-24T09:21:15.953+0530][info][gc          ] Allocation Stall (lt-secondary-pool-0) 711.668ms
[2022-11-24T09:21:15.955+0530][info][gc          ] Allocation Stall (lt-primary-pool-1) 707.109ms
[2022-11-24T09:21:15.955+0530][info][gc          ] Allocation Stall (lt-secondary-pool-1) 713.196ms
[2022-11-24T09:21:15.957+0530][info][gc,phases   ] GC(670) Concurrent Select Relocation Set 4.082ms

java version "17.0.1" 2021-10-19 LTS
Java(TM) SE Runtime Environment (build 17.0.1+12-LTS-39)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.1+12-LTS-39, mixed mode, sharing)

Graphs created with GCViewer on the GC logs.

ZGC clean up issues ZGC clean up issues

No issues allocation when everything is fine No issues allocation when everything is fine

MT0
  • 143,790
  • 11
  • 59
  • 117
0x00
  • 141
  • 2
  • 10
  • I have no hands-on experience with zgc but there are somethings that stick out here. I note that you have MX == MS. With other GC:s that means that heap ergonomics is turned off. If that is the same with zgc, you might consider setting MS different from MX. I also note that you have THP turned on. I've run into excessive gc times with THP turned on due to memory page coalescing. Turning it off was our solution. Assuming you're not running windows, we caught that with iostat. Hope this helps somewhat. – Erik Dec 01 '22 at 09:45
  • I would have expected THP to reduce the likelihood of paging time during high allocation. But it is something we can try, true. The size settings are set to the same to ensure that no memory resizing happens for the heap as the work load is fixed. ZGC unexpectedly has quite good performance overall, and memory sizing we used to do with G1 can be definitely smaller. We have not really reduced the heap sizes since our original G1 GC setup. – 0x00 Dec 02 '22 at 07:59

1 Answers1

2

We did not get to a reason for the issues, however, we did make some progress in examining the issue closer.

When examing the ZGC code for the OpenJDK from github.

https://github.com/openjdk/zgc/blob/1f8813495e8184b6c38319df1c2cb70de7811a76/src/hotspot/share/gc/z/zDirector.cpp#L184

It has code to work out an allocation rate. We also enabled the debug logging of the ZGC.

-Xlog:gc*=debug:file=gc.log:time,level,tags

The logging that looks like the final lines below, predicts the allocation rate to be 0 for extended periods of time for the application start. We suspect this might be one of the reasons for the issue. However, in looking at the GCGraphs it doesn't fully explain why it gets into some panic state when there is enough memory during normal function.

We set the -XX:ZAllocationSpikeTolerance=10 to ensure that the estimated memory use would get close to our normal allocation rate of 2000MB/s. The predicted memory usage for the ZGC from that point on appeared to be correct. We've run this settings for a while now on our services without issues so far. We also removed the -XX:SoftMaxHeapSize= as the GC frequency required seems to be correctly estimated by just the XX:ZAllocationSpikeTolerance.

Example debug logging for Predicted allocation rate.

[2022-12-02T12:45:29.825+0530][debug][gc,director    ] Rule: Allocation Rate (Dynamic GC Workers), MaxAllocRate: 9624.6MB/s (+/-0.6%), Free: 3412MB, GCCPUTime: 0.136, GCDuration: 0.136s, TimeUntilOOM: 0.352s, TimeUntilGC: 0.116s, GCWorkers: 1 -> 1
[2022-12-02T12:45:29.825+0530][debug][gc,director    ] Rule: High Usage, Free: 3412MB(83.3%)
[2022-12-02T12:45:29.825+0530][debug][gc,director    ] Rule: Proactive, AcceptableGCInterval: 3.394s, TimeSinceLastGC: 0.473s, TimeUntilGC: 2.921s
[2022-12-02T12:45:29.925+0530][debug][gc,alloc       ] Allocation Rate: 958.0MB/s, Predicted: 960.9MB/s, Avg: 954.4(+/-5.7)MB/s
[2022-12-02T12:45:29.925+0530][debug][gc,director    ] Rule: Timer, Interval: 4.000s, TimeUntilGC: 3.427s
[2022-12-02T12:45:29.925+0530][debug][gc,director    ] Select GC Workers (Normal), AvoidLongGCWorkers: 0.013, AvoidOOMGCWorkers: 0.393, LastGCWorkers: 1.000, GCWorkers: 0.393
0x00
  • 141
  • 2
  • 10
  • We finally managed to get to the bottom of the issue. It isn't particularly obvious at all but we were incorrectly using compareAndExchange instead of compareAndSet of an atomic variable. When using booleans, you might just end up checking the wrong value to compare against. When attempting to throttle internal logic, we were not actually throttling at all. Causing extremely fast allocation in some sections. Not a ZGC issue thankfully. – 0x00 Mar 08 '23 at 09:43