
I am trying to use ZGC in a production environment, so I upgraded from JDK 8 to OpenJDK 15 and from Tomcat 8 to Tomcat 8.5, and changed the GC-related options. However, CPU usage rises to 1000+% a few hours after the JVM starts (normal CPU usage is 100-300%). Sometimes, when the CPU is high, there are many ICBufferFull safepoints in the log file:

[2020-11-12T19:00:43.669+0800] Safepoint "ICBufferFull", Time since last: 41374119 ns, Reaching safepoint: 2026134 ns, At safepoint: 85993 ns, Total: 2112127 ns
[2020-11-12T19:00:43.672+0800] Safepoint "ICBufferFull", Time since last: 2521598 ns, Reaching safepoint: 1109894 ns, At safepoint: 57235 ns, Total: 1167129 ns
[2020-11-12T19:00:43.676+0800] Safepoint "ICBufferFull", Time since last: 2892867 ns, Reaching safepoint: 1240834 ns, At safepoint: 59633 ns, Total: 1300467 ns
[2020-11-12T19:00:43.681+0800] Safepoint "ICBufferFull", Time since last: 2870110 ns, Reaching safepoint: 1425837 ns, At safepoint: 54052 ns, Total: 1479889 ns

If I take the node offline for about 30 seconds, the CPU usage goes down; after I bring it back online, it works normally for hours until the CPU goes high again. I checked the source code: ICBufferFull means the inline cache buffers are full, but I could not find an option to increase their size. Could anyone help with this? Thanks!

The GC options are:

export JAVA_OPTS='-Xms10g -Xmx10g -XX:+UseLargePages -XX:ZAllocationSpikeTolerance=5 -XX:ParallelGCThreads=8 -XX:ConcGCThreads=4 -Xss2m -XX:+UseZGC -Xlog:gc,gc+phases,safepoint:file=/logs/gc.log:t:filecount=10,filesize=10m -XX:+HeapDumpOnOutOfMemoryError'

Edit 1:

As a reference, I have another host that still works well, running JDK 8 with CMS; the requests to the two hosts are almost identical.

I profiled it with async-profiler. The hottest Java method is java/lang/ThreadLocal$ThreadLocalMap.getEntryAfterMiss, which accounts for 50+% of samples, while the hottest native method is ZMark::try_mark_object(ZMarkCache*, unsigned long, bool), which accounts for only 0.41%. I compared the ThreadLocal-related source code in JDK 8 and OpenJDK 15; it seems unchanged.
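
For illustration, here is a simplified sketch (not my actual code; the sizes and counts are arbitrary) of the pattern that makes getEntryAfterMiss hot: a long-lived thread keeps creating and dropping ThreadLocals, leaving stale weak-keyed entries in its ThreadLocalMap that later lookups must probe past.

public class ThreadLocalChurn {
    public static void main(String[] args) {
        ThreadLocal<String> live = ThreadLocal.withInitial(() -> "live");
        for (int i = 0; i < 1_000_000; i++) {
            // Each dropped ThreadLocal leaves a stale entry in this thread's
            // ThreadLocalMap until the GC clears the weak reference to it.
            ThreadLocal<byte[]> dead = new ThreadLocal<>();
            dead.set(new byte[128]);
            // A lookup that collides with stale slots probes linearly through
            // them -- that probing is ThreadLocalMap.getEntryAfterMiss.
            live.get();
        }
    }
}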

Edit 2:

I found a similar bug on JIRA. My application is Lucene-related too, and from the GC log, the CPU usage is high when the weak reference count is 1M+. So the question becomes: how can ZGC be made to collect weak references more aggressively?

Edit 3:

From the following logs, with System.gc() called every 3 seconds, it seems my application produces too many weak references. Strangely, the produced count keeps increasing after start, even though the request rate is almost constant from 11:00 to 17:00. Note that the CPU dropped from 600% to 400% automatically after GC(9821), where ~250K references were enqueued. At GC(10265) the node was taken offline, and ~770K were enqueued. Why does the enqueued count stay so small for so long? Does it mean the objects are not being fully reclaimed? (A small demo of the weak-reference lifecycle follows the logs.)

[2020-11-19T11:00:00.245+0800] GC(992) Weak: 155658 encountered, 72334 discovered, 0 enqueued
[2020-11-19T12:00:00.397+0800] GC(2194) Weak: 220462 encountered, 122216 discovered, 1380 enqueued
[2020-11-19T12:00:03.411+0800] GC(2195) Weak: 220598 encountered, 107228 discovered, 677 enqueued
[2020-11-19T13:00:00.497+0800] GC(3395) Weak: 222536 encountered, 82199 discovered, 1713 enqueued
[2020-11-19T14:00:00.647+0800] GC(4613) Weak: 443946 encountered, 291651 discovered, 292 enqueued
[2020-11-19T15:00:01.173+0800] GC(5819) Weak: 338065 encountered, 124351 discovered, 815 enqueued
[2020-11-19T16:00:01.283+0800] GC(7022) Weak: 459070 encountered, 298932 discovered, 353 enqueued
[2020-11-19T17:00:01.426+0800] GC(8222) Weak: 688162 encountered, 519369 discovered, 4648 enqueued
[2020-11-19T18:00:01.556+0800] GC(9430) Weak: 1078757 encountered, 928748 discovered, 1691 enqueued
[2020-11-19T18:18:43.595+0800] GC(9821) Weak: 1022080 encountered, 841168 discovered, 247352 enqueued
[2020-11-19T18:18:46.592+0800] GC(9822) Weak: 774253 encountered, 568564 discovered, 3938 enqueued
[2020-11-19T18:40:49.616+0800] GC(10265) Weak: 842081 encountered, 788825 discovered, 767288 enqueued
[2020-11-19T18:40:52.593+0800] GC(10266) Weak: 74876 encountered, 18186 discovered, 1 enqueued
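
For clarity, here is a tiny standalone demo (class name made up) of the lifecycle behind these counters: a WeakReference whose referent is reclaimed gets cleared and then pushed onto its ReferenceQueue, which is what the "enqueued" column counts.

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class WeakRefDemo {
    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        WeakReference<Object> ref = new WeakReference<>(new Object(), queue);

        System.gc(); // the referent is only weakly reachable, so the GC may clear it
        // remove() returns the reference once it has been enqueued
        // ("enqueued" in the GC log); null if nothing arrives within the timeout
        Reference<?> enqueued = queue.remove(1000);
        System.out.println(enqueued == ref); // true once the GC processed the reference
    }
}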
xiaoxxcool
  • Can you share the VM options? Are you using Graal JIT? This might be related: https://bugs.openjdk.java.net/browse/JDK-8241494 – spongebob Nov 13 '20 at 06:50
  • @FrancescoMenzani, I added the JVM options. Is Graal the default JIT? I found this bug before, but there the time to reach the safepoint was long; I get 200+ ICBufferFull safepoints in 2s, each one very short, with a max total time of less than 3ms. – xiaoxxcool Nov 13 '20 at 07:20
  • 1
    Have you profiled Java process (e.g. with async-profiler) when the cpu usage is so high? – apangin Nov 13 '20 at 07:21
  • Why are you reducing the stack size? 256k is 1/4 of the default stack size. Try not changing the default, or even increasing it. – spongebob Nov 13 '20 at 07:27
  • @apangin not yet. I enabled JFR, but found nothing. – xiaoxxcool Nov 13 '20 at 07:29
  • @FrancescoMenzani it's a historical option from JDK 8, which worked well; I only changed the GC-related options. I will restore the default value, thanks. – xiaoxxcool Nov 13 '20 at 07:33
  • Do you have enough memory left for non-heap allocations? – spongebob Nov 13 '20 at 08:16
  • @FrancescoMenzani yes, there is 20 GB+ of memory available on the host – xiaoxxcool Nov 13 '20 at 08:29
  • Lots of `ICBufferFull` safepoints is likely a consequence of a problem, not the initial cause. Profiling `InlineCacheBuffer::refill_ic_stubs` may give a clue – apangin Nov 14 '20 at 16:21
  • @apangin I profiled, but did not find this method; please see the edited question. – xiaoxxcool Nov 16 '20 at 09:02
  • What is the stack trace where `ThreadLocalMap.getEntryAfterMiss` is called from? When this method is hot, it usually means there are too many ThreadLocals *and* threads. – apangin Nov 16 '20 at 23:58
  • @apangin it is called from Lucene; yes, it is ThreadLocal-related, thank you! – xiaoxxcool Nov 17 '20 at 01:54

2 Answers


This issue finally turned out to be a JDK issue, which is supposed to be resolved in JDK 16, and it can be bypassed with a thread pool that periodically terminates old threads and creates new ones. I moved my application to Jetty for other reasons; the modified Jetty thread pool is here. It has now worked perfectly for days; try this if anyone runs into the same problem.
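
A minimal sketch of that idea (not the actual modified Jetty pool; the pool size and timeout are arbitrary): let worker threads time out and be replaced, so each dying worker takes its ThreadLocalMap, including the accumulated stale entries, with it.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RecyclingPool {
    public static ThreadPoolExecutor create() {
        // Fixed-size pool whose workers may time out: when an idle thread
        // exits, its ThreadLocalMap is freed, and a fresh thread is created
        // for the next task.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                16, 16, 10, TimeUnit.MINUTES, new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }
}

Note this variant only recycles idle threads; a pool that retires each worker after a fixed number of tasks would recycle them even under constant load.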

xiaoxxcool

Modern GCs delay collection of weakly-reachable objects.

Collections triggered by System.gc() always process weakly-reachable objects and are concurrent by default, so you could implement a periodic task that invokes the method.
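
A minimal sketch of such a periodic task (the interval is arbitrary; tune it to the rate at which your weak references are produced):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicGc {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Under ZGC this triggers a concurrent collection that also processes
        // weakly-reachable objects. It is a no-op if -XX:+DisableExplicitGC is set.
        scheduler.scheduleAtFixedRate(System::gc, 5, 5, TimeUnit.MINUTES);
    }
}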

spongebob
  • Thanks, I am using -XX:ZCollectionInterval=3 instead; waiting for the result – xiaoxxcool Nov 17 '20 at 10:59
  • -XX:ZCollectionInterval=3 did not help, so I now have a timer that calls System.gc() every 3s, but the weak reference count still increased to 1M+ in 8 hours. The CPU is at 600% now; I posted some logs in edit 3 – xiaoxxcool Nov 19 '20 at 10:09
  • @xxcool You should let ZGC do its job as it pleases, without forcing it to run every 3 seconds. Then, to collect weak references, you should invoke `System.gc()` every now and then, depending on the rate at which they are produced. Also make sure `System.gc()` effectively triggers a GC, and try removing all the custom tweaks to ZGC from the VM options. – spongebob Nov 19 '20 at 17:46
  • From the very beginning, no VM options were added except -XX:+UseZGC; I made those changes because of the CPU usage. Now almost every GC is triggered by `System.gc()`; I will reduce the frequency of the timer. – xiaoxxcool Nov 20 '20 at 01:50
  • The weak references are produced by Lucene at a high frequency, maybe dozens per second. I reduced the GC timer to 5 minutes, but it did not help. Anyway, thank you for the help. The ZGC developers say the issue may be resolved in the next release. – xiaoxxcool Nov 24 '20 at 14:01
  • @xxcool So what was the issue? Did ZGC have an issue? Did `System.gc()` not collect weak references? – spongebob Nov 24 '20 at 15:01
  • 1
    It is a jdk issue, cause some gc could not reclaim weak references properly, see https://bugs.openjdk.java.net/browse/JDK-8188055 – xiaoxxcool Nov 25 '20 at 00:44