
We have a Tomcat 9.0.62 server that reliably locks up within 1-3 hours. When we dump the thread states using `jstack -F <PID>` we see 359 threads in a BLOCKED state and 9 threads in an IN_NATIVE state. There is plenty of available memory, plenty of disk, and CPU usage is low. We have only seen this since we upgraded from OpenJDK 1.8.0u302 to 1.8.0u322 (on Solaris); when we downgrade to u302 the problem goes away. UPDATE: The problem also occurs on 1.8.0u332.

According to the Java documentation, a state of BLOCKED means "waiting for a monitor lock". What I don't understand is how an operation that does no synchronization can be in a BLOCKED state. Here's an example, blocked in Pattern.compile:

Thread t@1467: (state = BLOCKED)
 - java.util.regex.Pattern.compile() @bci=144, line=1693 (Compiled frame)
 - java.util.regex.Pattern.<init>(java.lang.String, int) @bci=62, line=1352 (Compiled frame)
 - java.util.regex.Pattern.compile(java.lang.String) @bci=6, line=1028 (Compiled frame)
 - oracle.jdbc.driver.OracleStatement.<init>(oracle.jdbc.driver.PhysicalConnection, int, int, int, int) @bci=377, line=6198 (Compiled frame)
 - oracle.jdbc.driver.OraclePreparedStatement.<init>(oracle.jdbc.driver.PhysicalConnection, java.lang.String, int, int, int, int) @bci=9, line=1363 (Compiled frame)
 - oracle.jdbc.driver.T4CPreparedStatement.<init>(oracle.jdbc.driver.PhysicalConnection, java.lang.String, int, int) @bci=14, line=43 (Compiled frame)
 - oracle.jdbc.driver.T4CDriverExtension.allocatePreparedStatement(oracle.jdbc.driver.PhysicalConnection, java.lang.String, int, int) @bci=9, line=75 (Compiled frame)
 - oracle.jdbc.driver.PhysicalConnection.prepareStatementInternal(java.lang.String, int, int) @bci=120, line=1648 (Compiled frame)
 - oracle.jdbc.driver.PhysicalConnection.prepareStatement(java.lang.String, int, int) @bci=4, line=1606 (Compiled frame)
 - oracle.jdbc.driver.PhysicalConnection.prepareStatement(java.lang.String) @bci=4, line=1533 (Compiled frame)
 - com.zaxxer.hikari.pool.ProxyConnection.prepareStatement(java.lang.String) @bci=7, line=337 (Compiled frame)
 - com.zaxxer.hikari.pool.HikariProxyConnection.prepareStatement(java.lang.String) @bci=2 (Compiled frame)
 - org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection() @bci=4, line=37 (Compiled frame)
 - org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded() @bci=12, line=108 (Compiled frame)
 - org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection() @bci=5, line=138 (Compiled frame)
 - org.hibernate.engine.jdbc.internal.StatementPreparerImpl.connection() @bci=4, line=50 (Compiled frame)
 - org.hibernate.engine.jdbc.internal.StatementPreparerImpl$1.doPrepare() @bci=11, line=89 (Compiled frame)
 - org.hibernate.engine.jdbc.internal.StatementPreparerImpl$StatementPreparationTemplate.prepareStatement() @bci=47, line=176 (Compiled frame)
 - org.hibernate.engine.jdbc.internal.StatementPreparerImpl.prepareQueryStatement(java.lang.String, boolean, org.hibernate.ScrollMode) @bci=74, line=151 (Compiled frame)
 - org.hibernate.loader.Loader.prepareQueryStatement(java.lang.String, org.hibernate.engine.spi.QueryParameters, org.hibernate.dialect.pagination.LimitHandler, boolean, org.hibernate.engine.spi.SharedSessionContractImplementor) @bci=39, line=2127 (Compiled frame)
 - org.hibernate.loader.Loader.executeQueryStatement(java.lang.String, org.hibernate.engine.spi.QueryParameters, boolean, java.util.List, org.hibernate.engine.spi.SharedSessionContractImplementor) @bci=55, line=2059 (Compiled frame)
 - org.hibernate.loader.Loader.executeQueryStatement(org.hibernate.engine.spi.QueryParameters, boolean, java.util.List, org.hibernate.engine.spi.SharedSessionContractImplementor) @bci=10, line=2037 (Compiled frame)
 - org.hibernate.loader.Loader.doQuery(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters, boolean, org.hibernate.transform.ResultTransformer) @bci=44, line=956 (Compiled frame)
 - org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters, boolean, org.hibernate.transform.ResultTransformer) @bci=62, line=357 (Compiled frame)
 - org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters, boolean) @bci=5, line=327 (Compiled frame)
 - org.hibernate.loader.Loader.loadEntity(org.hibernate.engine.spi.SharedSessionContractImplementor, java.lang.Object, org.hibernate.type.Type, java.lang.Object, java.lang.String, java.io.Serializable, org.hibernate.persister.entity.EntityPersister, org.hibernate.LockOptions, java.lang.Boolean) @bci=116, line=2440 (Compiled frame)
 - org.hibernate.loader.entity.AbstractEntityLoader.load(org.hibernate.engine.spi.SharedSessionContractImplementor, java.lang.Object, java.lang.Object, java.io.Serializable, org.hibernate.LockOptions, java.lang.Boolean) @bci=22, line=77 (Compiled frame)
 - org.hibernate.loader.entity.AbstractEntityLoader.load(java.lang.Object, org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.LockOptions) @bci=7, line=66 (Compiled frame)
 - org.hibernate.persister.entity.AbstractEntityPersister.loadByUniqueKey(java.lang.String, java.lang.Object, org.hibernate.engine.spi.SharedSessionContractImplementor) @bci=11, line=2486 (Compiled frame)
 - org.hibernate.type.EntityType.loadByUniqueKey(java.lang.String, java.lang.String, java.lang.Object, org.hibernate.engine.spi.SharedSessionContractImplementor) @bci=90, line=756 (Compiled frame)
 - org.hibernate.type.EntityType.resolve(java.lang.Object, org.hibernate.engine.spi.SharedSessionContractImplementor, java.lang.Object, java.lang.Boolean) @bci=50, line=467 (Compiled frame)
 - org.hibernate.engine.internal.TwoPhaseLoad$EntityResolver.lambda$static$0(org.hibernate.type.EntityType, java.lang.Object, org.hibernate.engine.spi.SharedSessionContractImplementor, java.lang.Object, java.lang.Boolean) @bci=6, line=576 (Compiled frame)
 - org.hibernate.engine.internal.TwoPhaseLoad$EntityResolver$$Lambda$1426.resolve(org.hibernate.type.EntityType, java.lang.Object, org.hibernate.engine.spi.SharedSessionContractImplementor, java.lang.Object, java.lang.Boolean) @bci=7 (Compiled frame)
 - org.hibernate.engine.internal.TwoPhaseLoad.initializeEntityEntryLoadedState(java.lang.Object, org.hibernate.engine.spi.EntityEntry, org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.internal.TwoPhaseLoad$EntityResolver) @bci=348, line=221 (Compiled frame)
 - org.hibernate.engine.internal.TwoPhaseLoad.initializeEntity(java.lang.Object, boolean, org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.event.spi.PreLoadEvent, org.hibernate.engine.internal.TwoPhaseLoad$EntityResolver) @bci=48, line=156 (Compiled frame)
 - org.hibernate.engine.internal.TwoPhaseLoad.initializeEntity(java.lang.Object, boolean, org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.event.spi.PreLoadEvent) @bci=7, line=126 (Compiled frame)
 - org.hibernate.loader.Loader.initializeEntitiesAndCollections(java.util.List, java.lang.Object, org.hibernate.engine.spi.SharedSessionContractImplementor, boolean, java.util.List) @bci=171, line=1201 (Compiled frame)
 - org.hibernate.loader.Loader.doQuery(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters, boolean, org.hibernate.transform.ResultTransformer) @bci=102, line=979 (Compiled frame)
 - org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters, boolean, org.hibernate.transform.ResultTransformer) @bci=62, line=357 (Compiled frame)
 - org.hibernate.loader.Loader.doList(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters, org.hibernate.transform.ResultTransformer) @bci=38, line=2868 (Compiled frame)
 - org.hibernate.loader.Loader.doList(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters) @bci=4, line=2850 (Compiled frame)
 - org.hibernate.loader.Loader.listIgnoreQueryCache(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters) @bci=4, line=2682 (Compiled frame)
 - org.hibernate.loader.Loader.list(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters, java.util.Set, org.hibernate.type.Type[]) @bci=49, line=2677 (Compiled frame)
 - org.hibernate.loader.hql.QueryLoader.list(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters) @bci=19, line=540 (Compiled frame)
 - org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(org.hibernate.engine.spi.SharedSessionContractImplementor, org.hibernate.engine.spi.QueryParameters) @bci=179, line=400 (Compiled frame)
 - org.apache.log4j.Category.getEffectiveLevel() @bci=23, line=442 (Compiled frame)
 - org.apache.log4j.Category.isEnabledFor(org.apache.log4j.Priority) @bci=20, line=751 (Compiled frame)
 - org.jboss.logging.Log4jLogger.isEnabled(org.jboss.logging.Logger$Level) @bci=10, line=36 (Compiled frame)
 - org.jboss.logging.Logger.isTraceEnabled() @bci=4, line=98 (Compiled frame)
 - org.jboss.logging.DelegatingBasicLogger.isTraceEnabled() @bci=4, line=54 (Compiled frame)
 - org.hibernate.engine.query.spi.HQLQueryPlan.performList(org.hibernate.engine.spi.QueryParameters, org.hibernate.engine.spi.SharedSessionContractImplementor) @bci=3, line=195 (Compiled frame)
 - org.hibernate.internal.SessionImpl.list(java.lang.String, org.hibernate.engine.spi.QueryParameters) @bci=53, line=1454 (Compiled frame)
 - org.hibernate.query.internal.AbstractProducedQuery.doList() @bci=85, line=1649 (Compiled frame)
 - org.hibernate.query.internal.AbstractProducedQuery.list() @bci=5, line=1617 (Compiled frame)
 - org.hibernate.query.Query.getResultList() @bci=1, line=165 (Compiled frame)
...
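
For reference, the quoted definition means a thread should only show BLOCKED while it is waiting to enter (or re-enter) a synchronized block whose monitor another thread holds. A minimal sketch (not our code, just to illustrate the normal way the state arises):

public class BlockedDemo {
    static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        // First thread acquires the monitor and holds it for a while.
        Thread holder = new Thread(() -> {
            synchronized (LOCK) {
                try { Thread.sleep(5000); } catch (InterruptedException ignored) { }
            }
        });
        // Second thread tries to enter the same monitor and must wait for it.
        Thread contender = new Thread(() -> {
            synchronized (LOCK) { }
        });
        holder.start();
        Thread.sleep(100);                          // let holder grab LOCK first
        contender.start();
        Thread.sleep(100);
        System.out.println(contender.getState());   // prints BLOCKED
    }
}

None of the frames in the Pattern.compile trace above enter a synchronized block like that, which is exactly why the reported state puzzles me.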

UPDATE: I've checked the GC logs and I see this entry 4 seconds after the last log line emitted by any of my Tomcat apps:

2022-05-19T00:57:45.782+0000: 2602.365: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 73400320 bytes, new threshold 15 (max 15)
- age   1:    9497240 bytes,    9497240 total
, 0.1257611 secs]
   [Parallel Time: 111.6 ms, GC Workers: 15]
      [GC Worker Start (ms): Min: 2602366.7, Avg: 2602366.8, Max: 2602366.9, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 1.4, Avg: 4.1, Max: 24.4, Diff: 22.9, Sum: 61.3]
      [Update RS (ms): Min: 6.9, Avg: 26.4, Max: 29.4, Diff: 22.5, Sum: 396.2]
         [Processed Buffers: Min: 4, Avg: 18.6, Max: 27, Diff: 23, Sum: 279]
      [Scan RS (ms): Min: 19.5, Avg: 20.2, Max: 20.9, Diff: 1.4, Sum: 303.3]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 60.2, Avg: 60.4, Max: 60.7, Diff: 0.6, Sum: 906.7]
      [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 1.2]
         [Termination Attempts: Min: 1, Avg: 194.5, Max: 260, Diff: 259, Sum: 2917]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 2.1]
      [GC Worker Total (ms): Min: 111.2, Avg: 111.4, Max: 111.5, Diff: 0.3, Sum: 1671.0]
      [GC Worker End (ms): Min: 2602478.1, Avg: 2602478.2, Max: 2602478.3, Diff: 0.2]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [String Dedup Fixup: 7.5 ms, GC Workers: 15]
      [Queue Fixup (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Table Fixup (ms): Min: 7.0, Avg: 7.2, Max: 7.4, Diff: 0.4, Sum: 108.1]
   [Clear CT: 0.6 ms]
   [Other: 6.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.5 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.4 ms]
      [Humongous Register: 0.4 ms]
      [Humongous Reclaim: 0.1 ms]
      [Free CSet: 2.9 ms]
   [Eden: 1098.0M(1098.0M)->0.0B(774.0M) Survivors: 20480.0K->77824.0K Heap: 3829.9M(6144.0M)->2788.9M(6144.0M)]
 [Times: user=1.78 sys=0.00, real=0.13 secs] 

What is missing there is a line like this which immediately follows every other GC entry in the log:

2022-05-19T00:57:42.081+0000: 2598.664: Total time for which application threads were stopped: 0.3016893 seconds, Stopping threads took: 0.0002452 seconds

The fact that this line is missing suggests to me that the GC never resumed the application threads. Given that, I'm assuming this is a GC bug in OpenJDK 1.8.0u322.

These are the memory- and GC-related settings on our Tomcat process:

-server 
-Xms2400m 
-Xmx6g 
-XX:+UseStringDeduplication 
-XX:CodeCacheMinimumFreeSpace=25m 
-XX:ReservedCodeCacheSize=200m 
-XX:-TieredCompilation 
-XX:MaxMetaspaceSize=1000m 
-XX:+UseG1GC 
-XX:MaxGCPauseMillis=200 
-Xloggc:gc.log.20220519_0014 
-verbose:gc 
-XX:+PrintGCDateStamps 
-XX:+PrintGCDetails 
-XX:+PrintGCCause 
-XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintTenuringDistribution 
-XX:+UseGCLogFileRotation 
-XX:NumberOfGCLogFiles=20 
-XX:GCLogFileSize=50M 
-XX:+HeapDumpOnOutOfMemoryError 
-d64

Here is the first line of every thread's stack trace: https://gist.github.com/davidmoten/c3b19afbfce981fec1f0722ec231f354. In that list you can see lots of innocuous calls that involve no synchronization (which, for me, points to GC).
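
For reference, a throwaway sketch of how such a list can be extracted from a saved `jstack -F` dump (the file name is illustrative; it just pairs each "Thread t@..." header with the frame that follows it):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class TopFrames {
    public static void main(String[] args) throws IOException {
        // Read a saved "jstack -F" dump and print each thread header together
        // with its top stack frame.
        List<String> lines = Files.readAllLines(Paths.get(args.length > 0 ? args[0] : "jstack-dump.txt"));
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.startsWith("Thread t@") && i + 1 < lines.size()) {
                System.out.println(line + "  " + lines.get(i + 1).trim());
            }
        }
    }
}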

Here also is the result of a `pstack` call on the Tomcat process (a different instance this time, sorry, but with the same behaviour): https://gist.github.com/davidmoten/699cf6e601fe018b8a2e1466d32ed6c7

In the pstack output I see thread traces like the one below, where the Quartz scheduler thread tries to allocate a new object but is stalled by the GC (the method GC_locker::stall_until_clear, which is where an allocating thread waits while the GC locker is active). CPU usage for the Tomcat process is 0.03% and I can't see an active GC thread in the list, so I'm assuming that the stop-the-world GC has finished without resuming the blocked threads. I suppose my next step is to report a bug on OpenJDK.

------------  lwp# 134 / thread# 134   lname: Scheduler_QuartzSchedulerThread ---------------
 fffffffeffae24a8 lwp_cond_wait (105365a48, 105365a30, 0, 0)
 fffffffefee8f508 __1cCosNPlatformEventEpark6M_v_ (105365a00, cfbb8, cf800, 0, fffffffeff562aa8, 105365a30) + 100
 fffffffefee1ab54 __1cHMonitorFIWait6MpnGThread_l_i_ (1001371b0, 1052fb800, 0, 1001371b0, 71ab8, 0) + a4
 fffffffefee1b9b8 __1cHMonitorEwait6Mblb_b_ (1001371b0, 1052fb800, 0, 1b80, 2, 104a9e790) + 2c0
 fffffffefe926714 __1cJGC_lockerRstall_until_clear6F_v_ (1001371b0, 8e308, 8e000, 0, fffffffeff562aa8, fffffffeff5f0db0) + 18c
 fffffffefe8de400 __1cPG1CollectedHeapXattempt_allocation_slow6MLCpI1_pnIHeapWord__ (10014c830, 105, 100168318, fffffffeff5d5050, ffffffff4acaee5c, fffffffeff562aa8) + 300
 fffffffefe8dde00 __1cPG1CollectedHeapRallocate_new_tlab6ML_pnIHeapWord__ (10014c830, 105, 642200000, fffffffeff5d0780, 9a800, 828) + 74
 fffffffefe7938e4 __1cNCollectedHeapXallocate_from_tlab_slow6FnLKlassHandle_pnGThread_L_pnIHeapWord__ (17, 105, 3, 10, fffffffeff562aa8, 1052fb858) + 17c
 fffffffefe59327c __1cNCollectedHeapbAcommon_mem_allocate_noinit6FnLKlassHandle_LpnGThread__pnIHeapWord__ (7c2017bf0, 3, 1052fb800, 0, fffffffeff562aa8, 0) + 84
 fffffffefe9bc8f8 __1cNInstanceKlassRallocate_instance6MpnGThread__pnPinstanceOopDesc__ (40000000, 1052fb800, fffffffefe9ba1f8, fffffffeff5e2220, cfc00, 0) + 40
 fffffffefea0a69c __1cSInterpreterRuntimeE_new6FpnKJavaThread_pnMConstantPool_i_v_ (fffffffeff5e2220, ffffffff4d0819f0, 0, 1052fb800, fffffffeff5f83c0, fffffffeff562aa8) + 64c
 ffffffff72822888 * org/quartz/simpl/RAMJobStore.acquireNextTriggers(JIJ)Ljava/util/List;+8
 ffffffff72808998 * org/quartz/core/QuartzSchedulerThread.run()V+165
 ffffffff728002a0 * StubRoutines (1)
 fffffffefea26430 __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (ffffffff4acafab0, 1, ffffffff4acaf958, ffffffff728114c0, ffffffff4ba9df18, 1052fb800) + 7c0
 fffffffefea249b0 __1cJJavaCallsMcall_virtual6FpnJJavaValue_nLKlassHandle_pnGSymbol_5pnRJavaCallArguments_pnGThread__v_ (ffffffff4acafab0, 1052fb800, 18, ffffffff4ba9df18, ffffffff4acaf958, 1052fb800) + 370
 fffffffefea24d00 __1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_pnGSymbol_6pnGThread__v_ (8, 10419d7a0, 2, 1, ffffffff4acaf9a1, ffffffff4acaf960) + 4c
 fffffffefeb62790 __1cMthread_entry6FpnKJavaThread_pnGThread__v_ (10419d7a0, e, 97800, 10419d790, fffffffeff5fa518, 6c1a88988) + d0
 fffffffeff0437f0 __1cKJavaThreadRthread_main_inner6M_v_ (1052fb800, fffffffefeb626c0, 90298, 0, fffffffeff562aa8, 902a0) + 90
 fffffffefee80a50 java_start (fffffffefe7dbff0, 1052fb800, c3800, c3acc, fffffffeff562aa8, fffffffeff5fba30) + 380
 fffffffeffadd43c _lwp_start (0, 0, 0, 0, 0, 0)

My understanding is that `jstack -F` suspends the threads anyway and resumes them once it has delivered its output. However, we made multiple `jstack -F` calls to the process and the stack traces did not change, which tells me that the threads really are blocked the whole time.
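
As a possible cross-check for a future run (a sketch only, not something we have deployed): an in-process dumper started at application startup would leave recent dumps with accurate Thread.State values right up to the hang, since it doesn't need to attach to an unresponsive JVM. The obvious caveat is that if the hang also stalls allocation on this thread, the dumper freezes along with everything else.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class PeriodicThreadDumper {

    // Start a low-frequency daemon thread that writes a full thread dump to
    // stdout (redirect to catalina.out or a dedicated file as appropriate).
    public static void start(long periodMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Thread dumper = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                StringBuilder sb = new StringBuilder("=== thread dump ===\n");
                for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                    sb.append('"').append(info.getThreadName()).append("\" ")
                      .append(info.getThreadState()).append('\n');
                    for (StackTraceElement frame : info.getStackTrace()) {
                        sb.append("    at ").append(frame).append('\n');
                    }
                }
                System.out.println(sb);
                try {
                    Thread.sleep(periodMillis);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "periodic-thread-dumper");
        dumper.setDaemon(true);
        dumper.start();
    }
}

Calling PeriodicThreadDumper.start(60_000) from, say, a ServletContextListener (hypothetical wiring) would give one dump per minute up to the point of the hang.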

  • Is there any chance the garbage collector is blocking itself? Do you [log GCs](https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingGCLogging)? (Are the other blocked threads as inconspicuous as Pattern.compile?) – Jeff Bowman May 19 '22 at 04:50
  • Actually, since you're in Hibernate and the classloader has an implicit lock, could it be [this JPQL classloading issue exactly](https://stackoverflow.com/q/26608973/1426891)? – Jeff Bowman May 19 '22 at 04:53
  • Thanks Jeff, yes there are many other innocuous looking blocked threads. Things like the HashSet constructor, HashSet.newNode, LinkedList.isPositionIndex. I'm thinking that GC is the key here in that a global stop-the-world GC has started and failed to release its lock. I'll have a look at our GC related settings and our GC logs (we do log them) and report back. – Dave Moten May 19 '22 at 08:55
  • I've added GC logs and a bit more. Looks to my eye like a GC bug. – Dave Moten May 19 '22 at 23:42
  • Nice evidence! I combed through the JDK bug list but didn't find anything obvious. Maybe it's a consequence of Hibernate criteria too, or maybe it's something brand new. I'd encourage a self-answer with some of the above, or if you find any more evidence. – Jeff Bowman May 20 '22 at 00:14
  • Also added pstack logs and some comments about them – Dave Moten May 20 '22 at 00:33
  • By the way, we don't use the Hibernate Criteria API at all – Dave Moten May 20 '22 at 00:38
  • Ah, then maybe that's a red herring. As a last check I'm curious whether your server fails after 1-3h on 1.8.0u322 without `+G1GC`, since that implementation seems to have some unique issues, but I don't know how mission-critical your system is or whether it bears experimentation. – Jeff Bowman May 20 '22 at 00:57
  • Good suggestion, fortunately we discovered the problem during release testing in non-prod so we can experiment. We do like the features of G1 though, particularly its suitability for large memory installations and high probability low pause times. Having high pause times would spark up our monitoring I suspect which we'd then have to tune. Yes it is a critical 24x7 system. – Dave Moten May 20 '22 at 03:54
  • jstack -F in my experience almost always marks thread states as BLOCKED, so the information you see might not be accurate as far as the thread state goes. If the JVM is responding, try `jstack -l` instead; that should give you some idea of the actual thread states and lock information. – vinay chilakamarri May 20 '22 at 21:36
  • Thanks, unfortunately the jvm is not responding so we had to use the `-F` flag. This [link](https://stackoverflow.com/questions/12061226/how-does-jstack-f-affect-a-running-java-process) has more details about the suspension of a process from that call. I guess we might have to make multiple `jstack -F` calls and see if there's any change in the stack traces. I'll report back. – Dave Moten May 22 '22 at 10:01
  • Multiple jstack -F calls to the same process show no change in the stack traces so I assume that the suspension of threads is long lasting and not due to the jstack -F call itself. – Dave Moten May 24 '22 at 00:07
