I have an application that currently relies on an Infinispan replicated cache to share a work queue across all nodes. The queue itself is a standard linked structure; its head, tail, and size pointers are persisted as entries in the Infinispan map.
We've upgraded from Infinispan 7.2.5 to 9.4.16 and lock performance is now much worse than before. Under 7.2.5, lock and synchronization behavior was solid with no issues; under 9.4.16 we see lock timeouts and far more failures. I managed to capture thread dumps from the two nodes while both were trying to initialize the queue at the same time.
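For context, InfinispanQueue.initialize (line 88 in the traces below) does roughly the following. This is a simplified sketch, not the exact production code; every key name other than QUEUE_TAIL_C_PARTY (which appears in the error below) is an illustrative stand-in:

// Simplified sketch of InfinispanQueue.initialize(). Key names other
// than QUEUE_TAIL_C_PARTY are illustrative stand-ins.
import javax.transaction.TransactionManager;
import org.infinispan.AdvancedCache;

public class InfinispanQueue {
    private final AdvancedCache<String, Object> cache;

    public InfinispanQueue(AdvancedCache<String, Object> cache) {
        this.cache = cache;
    }

    public void initialize() throws Exception {
        TransactionManager tm = cache.getTransactionManager();
        tm.begin();
        boolean ok = false;
        try {
            // With locking="PESSIMISTIC", lock() acquires cluster-wide
            // locks on the pointer keys for the rest of the transaction.
            cache.lock("QUEUE_HEAD_C_PARTY", "QUEUE_TAIL_C_PARTY", "QUEUE_SIZE_C_PARTY");
            if (cache.get("QUEUE_HEAD_C_PARTY") == null) {
                cache.put("QUEUE_HEAD_C_PARTY", 0L);
                cache.put("QUEUE_TAIL_C_PARTY", 0L);
                cache.put("QUEUE_SIZE_C_PARTY", 0L);
            }
            ok = true;
        } finally {
            if (ok) tm.commit(); else tm.rollback();
        }
    }
}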
Node #1 partial stack trace from thread dump 2021-04-20 13:45:13:
"default task-2" #600 prio=5 os_prio=0 tid=0x000000000c559000 nid=0x1f8a waiting on condition [0x00007f4df3f72000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006e1f4fec0> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105)
at org.infinispan.interceptors.impl.SimpleAsyncInvocationStage.get(SimpleAsyncInvocationStage.java:38)
at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:250)
at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1077)
at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1057)
at org.infinispan.cache.impl.AbstractDelegatingAdvancedCache.lock(AbstractDelegatingAdvancedCache.java:286)
at org.infinispan.cache.impl.EncoderCache.lock(EncoderCache.java:318)
at com.siperian.mrm.match.InfinispanQueue.initialize(InfinispanQueue.java:88)
Node #2 partial stack trace from thread dump 2021-04-20 13:45:04:
"default task-2" #684 prio=5 os_prio=0 tid=0x0000000011f26000 nid=0x3c60 waiting on condition [0x00007f55107e4000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000746bd36d8> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105)
at org.infinispan.interceptors.impl.SimpleAsyncInvocationStage.get(SimpleAsyncInvocationStage.java:38)
at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:250)
at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1077)
at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1057)
at org.infinispan.cache.impl.AbstractDelegatingAdvancedCache.lock(AbstractDelegatingAdvancedCache.java:286)
at org.infinispan.cache.impl.EncoderCache.lock(EncoderCache.java:318)
at com.siperian.mrm.match.InfinispanQueue.initialize(InfinispanQueue.java:88)
Client-side error that appears on the console of the machine running Node #1. Note that the requestor is Node #1's transaction (N1618938080334) while the lock on QUEUE_TAIL_C_PARTY is held by a transaction from the other node (N1618938062946), so the request blocks for the full 60-second acquire timeout:
2021-04-20 13:45:49,069 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (jgroups-15,infinispan-cleanse-cluster_192.168.0.24_cmx_system105,N1618938080334-63633(machine-id=M1618938080334)) ISPN000136: Error executing command LockControlCommand on Cache 'orclmdm-MDM_SAMPLE105/FUZZY_MATCH', writing keys []: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 60 seconds for key QUEUE_TAIL_C_PARTY and requestor GlobalTx:N1618938080334-63633(machine-id=M1618938080334):429. Lock is held by GlobalTx:N1618938062946-60114(machine-id=M1618938062946):420
at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.get(DefaultLockManager.java:288)
at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.lock(DefaultLockManager.java:261)
at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$CompositeLockPromise.lock(DefaultLockManager.java:348)
at org.infinispan.interceptors.locking.PessimisticLockingInterceptor.localLockCommandWork(PessimisticLockingInterceptor.java:208)
at org.infinispan.interceptors.locking.PessimisticLockingInterceptor.lambda$new$0(PessimisticLockingInterceptor.java:46)
at org.infinispan.interceptors.InvocationSuccessFunction.apply(InvocationSuccessFunction.java:25)
at org.infinispan.interceptors.impl.QueueAsyncInvocationStage.invokeQueuedHandlers(QueueAsyncInvocationStage.java:118)
at org.infinispan.interceptors.impl.QueueAsyncInvocationStage.accept(QueueAsyncInvocationStage.java:81)
at org.infinispan.interceptors.impl.QueueAsyncInvocationStage.accept(QueueAsyncInvocationStage.java:30)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
at org.infinispan.remoting.transport.AbstractRequest.complete(AbstractRequest.java:67)
at org.infinispan.remoting.transport.impl.MultiTargetRequest.onResponse(MultiTargetRequest.java:102)
at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:52)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1369)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1272)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:126)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1417)
at org.jgroups.JChannel.up(JChannel.java:816)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:900)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:128)
at org.jgroups.protocols.RSVP.up(RSVP.java:163)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:177)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:872)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:240)
at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1008)
at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:734)
at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:389)
at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:590)
at org.jgroups.protocols.BARRIER.up(BARRIER.java:171)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:131)
at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:203)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:253)
at org.jgroups.protocols.MERGE3.up(MERGE3.java:280)
at org.jgroups.protocols.Discovery.up(Discovery.java:295)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1250)
at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:87)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Infinispan Config:
<?xml version="1.0" encoding="UTF-8"?>
<infinispan
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:infinispan:config:9.4 http://www.infinispan.org/schemas/infinispan-config-9.4.xsd"
xmlns="urn:infinispan:config:9.4">
<jgroups>
<stack-file name="mdmudp" path="$cmx.home$/jgroups-udp.xml" />
<stack-file name="mdmtcp" path="$cmx.home$/jgroups-tcp.xml" />
</jgroups>
<cache-container name="MDMCacheManager" statistics="true"
shutdown-hook="DEFAULT">
<transport stack="mdmudp" cluster="infinispan-cluster"
node-name="$node$" machine="$machine$" />
<jmx domain="org.infinispan.mdm.hub"/>
<replicated-cache name="FUZZY_MATCH" statistics="true" unreliable-return-values="false">
<locking isolation="READ_COMMITTED" acquire-timeout="60000"
concurrency-level="5000" striping="false" />
<transaction
transaction-manager-lookup="org.infinispan.transaction.lookup.GenericTransactionManagerLookup"
stop-timeout="30000" auto-commit="true" locking="PESSIMISTIC"
mode="NON_XA" notifications="true" />
</replicated-cache>
</cache-container>
</infinispan>
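For what it's worth, my reading of the FUZZY_MATCH settings above, expressed with the programmatic API of Infinispan 9.4 just to make the locking and transaction semantics explicit (we configure via XML, not this code):

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.transaction.LockingMode;
import org.infinispan.transaction.TransactionMode;
import org.infinispan.util.concurrent.IsolationLevel;

ConfigurationBuilder builder = new ConfigurationBuilder();
builder.clustering().cacheMode(CacheMode.REPL_SYNC)        // <replicated-cache>, sync by default
       .locking()
           .isolationLevel(IsolationLevel.READ_COMMITTED)  // isolation="READ_COMMITTED"
           .lockAcquisitionTimeout(60_000)                 // acquire-timeout="60000"
           .concurrencyLevel(5000)                         // concurrency-level="5000"
           .useLockStriping(false)                         // striping="false"
       .transaction()
           .transactionMode(TransactionMode.TRANSACTIONAL)
           .lockingMode(LockingMode.PESSIMISTIC)           // locking="PESSIMISTIC"
           .useSynchronization(true)                       // mode="NON_XA"
           .autoCommit(true);                              // auto-commit="true"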
We are using UDP multicast with a mostly default stack; here is the UDP config:
<!--
Default stack using IP multicasting. It is similar to the "udp"
stack in stacks.xml, but doesn't use streaming state transfer and flushing
author: Bela Ban
-->
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
<UDP
mcast_port="${jgroups.udp.mcast_port:46688}"
ip_ttl="4"
tos="8"
ucast_recv_buf_size="5M"
ucast_send_buf_size="5M"
mcast_recv_buf_size="5M"
mcast_send_buf_size="5M"
max_bundle_size="64K"
enable_diagnostics="true"
thread_naming_pattern="cl"
thread_pool.enabled="true"
thread_pool.min_threads="2"
thread_pool.max_threads="8"
thread_pool.keep_alive_time="5000"/>
<PING />
<MERGE3 max_interval="30000"
min_interval="10000"/>
<FD_SOCK/>
<FD_ALL/>
<VERIFY_SUSPECT timeout="1500" />
<BARRIER />
<pbcast.NAKACK2 xmit_interval="500"
xmit_table_num_rows="100"
xmit_table_msgs_per_row="2000"
xmit_table_max_compaction_time="30000"
use_mcast_xmit="false"
discard_delivered_msgs="true"/>
<UNICAST3 xmit_interval="500"
xmit_table_num_rows="100"
xmit_table_msgs_per_row="2000"
xmit_table_max_compaction_time="60000"
conn_expiry_timeout="0"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
max_bytes="4M"/>
<pbcast.GMS print_local_addr="true" join_timeout="2000"
view_bundling="true"/>
<UFC max_credits="2M"
min_threshold="0.4"/>
<MFC max_credits="2M"
min_threshold="0.4"/>
<FRAG2 frag_size="60K" />
<RSVP resend_interval="2000" timeout="10000"/>
<pbcast.STATE_TRANSFER />
<!-- pbcast.FLUSH /-->
</config>
What happens is that both nodes time out and the queue never initializes properly (the pointer keys are left null). Up to 24 threads on each node (48 in total) may access the shared queue; the enqueue path they follow is sketched below. Any ideas on configuration would be great. Thanks in advance.
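For reference, the enqueue path (again a simplified sketch, in the same class as the initialize() above; the ENTRY_ key layout is illustrative):

// Each of the up-to-24 threads per node (48 cluster-wide) locks the
// same QUEUE_TAIL_C_PARTY key inside a pessimistic transaction before
// advancing the tail pointer, so every enqueue serializes on that key.
public void enqueue(Object workItem) throws Exception {
    TransactionManager tm = cache.getTransactionManager();
    tm.begin();
    boolean ok = false;
    try {
        cache.lock("QUEUE_TAIL_C_PARTY");                   // cluster-wide lock
        Long tail = (Long) cache.get("QUEUE_TAIL_C_PARTY"); // current tail index
        cache.put("ENTRY_" + tail, workItem);               // store the item
        cache.put("QUEUE_TAIL_C_PARTY", tail + 1);          // advance the tail
        ok = true;
    } finally {
        if (ok) tm.commit(); else tm.rollback();
    }
}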