I have an application that currently relies on an Infinispan replicated cache to share a work queue across all nodes. The queue itself is a standard linked structure; its head, tail, and size pointers are persisted as entries in the Infinispan map.
We've upgraded from Infinispan 7.2.5 to 9.4.16 and lock performance is now much worse than before. Under 7.2.5, lock and synchronization behavior was solid with no issues; under 9.4.16 we see lock timeouts and far more failures. I managed to capture thread dumps from the two nodes while both were trying to initialize the queue at the same time.
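For context, InfinispanQueue.initialize (line 88 in the traces below) does roughly the following. This is a simplified sketch, not the exact production code; every key name other than QUEUE_TAIL_C_PARTY (which appears in the error below) is an illustrative stand-in:

// Simplified sketch of InfinispanQueue.initialize(). Key names other
// than QUEUE_TAIL_C_PARTY are illustrative stand-ins.
import javax.transaction.TransactionManager;
import org.infinispan.AdvancedCache;

public class InfinispanQueue {
    private final AdvancedCache<String, Object> cache;

    public InfinispanQueue(AdvancedCache<String, Object> cache) {
        this.cache = cache;
    }

    public void initialize() throws Exception {
        TransactionManager tm = cache.getTransactionManager();
        tm.begin();
        boolean ok = false;
        try {
            // With locking="PESSIMISTIC", lock() acquires cluster-wide
            // locks on the pointer keys for the rest of the transaction.
            cache.lock("QUEUE_HEAD_C_PARTY", "QUEUE_TAIL_C_PARTY", "QUEUE_SIZE_C_PARTY");
            if (cache.get("QUEUE_HEAD_C_PARTY") == null) {
                cache.put("QUEUE_HEAD_C_PARTY", 0L);
                cache.put("QUEUE_TAIL_C_PARTY", 0L);
                cache.put("QUEUE_SIZE_C_PARTY", 0L);
            }
            ok = true;
        } finally {
            if (ok) tm.commit(); else tm.rollback();
        }
    }
}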
Node #1 partial stack trace from thread dump 2021-04-20 13:45:13:
"default task-2" #600 prio=5 os_prio=0 tid=0x000000000c559000 nid=0x1f8a waiting on condition [0x00007f4df3f72000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006e1f4fec0> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105)
at org.infinispan.interceptors.impl.SimpleAsyncInvocationStage.get(SimpleAsyncInvocationStage.java:38)
at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:250)
at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1077)
at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1057)
at org.infinispan.cache.impl.AbstractDelegatingAdvancedCache.lock(AbstractDelegatingAdvancedCache.java:286)
at org.infinispan.cache.impl.EncoderCache.lock(EncoderCache.java:318)
at com.siperian.mrm.match.InfinispanQueue.initialize(InfinispanQueue.java:88)
Node #2 partial stack trace from thread dump 2021-04-20 13:45:04:
"default task-2" #684 prio=5 os_prio=0 tid=0x0000000011f26000 nid=0x3c60 waiting on condition [0x00007f55107e4000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000746bd36d8> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105)
at org.infinispan.interceptors.impl.SimpleAsyncInvocationStage.get(SimpleAsyncInvocationStage.java:38)
at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:250)
at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1077)
at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1057)
at org.infinispan.cache.impl.AbstractDelegatingAdvancedCache.lock(AbstractDelegatingAdvancedCache.java:286)
at org.infinispan.cache.impl.EncoderCache.lock(EncoderCache.java:318)
at com.siperian.mrm.match.InfinispanQueue.initialize(InfinispanQueue.java:88)
Client-side error that appears on the console of the machine running Node #1. Note that the requestor is Node #1's transaction (N1618938080334) while the lock on QUEUE_TAIL_C_PARTY is held by a transaction from the other node (N1618938062946), so the request blocks for the full 60-second acquire timeout:
2021-04-20 13:45:49,069 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (jgroups-15,infinispan-cleanse-cluster_192.168.0.24_cmx_system105,N1618938080334-63633(machine-id=M1618938080334)) ISPN000136: Error executing command LockControlCommand on Cache 'orclmdm-MDM_SAMPLE105/FUZZY_MATCH', writing keys []: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 60 seconds for key QUEUE_TAIL_C_PARTY and requestor GlobalTx:N1618938080334-63633(machine-id=M1618938080334):429. Lock is held by GlobalTx:N1618938062946-60114(machine-id=M1618938062946):420
at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.get(DefaultLockManager.java:288)
at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.lock(DefaultLockManager.java:261)
at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$CompositeLockPromise.lock(DefaultLockManager.java:348)
at org.infinispan.interceptors.locking.PessimisticLockingInterceptor.localLockCommandWork(PessimisticLockingInterceptor.java:208)
at org.infinispan.interceptors.locking.PessimisticLockingInterceptor.lambda$new$0(PessimisticLockingInterceptor.java:46)
at org.infinispan.interceptors.InvocationSuccessFunction.apply(InvocationSuccessFunction.java:25)
at org.infinispan.interceptors.impl.QueueAsyncInvocationStage.invokeQueuedHandlers(QueueAsyncInvocationStage.java:118)
at org.infinispan.interceptors.impl.QueueAsyncInvocationStage.accept(QueueAsyncInvocationStage.java:81)
at org.infinispan.interceptors.impl.QueueAsyncInvocationStage.accept(QueueAsyncInvocationStage.java:30)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
at org.infinispan.remoting.transport.AbstractRequest.complete(AbstractRequest.java:67)
at org.infinispan.remoting.transport.impl.MultiTargetRequest.onResponse(MultiTargetRequest.java:102)
at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:52)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1369)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1272)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:126)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1417)
at org.jgroups.JChannel.up(JChannel.java:816)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:900)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:128)
at org.jgroups.protocols.RSVP.up(RSVP.java:163)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:177)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:872)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:240)
at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1008)
at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:734)
at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:389)
at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:590)
at org.jgroups.protocols.BARRIER.up(BARRIER.java:171)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:131)
at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:203)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:253)
at org.jgroups.protocols.MERGE3.up(MERGE3.java:280)
at org.jgroups.protocols.Discovery.up(Discovery.java:295)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1250)
at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:87)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Infinispan Config:
<?xml version="1.0" encoding="UTF-8"?>
<infinispan
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:infinispan:config:9.4 http://www.infinispan.org/schemas/infinispan-config-9.4.xsd"
xmlns="urn:infinispan:config:9.4">
<jgroups>
<stack-file name="mdmudp" path="$cmx.home$/jgroups-udp.xml" />
<stack-file name="mdmtcp" path="$cmx.home$/jgroups-tcp.xml" />
</jgroups>
<cache-container name="MDMCacheManager" statistics="true"
shutdown-hook="DEFAULT">
<transport stack="mdmudp" cluster="infinispan-cluster"
node-name="$node$" machine="$machine$" />
<jmx domain="org.infinispan.mdm.hub"/>
<replicated-cache name="FUZZY_MATCH" statistics="true" unreliable-return-values="false">
<locking isolation="READ_COMMITTED" acquire-timeout="60000"
concurrency-level="5000" striping="false" />
<transaction
transaction-manager-lookup="org.infinispan.transaction.lookup.GenericTransactionManagerLookup"
stop-timeout="30000" auto-commit="true" locking="PESSIMISTIC"
mode="NON_XA" notifications="true" />
</replicated-cache>
</cache-container>
</infinispan>
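For what it's worth, my reading of the FUZZY_MATCH settings above, expressed with the programmatic API of Infinispan 9.4 just to make the locking and transaction semantics explicit (we configure via XML, not this code):

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.transaction.LockingMode;
import org.infinispan.transaction.TransactionMode;
import org.infinispan.util.concurrent.IsolationLevel;

ConfigurationBuilder builder = new ConfigurationBuilder();
builder.clustering().cacheMode(CacheMode.REPL_SYNC)        // <replicated-cache>, sync by default
       .locking()
           .isolationLevel(IsolationLevel.READ_COMMITTED)  // isolation="READ_COMMITTED"
           .lockAcquisitionTimeout(60_000)                 // acquire-timeout="60000"
           .concurrencyLevel(5000)                         // concurrency-level="5000"
           .useLockStriping(false)                         // striping="false"
       .transaction()
           .transactionMode(TransactionMode.TRANSACTIONAL)
           .lockingMode(LockingMode.PESSIMISTIC)           // locking="PESSIMISTIC"
           .useSynchronization(true)                       // mode="NON_XA"
           .autoCommit(true);                              // auto-commit="true"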
We are using UDP multicast with a mostly default stack; here is the UDP config:
<!--
Default stack using IP multicasting. It is similar to the "udp"
stack in stacks.xml, but doesn't use streaming state transfer and flushing
author: Bela Ban
-->
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
<UDP
mcast_port="${jgroups.udp.mcast_port:46688}"
ip_ttl="4"
tos="8"
ucast_recv_buf_size="5M"
ucast_send_buf_size="5M"
mcast_recv_buf_size="5M"
mcast_send_buf_size="5M"
max_bundle_size="64K"
enable_diagnostics="true"
thread_naming_pattern="cl"
thread_pool.enabled="true"
thread_pool.min_threads="2"
thread_pool.max_threads="8"
thread_pool.keep_alive_time="5000"/>
<PING />
<MERGE3 max_interval="30000"
min_interval="10000"/>
<FD_SOCK/>
<FD_ALL/>
<VERIFY_SUSPECT timeout="1500" />
<BARRIER />
<pbcast.NAKACK2 xmit_interval="500"
xmit_table_num_rows="100"
xmit_table_msgs_per_row="2000"
xmit_table_max_compaction_time="30000"
use_mcast_xmit="false"
discard_delivered_msgs="true"/>
<UNICAST3 xmit_interval="500"
xmit_table_num_rows="100"
xmit_table_msgs_per_row="2000"
xmit_table_max_compaction_time="60000"
conn_expiry_timeout="0"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
max_bytes="4M"/>
<pbcast.GMS print_local_addr="true" join_timeout="2000"
view_bundling="true"/>
<UFC max_credits="2M"
min_threshold="0.4"/>
<MFC max_credits="2M"
min_threshold="0.4"/>
<FRAG2 frag_size="60K" />
<RSVP resend_interval="2000" timeout="10000"/>
<pbcast.STATE_TRANSFER />
<!-- pbcast.FLUSH /-->
</config>
What happens is that both nodes time out and the queue never initializes properly (the pointer keys are left null). Up to 24 threads on each node (48 in total) may access the shared queue; the enqueue path they follow is sketched below. Any ideas on configuration would be great. Thanks in advance.
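For reference, the enqueue path (again a simplified sketch, in the same class as the initialize() above; the ENTRY_ key layout is illustrative):

// Each of the up-to-24 threads per node (48 cluster-wide) locks the
// same QUEUE_TAIL_C_PARTY key inside a pessimistic transaction before
// advancing the tail pointer, so every enqueue serializes on that key.
public void enqueue(Object workItem) throws Exception {
    TransactionManager tm = cache.getTransactionManager();
    tm.begin();
    boolean ok = false;
    try {
        cache.lock("QUEUE_TAIL_C_PARTY");                   // cluster-wide lock
        Long tail = (Long) cache.get("QUEUE_TAIL_C_PARTY"); // current tail index
        cache.put("ENTRY_" + tail, workItem);               // store the item
        cache.put("QUEUE_TAIL_C_PARTY", tail + 1);          // advance the tail
        ok = true;
    } finally {
        if (ok) tm.commit(); else tm.rollback();
    }
}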