
We have near caches configured on top of the main caches (on the data nodes). The documentation says: "Near caches are fully transactional and get updated or invalidated automatically whenever the data changes on the server nodes."

I am trying to understand how this communication of automatic updates works:

  • Will this communication be driven by "CacheWriteSynchronizationMode"? So if I choose FULL_SYNC mode, will the data node be blocked till the near cache updates are complete?
  • If the above is true, will choosing PRIMARY_SYNC mode unblock the data nodes?
  • Whenever cache entries expire on a data node, or are persisted to disk (for a persistent cache), will the data node immediately try to replicate this to the near caches on the client nodes?

We saw an issue where data nodes were continuously trying to connect to a client node, with this in the log lines:

org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3322) [ignite-core-2.9.1.jar:2.9.1]

We suspect that the data nodes were blocked because they were continuously trying to send messages to the near caches. This is based on the stack trace we saw.

Full Stack trace:

at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:191) ~[ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141) ~[ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3229) [ignite-core-2.9.1.jar:na]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:3013) [ignite-core-2.9.1.jar:na]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2960) [ignite-core-2.9.1.jar:na]
at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:2100) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:2365) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1964) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1935) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1917) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1324) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1261) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:1059) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.access$600(CacheContinuousQueryHandler.java:90) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler$2.onEntryUpdated(CacheContinuousQueryHandler.java:459) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:447) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2495) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2657) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2118) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1935) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1734) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3322) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:141) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:273) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:268) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:241) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:565) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.9.1.jar:2.9.1]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]

Caused by: org.apache.ignite.spi.communication.tcp.NodeForceEvictException: Node evicted forcefully from topology.
    at org.apache.ignite.spi.communication.tcp.IBTcpCommunicationSpi.createTcpClient(IBTcpCommunicationSpi.java:60) ~[ib-compute-grid-23.2.10.jar:na]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3375) [ignite-core-2.9.1.jar:na]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3180) [ignite-core-2.9.1.jar:na]
    ... 36 common frames omitted
Caused by: org.apache.ignite.spi.communication.tcp.internal.NodeUnreachableException: Failed to connect to all addresses of node 4ae96cc6-d3ba-4bb4-94f8-4c116d5bd9eb: [/10.228.30.249:47000]; inverse connection will be requested.
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3982) [ignite-core-2.9.1.jar:na]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3635) [ignite-core-2.9.1.jar:na]

Lokesh

1 Answer


I'm not sure that it's about near caching at all.

CacheWriteSynchronizationMode controls synchronization between primary and backup partitions, whereas near caching is about caching data locally on-heap; it mostly makes sense for client nodes, or for non-replicated caches on server nodes.
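For reference, here is a minimal config sketch showing where this setting lives; the cache name and key/value types are placeholders:

```java
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;

// Illustrative config fragment; "myCache" is a placeholder name.
CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");

// Controls how writes are synchronized between primary and backup partitions:
// FULL_SYNC    - wait for the primary and all backups,
// PRIMARY_SYNC - wait for the primary only (the default),
// FULL_ASYNC   - don't wait at all.
cacheCfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.PRIMARY_SYNC);
```

Note that with 0 backups (as in your setup), this setting has little practical effect, since there are no backup partitions to synchronize.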

This line:

org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3322) [ignite-core-2.9.1.jar:2.9.1]

is not about a near cache, though the naming could be confusing. It's just a regular cache update on a local node. Remember that you have to enable a near cache explicitly in the configuration, as highlighted in the docs, in order to make it work.
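For context, enabling a near cache on a client node looks roughly like this; the cache name, types, and eviction size are placeholders, and a running Ignite client instance (`ignite`) is assumed:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.eviction.lru.LruEvictionPolicy;
import org.apache.ignite.configuration.NearCacheConfiguration;

// Illustrative config fragment; "myCache" and the size limit are placeholders.
NearCacheConfiguration<Integer, String> nearCfg = new NearCacheConfiguration<>();

// Bound the on-heap near cache so it doesn't grow without limit.
nearCfg.setNearEvictionPolicyFactory(() -> new LruEvictionPolicy<>(10_000));

// On the client node, attach the near cache to the existing server-side cache.
IgniteCache<Integer, String> cache =
    ignite.getOrCreateNearCache("myCache", nearCfg);
```

Without an explicit step like this, reads and writes from a client go straight to the server-side (off-heap) cache.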

The stack trace is indeed a continuous query (CQ) notification routine.

a) Here a local update (more precisely, a DELETE) happens:

at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2495) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2657) [ignite-core-2.9.1.jar:2.9.1]
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2118) [ignite-core-2.9.1.jar:2.9.1]

GridDhtAtomicCache implies that an update is happening to a non-transactional cache. GridCacheMapEntry is the base adapter for all entries, whereas GridNearCacheEntry is for real near cache updates. In other words, I see that an update happens to a regular entry in off-heap memory.

b) Then Ignite checks if there are any active CQs, detects one, and sends a notification, waiting for an ACK:

    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:191) ~[ignite-core-2.9.1.jar:2.9.1]
    ...
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3229) [ignite-core-2.9.1.jar:na]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:3013) [ignite-core-2.9.1.jar:na]
    ...
    at org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:447) [ignite-core-2.9.1.jar:2.9.1]

c) At the very bottom you have an interesting message about a connectivity issue:

Failed to connect to all addresses of node 4ae96cc6-d3ba-4bb4-94f8-4c116d5bd9eb: [/10.228.30.249:47000]; inverse connection will be requested

I assume that the CQ could not notify a listener due to network issues and hangs. There have been improvements in the CQ logic since 2.9.1 that could solve the hang issue.

d) To have a more precise conclusion, you need to check the logs (thread dumps could help as well) from other nodes.

Alexandr Shapkin
  • Thanks @Alexandr. This is helpful. I wanted to clarify one point. Based on the details, the above issue doesn't seem to be related to the near cache, but I wanted to confirm: in case of any change in data on a server node, like expiry, will the update flow to the near cache on client nodes in real time? Our caches have 0 backups and the expiry time is 3 minutes, so we are evaluating whether it is worth having a near cache. – Lokesh Feb 22 '23 at 03:17
  • Yes, that's a pretty normal scenario. If an entry is removed or expired on a primary partition, Ignite is going to notify the client with the near cache, cleaning the on-heap entry. There are consistency guarantees, so a client must not see an old value once it's expired. – Alexandr Shapkin Feb 22 '23 at 11:45
  • Speaking of your config: BACKUPS=0 implies that you are OK with losing your data. I hope you understand the risks. Even if you have persistence enabled, if one of the nodes goes down, you will end up having LOST PARTITIONS. For an in-memory cache, the data loss is inevitable. – Alexandr Shapkin Feb 22 '23 at 11:47
  • If that is all expected, then having a near cache might be a good idea if you want to speed up some operations for a few caches. Remember that some features like SQL won't work with near caches; they still require working with "normal" off-heap data. – Alexandr Shapkin Feb 22 '23 at 11:49
  • Overall, I'd say it depends. You can give it a try and check the performance if the current one doesn't suit your needs. Based on my experience, having a near cache is quite a rare case. In general, I'd go with a replicated cache or with more backups. – Alexandr Shapkin Feb 22 '23 at 11:51
  • A funny finding to take care of - https://issues.apache.org/jira/browse/IGNITE-13858. Looks to be non-persistence only though. – Alexandr Shapkin Mar 03 '23 at 14:07
  • Thanks for sharing, this is very helpful. We have been planning to disable persistence as our cache expiry is very short. – Lokesh Mar 05 '23 at 06:32