We have been seeing inconsistent network failures when trying to set up Infinispan on EC2 (large instances) over JGroups 3.1.0-FINAL running on Amazon's 64-bit Linux AMI. An empty cache starts fine and seems to work for a while; however, once the cache is full, synchronizing a new server causes the cache to lock up.
We decided to roll our own cache but are seeing approximately the same behavior. Tens of megabytes are exchanged during synchronization, but the network is not being flooded. There is a back-and-forth data -> ack conversation at the application level, but it looks like some of the messages never reach the remote node.
Looking at the UNICAST trace logging, I see the following:
# my application starts a cache refresh operation
01:02:12.003 [Incoming-1,mprewCache,i-f6a9d986] DEBUG c.m.e.q.c.l.DistributedMapManager - i-f6a9d986: from i-d2e29fa2: search:REFRESH
01:02:12.003 [Incoming-1,mprewCache,i-f6a9d986] INFO c.m.e.q.c.l.DistributedMapRequest - starting REFRESH from i-d2e29fa2 for map search, map-size 62373
01:02:12.003 [Incoming-1,mprewCache,i-f6a9d986] DEBUG c.m.e.q.c.l.DistributedMapManager - i-f6a9d986: to i-d2e29fa2: search:PUT_MANY, 50 keyValues
# transmits a block of 50 values to the remote but this never seems to get there
01:02:12.004 [Incoming-1,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> DATA(i-d2e29fa2: #11, conn_id=10)
# acks another window
01:02:12.004 [Incoming-1,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> ACK(i-d2e29fa2: #4)
# these XMITs repeat over and over until 01:30:40
01:02:12.208 [Timer-2,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> XMIT(i-d2e29fa2: #6)
01:02:12.209 [Timer-2,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> XMIT(i-d2e29fa2: #7)
01:02:12.209 [Timer-2,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> XMIT(i-d2e29fa2: #8)
...
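To put numbers on the retransmissions, a small helper along these lines could tally the XMIT lines per destination and sequence number. This is only a sketch: the class name XmitTally and its regular expression are hypothetical, derived solely from the line format in the excerpt above, and are not part of JGroups.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts UNICAST XMIT lines per destination and seqno. */
final class XmitTally {
    // Assumed line shape, taken from the trace excerpt:
    //   "... i-f6a9d986 --> XMIT(i-d2e29fa2: #6)"
    private static final Pattern XMIT =
            Pattern.compile("--> XMIT\\(([^:]+): #(\\d+)\\)");

    /** Returns destination -> (seqno -> number of XMIT lines seen). */
    static Map<String, Map<Long, Integer>> tally(Iterable<String> lines) {
        Map<String, Map<Long, Integer>> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = XMIT.matcher(line);
            if (m.find()) {
                String dest = m.group(1).trim();
                long seqno = Long.parseLong(m.group(2));
                counts.computeIfAbsent(dest, d -> new TreeMap<>())
                      .merge(seqno, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // First argument: path to a file containing the trace output.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        tally(lines).forEach((dest, bySeq) ->
                bySeq.forEach((seq, n) ->
                        System.out.println(dest + " #" + seq + ": " + n + " XMIT(s)")));
    }
}
```

After compiling, running `java XmitTally trace.log` (where trace.log is a placeholder for wherever the trace output is captured) prints one line per destination and sequence number. A sequence number whose count keeps growing is one the sender keeps retransmitting without ever seeing an ack for it.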
Here's our JGroups stack. We replace the PING protocol at runtime with our own EC2_PING version, which uses AWS calls to find other cluster member candidates. This is not a connection issue.
Any ideas why some of the packets are not arriving at their destination?