We have a Java 8 application served by an Apache Tomcat 8 behind an Apache server, which is requesting multiple webservices in parallel using CXF. From time to time, there's one of them which lasts exactly 3 more seconds than the rest (which should be about 500ms only).
I've activated CXF debug, and now I have the place inside CXF where the 3 seconds are lost:
14/03/2018 09:20:49.061 [pool-838-thread-1] DEBUG o.a.cxf.transport.http.HTTPConduit - No Trust Decider for Conduit '{http://ws.webapp.com/}QueryWSImplPort.http-conduit'. An affirmative Trust Decision is assumed.
14/03/2018 09:20:52.077 [pool-838-thread-1] DEBUG o.a.cxf.transport.http.HTTPConduit - Sending POST Message with Headers to http://172.16.56.10:5050/services/quertServices Conduit :{http://ws.webapp.com/}QueryWSImplPort.http-conduit
As you could see, there're three seconds between these two lines. When the request is ok, it usually takes 0ms in between these two lines.
I've been looking into the CXF code, but no clue about the reason of these 3 secs...
The server application (which is also maintained by us), is served from another Apache Tomcat 6.0.49, which is behind another Apache server. The thing is that it seems that the server's Apache receives the request after the 3 seconds.
Anyone could help me?
EDIT: We've monitored the server's send/received packages and it seems that the client's server is sending a negotiating package at the time it should, while the server is replying after 3 seconds. These are the packages we've found:
481153 11:31:32 14/03/2018 2429.8542795 tomcat6.exe SOLTESTV010 SOLTESTV002 TCP TCP:Flags=CE....S., SrcPort=65160, DstPort=5050, PayloadLen=0, Seq=2858646321, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192 {TCP:5513, IPv4:62}
481686 11:31:35 14/03/2018 2432.8608381 tomcat6.exe SOLTESTV002 SOLTESTV010 TCP TCP:Flags=...A..S., SrcPort=5050, DstPort=65160, PayloadLen=0, Seq=436586023, Ack=2858646322, Win=8192 ( Negotiated scale factor 0x8 ) = 2097152 {TCP:5513, IPv4:62}
481687 11:31:35 14/03/2018 2432.8613607 tomcat6.exe SOLTESTV010 SOLTESTV002 TCP TCP:Flags=...A...., SrcPort=65160, DstPort=5050, PayloadLen=0, Seq=2858646322, Ack=436586024, Win=256 (scale factor 0x8) = 65536 {TCP:5513, IPv4:62}
481688 11:31:35 14/03/2018 2432.8628380 tomcat6.exe SOLTESTV010 SOLTESTV002 HTTP HTTP:Request, POST /services/consultaServices {HTTP:5524, TCP:5513, IPv4:62}
So, it seems is the server's Tomcat is the one which is blocked with something. Any clue?
EDIT 2:
Although that happened yesterday (the first server waiting 3s for the ack of the second), this is not the most common scenario. What it usually happens is what I described at the beginning (3 seconds between the two CXF's logs and the server receiving any request from the first one after 3 seconds.
There has been some times when the server (the one which receives the request), hangs for 3 seconds. For instance:
Server 1 sends 5 requests at the same time (suppossedly) to server 2.
Server 2 receives 4 of them, in that same second, and start to process them.
Server 2 finish processing 2 of those 4 requests in 30ms and replies to server 1.
More or less at this same second, there's nothing registered in the application logs.
After three seconds, logs are registered again, and the server finish to process the remaining 2 requests. So, although the process itself is about only some milliseconds, the response_time - request_time is 3 seconds and a few ms.
At this same time, the remaining request (the last one from the 5 request which were sent), is registered in the network monitor and is processed by the application in just a few milliseconds. However, the global processing time is anything more than 3s, as it has reached the server 3 seconds after being sent.
So there's like a hang in the middle of the process. 2 requests were successfully processed before this hang and replied in just a fraction of a second. 2 other request lasted a little bit more, the hang happened, and ended with a processing time of 3 seconds. The last one, reached the server just when the hang happened, so it didn't get into the application after the hang.
It sounds like a gc stop the world... but we have analyzed gc.logs and there's nothing wrong with that... could there be any other reason?
Thanks!
EDIT 3:
Looking at the TCP flags like the one I pasted last week, we've noticed that there are lots of packets with the CE flag, that is a notification of TCP congestion. We're not network experts, but have found that this could deal us to a 3 seconds of delay before the packet retransmission... could anyone give us some help about that?
Thanks. Kind regards.