1

We have a Java 8 application served by an Apache Tomcat 8 behind an Apache server, which is requesting multiple webservices in parallel using CXF. From time to time, there's one of them which lasts exactly 3 more seconds than the rest (which should be about 500ms only).

I've activated CXF debug, and now I have the place inside CXF where the 3 seconds are lost:

14/03/2018 09:20:49.061 [pool-838-thread-1] DEBUG o.a.cxf.transport.http.HTTPConduit - No Trust Decider for Conduit '{http://ws.webapp.com/}QueryWSImplPort.http-conduit'. An affirmative Trust Decision is assumed. 
14/03/2018 09:20:52.077 [pool-838-thread-1] DEBUG o.a.cxf.transport.http.HTTPConduit - Sending POST Message with Headers to http://172.16.56.10:5050/services/quertServices Conduit :{http://ws.webapp.com/}QueryWSImplPort.http-conduit

As you could see, there're three seconds between these two lines. When the request is ok, it usually takes 0ms in between these two lines.

I've been looking into the CXF code, but no clue about the reason of these 3 secs...

The server application (which is also maintained by us), is served from another Apache Tomcat 6.0.49, which is behind another Apache server. The thing is that it seems that the server's Apache receives the request after the 3 seconds.

Anyone could help me?

EDIT: We've monitored the server's send/received packages and it seems that the client's server is sending a negotiating package at the time it should, while the server is replying after 3 seconds. These are the packages we've found:

481153  11:31:32 14/03/2018 2429.8542795    tomcat6.exe SOLTESTV010     SOLTESTV002     TCP TCP:Flags=CE....S., SrcPort=65160, DstPort=5050, PayloadLen=0, Seq=2858646321, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192  {TCP:5513, IPv4:62}
481686  11:31:35 14/03/2018 2432.8608381    tomcat6.exe SOLTESTV002     SOLTESTV010     TCP TCP:Flags=...A..S., SrcPort=5050, DstPort=65160, PayloadLen=0, Seq=436586023, Ack=2858646322, Win=8192 ( Negotiated scale factor 0x8 ) = 2097152    {TCP:5513, IPv4:62}

481687  11:31:35 14/03/2018 2432.8613607    tomcat6.exe SOLTESTV010     SOLTESTV002     TCP TCP:Flags=...A...., SrcPort=65160, DstPort=5050, PayloadLen=0, Seq=2858646322, Ack=436586024, Win=256 (scale factor 0x8) = 65536    {TCP:5513, IPv4:62}
481688  11:31:35 14/03/2018 2432.8628380    tomcat6.exe SOLTESTV010     SOLTESTV002     HTTP    HTTP:Request, POST /services/consultaServices    {HTTP:5524, TCP:5513, IPv4:62}

So, it seems is the server's Tomcat is the one which is blocked with something. Any clue?

EDIT 2:

Although that happened yesterday (the first server waiting 3s for the ack of the second), this is not the most common scenario. What it usually happens is what I described at the beginning (3 seconds between the two CXF's logs and the server receiving any request from the first one after 3 seconds.

There has been some times when the server (the one which receives the request), hangs for 3 seconds. For instance:

  1. Server 1 sends 5 requests at the same time (suppossedly) to server 2.

  2. Server 2 receives 4 of them, in that same second, and start to process them.

  3. Server 2 finish processing 2 of those 4 requests in 30ms and replies to server 1.

  4. More or less at this same second, there's nothing registered in the application logs.

  5. After three seconds, logs are registered again, and the server finish to process the remaining 2 requests. So, although the process itself is about only some milliseconds, the response_time - request_time is 3 seconds and a few ms.

  6. At this same time, the remaining request (the last one from the 5 request which were sent), is registered in the network monitor and is processed by the application in just a few milliseconds. However, the global processing time is anything more than 3s, as it has reached the server 3 seconds after being sent.

So there's like a hang in the middle of the process. 2 requests were successfully processed before this hang and replied in just a fraction of a second. 2 other request lasted a little bit more, the hang happened, and ended with a processing time of 3 seconds. The last one, reached the server just when the hang happened, so it didn't get into the application after the hang.

It sounds like a gc stop the world... but we have analyzed gc.logs and there's nothing wrong with that... could there be any other reason?

Thanks!

EDIT 3:

Looking at the TCP flags like the one I pasted last week, we've noticed that there are lots of packets with the CE flag, that is a notification of TCP congestion. We're not network experts, but have found that this could deal us to a 3 seconds of delay before the packet retransmission... could anyone give us some help about that?

Thanks. Kind regards.

Goyo
  • 455
  • 1
  • 9
  • 23

2 Answers2

1

Finally, it was everything caused by the network congestion we discovered looking at the TCP flags. Our network admins has been looking at the problem, trying to reduce the congestion, reducing the timeout to retransmit.

Goyo
  • 455
  • 1
  • 9
  • 23
0

The thing is that it seems that the server's Apache receives the request after the 3 seconds.

How do you figure this out ? If you're looking at Apache logs, you can be misleaded by wrong time stamps.

I first thought that your Tomcat 6 takes 3 seconds to answer instead of 0 to 500ms, but from the question and the comments, it is not the case.

Hypothesis 1 : Garbage Collector

The GC is known for introducing latency. Highlight the GC activity in your logs by using the GC verbosity parameters. If it is too difficult to correlate, you can use the jstat command with the gcutil option and you can compare easily with the Tomcat's log.

Hypothesis 2 : network timeout

Although 3s is a very short time (in comparison with the 21s TCP default timeout on Windows for example), it could be a timeout. To track the timeouts, you can use the netstat command. With netstat -an , look for the SYN_SENT connections, and with netstat -s look for the error counters. Please check if there is any network resource that must be resolved or accessed in this guilty webservice caller.

Eugène Adell
  • 3,089
  • 2
  • 18
  • 34
  • I saw that looking to the Microsoft Network Monitor, which registered the 4 lines I pasted in my edit. The first one is only replied after 3 seconds. I think the application inside the tomcat has not received anything by this time. The application itself doesn't take 3s in finishing. It always takes about 500ms (at max) when processing. We have GC with verbose, because we felt it could be the problem, but we haven't found anything wong in its gc log. And it's not possible to make a thread dump when this happens, as this is happenning randomly and only lasts for 3 seconds. – Goyo Mar 15 '18 at 07:27
  • I've re-read your question and your comment, I understand this is always the same call which gives sometimes the problem, and the problem always looks the same : 3s before sending the request. 3s is too short for launching a Thread Dump, I agree, but you can monitor with JCONSOLE or VISUALVM to see the consuming threads maybe, but it will be fastidious to wait for the problem to repeat. I will edit my answer with more ideas on how to proceed. – Eugène Adell Mar 15 '18 at 08:05
  • Thanks Eugène. I've updated my original post with some other scenarios. We'll have a look to the netstat, as you mention – Goyo Mar 15 '18 at 11:01