WCF - Random client timeout when making multiple calls

Question

I have a WPF client requesting data via WCF service hosted in IIS 7. The service method makes a call to a stored procedure (SQL 2012) using EF to retrieve some data.

There's a lot of data to load so the client makes several calls to the service method in an effort to "break up" the data loading and avoid large payloads and timeouts.

We use the generated service proxies that extend from System.ServiceModel.ClientBase<T>.

We're also using a custom http binding with binary encoding (from here) - actual implementation shown here:

<customBinding>
   <binding name="CustomBinding_IPointDataAccess" closeTimeout="00:01:00"
      openTimeout="00:01:00" receiveTimeout="00:10:00" sendTimeout="00:01:00">
      <binaryMessageEncoding maxReadPoolSize="64" maxWritePoolSize="16" maxSessionSize="2048">
      <readerQuotas maxDepth="2147483647" maxStringContentLength="2147483647" maxArrayLength="2147483647" maxBytesPerRead="2147483647" maxNameTableCharCount="16384" />
      </binaryMessageEncoding>
      <httpTransport manualAddressing="false" maxBufferPoolSize="524288"
         maxReceivedMessageSize="2147483647" allowCookies="false" authenticationScheme="Anonymous" bypassProxyOnLocal="false" decompressionEnabled="true" hostNameComparisonMode="StrongWildcard" keepAliveEnabled="true" maxBufferSize="2147483647" proxyAuthenticationScheme="Anonymous" realm="" transferMode="Buffered" unsafeConnectionNtlmAuthentication="false" useDefaultWebProxy="true" />
   </binding>

Also, dynamic compression is turned on in IIS. I can view the requests in Fiddler, and the size of the message's body is fine (~50KB) and 99% of the requests return in a second or two. Perfect!

However, with almost every iteration, there is one call in the bunch that takes minutes to complete, and I don't know why... My sendTimeOut on the client was at 1 minute and naturally that one call would fail. I extended it to 10 minutes, and the call seems to complete in a little more than 2 minutes - although sometimes it would take even longer. The problem appears very random - it could be the first call, it could be the 30th call. But it's very reproducible.

I placed some logging around the the stored procedure call in the WCF service method and it executes and gets data back in under a second. So, I don't think it's a database issue.

Using Fiddler, the problematic call generates output similar to the following:

ACTUAL PERFORMANCE
--------------
ClientConnected:     14:02:42.959
ClientBeginRequest:  14:03:01.224
GotRequestHeaders:   14:03:01.224
ClientDoneRequest:   14:03:01.574
Determine Gateway:   0ms
DNS Lookup:      0ms
TCP/IP Connect:  46ms
HTTPS Handshake:     0ms
ServerConnected:     14:05:16.021
FiddlerBeginRequest: 14:05:16.021
ServerGotRequest:    14:05:16.021
ServerBeginResponse: 14:03:04.784
GotResponseHeaders:  14:05:16.561
ServerDoneResponse:  14:05:16.611
ClientBeginResponse: 14:05:16.611
ClientDoneResponse:  14:05:16.611

Notice the significant time between ServerBeginResponse and GotResponseHeaders. This seems strikingly similar to the issue seen here.

I enabled WCF Service Tracing, and at a quick glance, there are no errors or warnings, but I can't really make much sense out of what I'm looking at beyond the basics.

How can I nail down what and where the problem lies? Is it serialization? Is it a network issue? Can the server not keep up with the client sending so many requests?

I've tried adjusting the WCF Throttling in the config file by adding the appropriate serviceBehaviors, but this did not make a difference.

I should mention that I am doing this over a VPN connection, but other things like file transfers, remote desktop connections work just fine. It seems pretty reliable.

I can provide more details if necessary.

Edit (6.10.2013): Not sure if this is related or just a fluke, but a couple times, I've noticed that on the problematic call, the Body size is significantly less than the others. This is not the case each time, but it may provide some clues. Here is a screen capture from Fiddler to show you how consistent the Body size should be with each call. The selected entry (#21) is much smaller than the others in size, but takes over 2 minutes to complete.

enter image description here

Oddly enough, this time I received an exception. The exception does not happen every time.

System.ServiceModel.CommunicationException: The server did not provide a meaningful reply; this might be caused by a contract mismatch, a premature session shutdown or an internal server error.

Server stack trace: 
   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs)
   at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
   at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

Exception rethrown at [0]: 
   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)

A few questions; are these requests parallel or sequential? Have you tried adjusting the Maximum worker Processes in the Advanced AppPool settings for this WCF Service in IIS? (This will turn it into a Web Garden). I've seen that communication error occur when I've overloaded what my service host can handle in simultaneous requests. — Kaiser12, Jun 10 '13 at 12:45
(1) These requests are sequential. One call will complete before the next one is invoked. (2) I have not tried adjusting the Maximum Worker Processes in the Advanced AppPool settings. Is this still applicable if the calls are sequential? Currently, this is set to 1. — John Russell, Jun 10 '13 at 12:49
are you able to remove a few more variables? eg reproduce it NOT on the VPN to rule that out. also, are you able to reproduce it using the *same* message ? — wal, Jun 10 '13 at 12:53
@wal, I installed and ran the application on the server itself and the query completed no problem. I also tried using a remote client on the same network. Again no problems. That points to some instability/hiccup in the VPN. But it's strange because it occurs so regularly. It's almost as if there's a connection reset that happens periodically. Nevertheless, I think our application needs to be able to handle this scenario gracefully. Does WCF offer any built-in mechanism to better handle a hiccup in connectivity? It seems that a slight hiccup can cripple a WCF call. — John Russell, Jun 10 '13 at 14:15
`Does WCF offer any built-in mechanism to better handle a hiccup in connectivity?` <--yes you could try `reliableSession` which will resend the entire message if its not received. i'm not a huge fan of this it does not work well for medium-to-large messages on slow connections but it *may* work for you. i'm tempted in your scenario to set up a socket server (using `TcpListener`) as a substitute for WCF at least in the short term to see if that works any better. (again may be a significant investment in time but will further isolate the issue) — wal, Jun 10 '13 at 14:39
you might want to paste your bindings (and your link in your post is dead too) — wal, Jun 10 '13 at 14:45
do you expect the failed message to be so much smaller than the others that work? could dynamic compresison be causing this? (can you disable this for a short period) — wal, Jun 10 '13 at 14:55
@wal, updated broken link which documented the approach we took to get the binary encoding over http. Also, pasted the customBinding definition we're using. — John Russell, Jun 10 '13 at 15:03
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/31531/discussion-between-wal-and-john-russell) — wal, Jun 10 '13 at 15:21
@JohnRussell: even though your requests are sequential, i hope you are cleaning up the proxies properly after you're done with them. It could possibly be related to higher generation GC being invoked in the server process. If you have 2 minute interval, you could try to obtain a IIS process memory dump when you see the delayed response. This can be inspected with VS2012/WinDBG to verify what's in memory or whether GC is being invoked. — YK1, Jun 10 '13 at 16:37
FWIW, the fact that ServerGotRequest > ServerBeginResponse suggests that the first connection failed (e.g. timed out) and the client resent the request on a fresh connection. — EricLaw, Jun 10 '13 at 16:53
On the client side, are you instantiating a new instances of the service proxy each time you call the service? And if so, are you closing the channels correctly? — Dan Ling, Jun 10 '13 at 16:56
@Dan Ling, I've tried both creating new proxies with each call and using the same proxy for each call, but it did not make a difference in this case. Either way, the proxy creation is wrapped in a using statement. Of course, this assumes the Dispose inside of ServiceBase closes channels correctly. — John Russell, Jun 10 '13 at 17:33
I suggest that you try setting transfermode to streamed to rule out the possibility that this is an issue related to memory pressure (since stream mode should cause wcf to utilize less memorY) — Dan Ling, Jun 10 '13 at 18:09
@YK1, I created a dump file during a delayed response and sent it to DebugDiag which generates a nifty report. The only thing that stuck out was what seems to be a significant number of HttpContext objects: Total 82 HttpContext objects. — John Russell, Jun 10 '13 at 18:13
@Dan Ling, I set the transferMode="Streamed" inside the httpTransport element on both client and server, and this actually did the trick! It's worked several times now even over the VPN connection. Do you think it was a memory issue? Why would this have made a difference? I'm completely new to this setting. Will this impact other services on this endpoint? If you want to rephrase your comment to an answer, I will accept it. — John Russell, Jun 10 '13 at 18:42
@JohnRussell, This setting exists at the binding level, so yes, it will affect all the other services on that endopint. I don't believe there is any way to set it at a more granular level. You'd have to create separate bindings for buffered & streamed. However, I don't see any drawback to using streamed if performance is good enough to meet your requirements. — Dan Ling, Jun 10 '13 at 19:20
@JohnRussell: Thats great you've got a solution. Now that you have a dump file, you can also investigate whether the memory pressure was created by your code in the first place. Each HttpContext represents a http request, but they might be just waiting there to get GC'ed. Also, you can run `!EEstack -EE` over your dump in WinDBG to see what all threads were doing. — YK1, Jun 10 '13 at 19:25

score 5 · Accepted Answer · answered Jun 10 '13 at 19:09

As I suggested in the comments, please try setting transfermode to streamed to rule out the possibility that this is an issue related to memory pressure (since stream mode should cause wcf to utilize less memory).

When I saw this issue suspected this might be the problem because it only seems to happen to you when making many service calls in rapid succession. In my experience this is often one of two problems: either the proxies aren't being properly closed from the client, or the server is running GC's due to memory pressure.

When the transfermode is buffered, WCF loads the entire dataset for the message response into memory before sending it back to the client. Streamed simply sends the data back without buffering. It tends to be much faster for large data sets, slightly slower for small data sets, and always utilizes less memory(on both the server and client).

score 2 · Answer 2 · answered Jun 10 '13 at 15:26

To find out what causes your timeouts you should trace what is going on in WCF. Adding the following to your config files will generate a trace file on your client and server:

<system.diagnostics>
   <sources>
       <source name="UserTraceSource" switchValue="Warning, ActivityTracing" >
          <listeners>
              <add name="xml"
                 type="System.Diagnostics.XmlWriterTraceListener"
                 initializeData="C:\logs\UserTraces.svclog" />
          </listeners>
       </source>
   </sources>
   <trace autoflush="true" /> 
</system.diagnostics>

Normally the file will tell you exactly what is going on and what fails. Make sure you have the C:\logs dir and the user has write permissions on the directory.

Configure wcf tracing

Thanks for the response. I enabled WCF Tracing, but I did not see any failures or warnings. — John Russell, Jun 10 '13 at 17:23

WCF - Random client timeout when making multiple calls

2 Answers2

Linked