
We have a WCF netTcp service that serves as an entry point for some legacy applications into a new service platform we have developed. The service itself isn't very interesting; it basically just has a one-line service operation that drops the request message in a mailbox for the new platform to process. The service generally works fine for weeks at a time, but sometimes after the service is restarted (it is self-hosted inside a larger Windows service) it will crash and refuse to start again unless the entire server is rebooted.
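To give a rough idea, the contract and implementation are essentially equivalent to the following (a simplified sketch; the operation name and the plain string request type are stand-ins for our actual contract):

open System.ServiceModel

// Simplified stand-in for the real contract; the actual request type is not a plain string.
[<ServiceContract>]
type IPublisherService =
    [<OperationContract(IsOneWay = true)>]
    abstract Publish : request: string -> unit

// A singleton instance of this class is passed to the ServiceHost constructor shown further
// below, which is why it is marked with InstanceContextMode.Single.
[<ServiceBehavior(InstanceContextMode = InstanceContextMode.Single)>]
type PublisherService(handler: string -> unit) =
    interface IPublisherService with
        // The "one-line" operation: hand the request off to the actor's mailbox.
        member this.Publish request = handler request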

We enabled WCF tracing in a non-production environment and were able to recreate the issue, so we have some complete trace logs of the problem, but it hasn't helped us identify the root cause so far.

The first exception we see is a SocketException with the message "An existing connection was forcibly closed by the remote host" and Native Error Code 2746. This occurs on SocketConnection.HandleReceiveAsyncCompleted. The next exception (immediately following in the trace) is a CommunicationException saying "The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '10675199.02:48:05.4775807'."

After this, any attempt to use the WCF ServiceHost object results in a CommunicationObjectFaultedException with the standard message "The communication object, System.ServiceModel.ServiceHost, cannot be used for communication because it is in the Faulted state." Restarting the WCF service or even the entire Windows service results in an immediate failure with the same exceptions.

Subsequent attempts to restart the WCF service automatically (using a new ServiceHost object) always encounter two other exceptions. The first is an AddressAlreadyInUseException saying "There is already a listener on IP endpoint 0.0.0.0:11300. This could happen if there is another application already listening on this endpoint or if you have multiple service endpoints in your service host with the same IP endpoint but with incompatible binding configurations." The second is an ArgumentException with the message "Instance 'PublisherService@net.tcp:||localhost:11300|PublisherService' already exists in CounterSet 'e829b6db-21ab-453b-83c9-d980ec708edd'. Parameter name: InstanceName."
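A quick way to confirm from managed code whether anything is actually still bound to the port when this happens is something along these lines (just a sketch using the standard System.Net.NetworkInformation API):

open System.Net.NetworkInformation

// Check whether any TCP listener is currently bound to the given port;
// 11300 matches the endpoint named in the error messages above.
let isPortListening port =
    IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpListeners()
    |> Array.exists (fun ep -> ep.Port = port)

printfn "Port 11300 listening: %b" (isPortListening 11300)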

For reference, the code inside the Windows service that manages this WCF service looks like this:

// Create a PublisherService to send WCF requests to the Publisher Actor's mailbox
let publisherService = PublisherService(fun request -> publisherActor <! request)

let rec initHost () =                        
    // Create a new WCF ServiceHost for the Publisher Service
    let host = new ServiceHost(publisherService, [| config.PublisherServiceUri |])
    host.AddServiceEndpoint(typeof<IPublisherService>.FullName, NetTcpBinding(SecurityMode.None), config.PublisherServiceUri) |> ignore

    // Log any WCF Service Faulted events and try to start a new ServiceHost
    host.Faulted |> Event.add (fun _ -> 
        log.errorf "Publisher Service '%s' faulted. Restarting in 10 seconds..." config.PublisherServiceUri
        // BUG: Always fails with CommunicationObjectFaulted exception -- probably shouldn't be here
        try host.Close ()
        with | ex -> log.errorx ex "Error closing faulted Service Host"
        // TODO: Do something smarter than just sleeping for 10 seconds before restarting the service host
        Async.Sleep 10000 |> Async.RunSynchronously
        // BUG:  This isn't tail-recursive, will probably blow the stack eventually
        initHost () |> ignore)

    // Start the WCF Service Host
    host.Open()
    log.debugf "Publisher Service '%s' started." config.PublisherServiceUri

initHost()

I realize this code has some problems, but I don't understand why even restarting the entire Windows service doesn't allow the WCF service to start up cleanly. Once the WCF service is running, it seems to be fine indefinitely, but if we restart it for any reason, there's a chance it will go into this state, after which it becomes completely unrecoverable without a server reboot. If anyone can offer any guidance on further troubleshooting the underlying issue, or knows of anything that may be causing this behavior, I would appreciate the assistance.

Aaron M. Eshbach
    The "There is already a listener on IP endpoint 0.0.0.0:11300" error suggests to me that the socket is still in the `TIME_WAIT` state, and you'll need to wait up to 4 minutes (usually) for the waiting socket to be closed. See https://stackoverflow.com/a/1436261/2314532 for some more details, and https://serverfault.com/a/329846/246397 has a pretty good explanation (it's Linux-based, but the underlying TCP stack should be basically the same on Windows since the TIME_WAIT behavior specified in the RFCs that both Linux and Windows have to implement). – rmunn Aug 28 '18 at 14:25
  • @rmunn Thanks, I'll look into that. – Aaron M. Eshbach Aug 28 '18 at 14:27
  • @rmunn Would I see the port 11300 in `TIME_WAIT` in TcpView if this were the problem? – Aaron M. Eshbach Aug 28 '18 at 20:15
  • Either `TIME_WAIT` or `CLOSE_WAIT`; according to https://superuser.com/a/173542/376367, the difference is that `TIME_WAIT` means "this end closed the connection", while `CLOSE_WAIT` means "the remote end closed the connection". Either way, the TCP stack has to wait four minutes before allowing that port to be reused. – rmunn Aug 28 '18 at 23:24
  • @rmunn I tested this in a development environment, and I did not see the port go to `TIME_WAIT` or `CLOSE_WAIT` in TcpView. It went directly from `LISTENING` to not appearing in the list. I also reviewed some logs from production and it appears that we have had gaps up to 25 minutes between the WCF service crashing and restarting the Windows service, but the problem still occurs immediately after the service is restarted. We have opened a ticket with Microsoft, and I will post the answer if they help us figure it out. – Aaron M. Eshbach Aug 29 '18 at 12:04

0 Answers