We have a WCF netTcp service that serves as an entry-point for some legacy applications into a new service platform we have developed. The service itself isn't very interesting, it basically just has a one-line service operation that drops the request message in a mailbox for the new platform to process. The service generally works fine for weeks at a time, but sometimes after the service is restarted (it is self-hosted inside a larger windows service) it will crash and refuse to restart again unless the entire server is rebooted.
We enabled WCF tracing in a non-production environment and were able to recreate the issue, so we have some complete trace logs of the problem, but it hasn't helped us identify the root cause so far.
The first exception we see is a SocketException
with the message An existing connection was forcibly closed by the remote host
and Native Error Code 2746
. This occurs on SocketConnection.HandleReceiveAsyncCompleted
. The next exception (immediately following in the trace) is a CommunicationException
saying The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '10675199.02:48:05.4775807'.
After this, any attempt to use the WCF ServiceHost
object results in a CommunicationObjectFaultedException
with the standard The communication object, System.ServiceModel.ServiceHost, cannot be used for communication because it is in the Faulted state
message. Restarting the WCF service or even the entire windows service results in an immediate failure with the same exceptions.
Subsequent attempts to restart the WCF service automatically (using a new ServiceHost
object) always encounter two other exceptions. The first is an AddressAlreadyInUseException
saying There is already a listener on IP endpoint 0.0.0.0:11300. This could happen if there is another application already listening on this endpoint or if you have multiple service endpoints in your service host with the same IP endpoint but with incompatible binding configurations.
The second is an ArgumentException
with the message Instance 'PublisherService@net.tcp:||localhost:11300|PublisherService' already exists in CounterSet 'e829b6db-21ab-453b-83c9-d980ec708edd'.
Parameter name: InstanceName
.
For reference, the code inside the Windows service that manages this WCF service looks like this:
// Create a PublisherService to send WCF requests to the Publisher Actor's mailbox
let publisherService = PublisherService(fun request -> publisherActor <! request)
let rec initHost () =
// Create a new WCF ServiceHost for the Publisher Service
let host = new ServiceHost(publisherService, [| config.PublisherServiceUri |])
host.AddServiceEndpoint(typeof<IPublisherService>.FullName, NetTcpBinding(SecurityMode.None), config.PublisherServiceUri) |> ignore
// Log any WCF Service Faulted events and try to start a new ServiceHost
host.Faulted |> Event.add (fun _ ->
log.errorf "Publisher Service '%s' faulted. Restarting in 10 seconds..." config.PublisherServiceUri
// BUG: Always fails with CommunicationObjectFaulted exception -- probably shouldn't be here
try host.Close ()
with | ex -> log.errorx ex "Error closing faulted Service Host"
// TODO: Do something smarter than just sleeping for 10 seconds before restarting the service host
Async.Sleep 10000 |> Async.RunSynchronously
// BUG: This isn't tail-recursive, will probably blow the stack eventually
initHost () |> ignore)
// Start the WCF Service Host
host.Open()
log.debugf "Publisher Service '%s' started." config.PublisherServiceUri
initHost()
I realize this code has some problems, but I don't understand why even restarting the entire Windows Service doesn't allow the WCF service to start-up cleanly. Once the WCF service is running, it seems to be fine indefinitely, but if we restart it for any reason, there's a chance it will go into this state, after which it becomes completely unrecoverable without a server reboot. If anyone could offer any guidance on further troubleshooting the underlying issue, or knows of anything that may be causing this behavior, I would appreciate the assistance.