2

I have a Web Role deployed on Azure that opens a TCP connection to a remote server using a socket (C#). This connection MUST stay open at all times. After twenty minutes, the connection appears to be lost.

So I'm wondering whether a Web Role is designed to host such a connection, or can host one at all. Is there an automated process (recycling, for instance) that could close the TCP connection?

The running code works fine on my computer and worked well when I was using a 'standard' Windows Service on a dedicated server.

Thanks for your help, Jerome.

Jerome Thievent
  • sounds like a session timeout. why do you need an always on connection? – Kevin Cook Jul 24 '14 at 18:46
  • I'm connected to financial markets. – Jerome Thievent Jul 24 '14 at 18:49
  • It is not possible by principle to have a TCP connection guaranteed open (think network blip, app redeploy, OS reboot, crash, bug, ...). You *must* have a plan for the connection going down anyway. Probably, that would be a reconnect/retry strategy. – usr Jul 24 '14 at 19:09

3 Answers

4

You have to change your architecture when moving to the cloud. In the cloud, at least in the most commonly used ones, there is no such thing as 100% availability in the first place. So there is no "always open" connection.

Next, there are quite a lot of factors that could close your connection. Some you can control, others you cannot. Here is a non-exhaustive list:

  • Role recycling due to hardware failure (you have no control)
  • Role recycling due to OS update (you have control - set specific OS version, but not recommended)
  • Idle connection timeout (you have control - keep connection alive by sending packets at regular or irregular intervals, but do not keep connection idle more than 60 seconds)
  • something else
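One way to keep a connection from sitting idle is to let the TCP stack send keep-alive probes. The following is only a sketch, assuming a Windows host (the `IOControlCode.KeepAliveValues` call is Windows-specific) and illustrative timings of 30 seconds idle and 5 seconds between probes; the `KeepAlive` class and its method names are hypothetical, not something from the question.

```csharp
using System;
using System.Net.Sockets;

public static class KeepAlive
{
    // Builds the 12-byte structure expected by SIO_KEEPALIVE_VALS:
    // three 32-bit unsigned integers in native byte order
    // (on/off flag, idle time in ms, interval between probes in ms).
    public static byte[] BuildValues(bool enabled, uint idleMs, uint intervalMs)
    {
        var values = new byte[12];
        BitConverter.GetBytes(enabled ? 1u : 0u).CopyTo(values, 0);
        BitConverter.GetBytes(idleMs).CopyTo(values, 4);
        BitConverter.GetBytes(intervalMs).CopyTo(values, 8);
        return values;
    }

    // Turns on TCP keep-alive for the socket and applies the timings.
    // The default timings are assumptions; pick values below the idle
    // timeout that applies to your deployment.
    public static void Enable(Socket socket, uint idleMs = 30000, uint intervalMs = 5000)
    {
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
        socket.IOControl(IOControlCode.KeepAliveValues, BuildValues(true, idleMs, intervalMs), null);
    }
}
```

Calling `KeepAlive.Enable(socket)` right after connecting would be the typical use. An application-level heartbeat message achieves the same goal and also tells you that the remote application, not just its TCP stack, is still responding.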

In order to give all users an equal opportunity to use cloud resources, every cloud provider tries to manage these resources in the best possible way. One such resource is TCP sockets.

With Azure, if a connection (any connection) is idle for more than XXX seconds, it is terminated. (This number has changed over time, so I don't want to quote a specific value; just assume it is 1 minute.)

In the end, there are too many factors in the cloud that can close your connection, so start thinking the cloud way: implement retry logic in your protocols, and implement healing logic in your service. Last but not least, never use a single-instance role if you want high availability.

An interesting read on connection timeouts can be found here. Although it only covers connections within the same datacentre (between roles and VMs), it is still worth reading.

As for the interesting update you added, "I am connected to financial markets": even so, you have to weigh 100% uptime against 99.95% uptime, which is the standard SLA for Web Roles with 2 instances. I never said it is easy, but you could achieve 99.95% availability with a minimum of 2 instances of a web role (or worker role) and some kind of watchdog that monitors the connection. Always keep only one connection; if it goes down, open another one immediately (as soon as the failure is detected). Keep the required data in a Redis cache, for instance, or Azure Cache, or In-Role Cache configured for high availability.
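The watchdog and retry idea can be sketched as a loop that reconnects with exponential backoff whenever the connection drops. This is a minimal illustration under assumptions, not a production design: the `ReconnectingClient` class, the `session` callback, and the 1 second and 30 second delay bounds are all hypothetical.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class ReconnectingClient
{
    // Exponential backoff: baseDelay * 2^attempt, capped at maxDelay.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        int exponent = Math.Min(Math.Max(attempt, 0), 20);
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, exponent);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    // Connects, hands the stream to the caller's session logic, and
    // reconnects whenever the session ends or a network error occurs.
    // Cancellation raised inside the session propagates to the caller.
    public static async Task RunAsync(
        string host, int port,
        Func<NetworkStream, CancellationToken, Task> session,
        CancellationToken ct)
    {
        int attempt = 0;
        while (!ct.IsCancellationRequested)
        {
            try
            {
                using (var client = new TcpClient())
                {
                    await client.ConnectAsync(host, port);
                    attempt = 0; // connected: reset the backoff
                    await session(client.GetStream(), ct);
                }
            }
            catch (SocketException) { attempt++; }
            catch (IOException) { attempt++; }

            try
            {
                // Illustrative bounds: 1 second base delay, 30 second cap.
                await Task.Delay(
                    BackoffDelay(attempt, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30)), ct);
            }
            catch (OperationCanceledException) { return; }
        }
    }
}
```

The `session` callback is where the market-data protocol and its heartbeat would live. Resetting the backoff after a successful connect keeps the time-to-recover short for isolated drops, while repeated failures back off so a struggling remote server is not hammered.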

There are solutions for high availability in the cloud. But if you look for 100%, this is not your place. There is no 100% SLA in the cloud.

astaykov
  • Thanks for your reply. I have a heartbeat mechanism to keep the connection up. I really need an always-open connection and I must be sure the connection or the service will not go down (except if there is a bug or crash in my application). I don't want to depend on external factors. So maybe the cloud is not the best option for me... – Jerome Thievent Jul 24 '14 at 18:55
  • I really challenge you to question the 100% availability vs. retry logic and minimizing TTR - time-to-recover after failure. You may find a cloud vendor that offers 100% SLA, but when you compare the price, you will really want to come back to Azure ;) You will always depend on external factors. You just fool yourself if you think that you can create 100% uptime system on your own for a reasonable price - reasonable to face the revenue it will generate. – astaykov Jul 24 '14 at 18:57
  • I know :) But I'm very new to the cloud and I thought a cloud service behaved the same as a standard Windows Service. I'm going to continue my investigations and I'll post my response once I'm sure where the problem is. – Jerome Thievent Jul 24 '14 at 19:04
  • "except if there is a bug or crash in my application" why would the business be ok with the connection being down due to a bug but not ok with a network failure or such. You simply must state to them that it is impossible to deliver 100% uptime. Find a way to cope with that fact. – usr Jul 24 '14 at 19:11
  • @JeromeThievent which is that standard windows service that gives you 100% uptime??? – astaykov Jul 25 '14 at 06:20
  • Do not forget the IIS App pool recycling every 29 hours :) – Ognyan Dimitrov Jul 25 '14 at 20:04
  • @astaykov, I'm not expecting 100% uptime with a Windows service, but what I want to avoid is losing my TCP connection every XX min. It seems there is a way to disable app pool recycling: http://stackoverflow.com/questions/18089487/disable-iis-idle-timeouts-in-azure-web-role. I'm going to try it. – Jerome Thievent Jul 26 '14 at 08:01
1

I suspect you're being disconnected because of the load balancer used in Azure. It used to disconnect idle connections after a minute, but I believe that this was changed to be 20 minutes (I can look for a reference for this later and update this answer accordingly).

The important part to note here is that it's only idle connections it terminates, so if you are using the connection it shouldn't disconnect you (although I have a sneaking suspicion that there may have been a maximum connection time as well).

Also note that by default IIS in an Azure role will recycle the app pool every 29 hours (1740 minutes). This can be changed by adjusting the IIS settings in a startup script.

Also, any instance in Azure can be recycled at any time. It doesn't happen very often, but you can't stop it from happening. Your web role does receive an event to say this is happening, though, if you need to take some kind of action.

All this adds up to the fact that your remote server will need to be more flexible with the way it deals with this connection if you want to host in Azure.

knightpfhor
  • I found the source of the problem in the Event Viewer of my VM. At the exact time I lost the TCP connection there is this event: "A worker process with process id of '3724' serving application pool 'd2995196-148a-4f49-93d9-b9e110188cc8' was shutdown due to inactivity. Application Pool timeout configuration was set to 20 minutes. A new worker process will be started when needed.". It seems the app pool recycles services after 20 minutes. I'll try with a Worker Role instead of a Web Role. – Jerome Thievent Jul 24 '14 at 19:47
0

I've finally found a solution that matches my requirements. My TCP connection was closed due to role recycling. Following the solution described here: Disable IIS Idle Timeouts in Azure Web Role, I was able to configure the IIS AppPool when I deploy my service.

So basically I need to disable:

  • idle timeout (20 minutes by default)
  • periodic recycling (set to 29 hours by default)
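For readers who want to see roughly what disabling these two settings can look like, here is a sketch that uses `Microsoft.Web.Administration` from the role's `OnStart`. It is an illustration under assumptions, not necessarily what the linked answer does: it assumes the role runs elevated (`<Runtime executionContext="elevated" />` in ServiceDefinition.csdef), that the IIS app pools already exist when `OnStart` runs, and that changing every pool on the instance is acceptable.

```csharp
using System;
using Microsoft.Web.Administration;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Requires elevated execution; otherwise the IIS configuration
        // cannot be modified from role code.
        using (var serverManager = new ServerManager())
        {
            foreach (ApplicationPool pool in serverManager.ApplicationPools)
            {
                // TimeSpan.Zero disables the idle timeout (20 minutes by default).
                pool.ProcessModel.IdleTimeout = TimeSpan.Zero;
                // TimeSpan.Zero disables periodic recycling (29 hours by default).
                pool.Recycling.PeriodicRestart.Time = TimeSpan.Zero;
            }
            serverManager.CommitChanges();
        }
        return base.OnStart();
    }
}
```

Even with both settings disabled, the instance itself can still be recycled by the platform, so reconnect logic remains necessary.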

Thanks all!

Jerome Thievent