
I’m investigating a scenario with a live dashboard (an Angular web app) that refreshes by polling every 5 seconds. The API sits behind Azure Traffic Manager, which will fail over to a second region if the primary region fails. Keep in mind that Azure Traffic Manager works at the DNS level.

The problem I am facing is that the browser maintains a persistent connection to the primary region even after the Traffic Manager has failed over. The requests initially fail with 503s, but then continue to fail with 502s. The DNS lookup is never performed again as the requests occur more frequently than the keep-alive timeout. This causes the browser to continue to make requests to the failed region.

Is there any way to explicitly kill the connection to force a DNS lookup? The only ways I’ve found so far are to stop making requests for 2 minutes, or to close and reopen the browser. Neither is an acceptable solution for a dashboard that is supposed to be hands-off and always fresh.
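To make the "stop making requests for 2 minutes" workaround concrete, here is a minimal sketch of a polling schedule that pauses after repeated failures so the connection can go idle. It only encodes the observation above and is not a confirmed fix. The names `nextDelayMs` and `pollForever` and the failure threshold of 3 are assumptions made for illustration. Only the 5-second interval and the 2-minute pause come from the question.

```typescript
// Hypothetical names and threshold; only the 5 s interval and the
// 2 min idle period come from the behaviour described in the question.
const POLL_INTERVAL_MS = 5_000;   // normal dashboard refresh
const IDLE_PAUSE_MS = 120_000;    // long enough for the connection to go idle
const FAILURE_THRESHOLD = 3;      // assumed: consecutive failures before pausing

// Returns how long to wait before the next poll, given the number of
// consecutive failed requests (502/503 responses or network errors).
function nextDelayMs(consecutiveFailures: number): number {
  // Pause whenever the failure count reaches a multiple of the threshold,
  // so a long outage produces repeated idle periods, not just one.
  const shouldPause =
    consecutiveFailures > 0 && consecutiveFailures % FAILURE_THRESHOLD === 0;
  return shouldPause ? IDLE_PAUSE_MS : POLL_INTERVAL_MS;
}

// Minimal polling loop around a caller-supplied request function.
async function pollForever(
  request: () => Promise<boolean>, // resolves true on a successful refresh
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<never> {
  let failures = 0;
  for (;;) {
    const ok = await request().catch(() => false);
    failures = ok ? 0 : failures + 1;
    await sleep(nextDelayMs(failures));
  }
}
```

With these values the dashboard would show stale data for roughly two minutes per pause, which is the drawback described above.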

What’s interesting is that after getting the browser to fail over to the secondary region, if I restart the primary region, the browser automatically switches back to the primary region after about a minute. This tells me the connection respects the DNS TTL when the service is functioning properly, but not when the server is unavailable. It makes no sense to me that the browser would lock onto a single IP forever when the server behind it is unavailable.

Is there something I am missing about implementing georedundant failover with Traffic Manager for a web application? It seems very odd to me that the user would have to stop making requests for 2 minutes in any scenario before the browser re-resolves the IP to the failed-over server. Is one expected to turn off keep-alive to truly support near-instant failover?

Here's a diagram that describes this scenario: [diagram]

Nancy
Eric Marunde

1 Answer


Generally, Azure Traffic Manager works at the DNS level. Clients connect to the service endpoint directly, not through Traffic Manager. Traffic Manager has no way to track individual clients and cannot implement 'sticky' sessions.

  • For the performance impact of the initial DNS lookup, see the explanations here and here:

DNS name resolution is fast and results are cached. The speed of the initial DNS lookup depends on the DNS servers the client uses for name resolution. Typically, a client can complete a DNS lookup within ~50 ms. The results of the lookup are cached for the duration of the DNS Time-to-live (TTL). The default TTL for Traffic Manager is 300 seconds.

The TTL value of each DNS record determines the duration of the cache. Shorter values result in faster cache expiry, and longer values mean that it can take longer to direct traffic away from a failed endpoint. Traffic Manager allows you to configure the TTL as low as 0 seconds and as high as 2,147,483,647 seconds. You can choose the value that best balances the needs of your application.
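As a rough illustration of the caching behaviour described above, the following sketch models a resolver cache that reuses an answer until its TTL expires. It is not how any particular browser or operating system implements DNS caching. `TtlCache`, the injected `resolve` function, and the injected clock are hypothetical names used only for this example.

```typescript
// Minimal sketch of TTL-based caching of a DNS answer.
// All names here are hypothetical; `resolve` stands in for a real DNS query.
type CacheEntry = { address: string; expiresAtMs: number };

class TtlCache {
  private entries = new Map<string, CacheEntry>();
  private resolve: (host: string) => string;
  private ttlSeconds: number;
  private now: () => number;

  constructor(
    resolve: (host: string) => string,
    ttlSeconds: number,
    now: () => number = () => Date.now(),
  ) {
    this.resolve = resolve;
    this.ttlSeconds = ttlSeconds;
    this.now = now;
  }

  lookup(host: string): string {
    const cached = this.entries.get(host);
    if (cached && this.now() < cached.expiresAtMs) {
      return cached.address; // still within the TTL: no new query is made
    }
    // First lookup, or the TTL has expired: query again and cache the answer.
    const address = this.resolve(host);
    this.entries.set(host, {
      address,
      expiresAtMs: this.now() + this.ttlSeconds * 1000,
    });
    return address;
  }
}
```

In this model, a change in the answer returned by `resolve` becomes visible to callers only after the cached entry expires, which is why a longer TTL delays the move away from a failed endpoint.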

  • As described above, if you want clients to pick up DNS changes faster, you can set the TTL value as low as possible. Once the connection is set up, the client stays connected to the selected endpoint until the health check finds the endpoint unhealthy.
  • You can enable and disable Traffic Manager profiles and endpoints. However, a change in endpoint status might also occur as a result of Traffic Manager's automated settings and processes. Get more details here.
  • For Geographic routing method,

The endpoint mapped to serve the geographic location based on the query request IP’s is returned. If that endpoint is unavailable, another endpoint will not be selected to failover to, since a geographic location can be mapped only to one endpoint in a profile (more details are in the FAQ). As a best practice, when using geographic routing, we recommend customers to use nested Traffic Manager profiles with more than one endpoint as the endpoints of the profile.
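The nested-profile recommendation can be pictured with a small model. This is a simplified sketch, not an Azure SDK or a Traffic Manager configuration: the types and the `resolveGeo` function are hypothetical, and it assumes each child profile lists its endpoints in priority order and returns the first healthy one.

```typescript
// Hypothetical model of geographic routing with nested profiles.
// None of these types belong to an Azure SDK.
type Endpoint = { name: string; healthy: boolean };
type ChildProfile = { name: string; endpoints: Endpoint[] }; // priority order

// Parent profile: each geographic region maps to exactly one child profile.
type GeoProfile = Map<string, ChildProfile>;

// Returns the endpoint name chosen for a region, or undefined when the region
// has no mapping or (in this simplified model) no endpoint is healthy.
function resolveGeo(profile: GeoProfile, region: string): string | undefined {
  const child = profile.get(region);
  return child?.endpoints.find((endpoint) => endpoint.healthy)?.name;
}
```

Because the region maps to a child profile rather than to a single endpoint, an unhealthy first endpoint still leaves a second endpoint to fall back to within the same geographic mapping.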

Nancy
  • Thank you for your reply. However, the problem I am seeing is that once the server is unavailable, the client never performs a DNS lookup again (regardless of TTL). I've waited upwards of 40 minutes with a TTL of 30 seconds. This only happens when the connection is not allowed to go idle (due to polling) AND the server is unavailable (503/502 errors). Note: the DNS TTL works as expected when the server is online (even when the connection remains open due to polling). – Eric Marunde Mar 14 '19 at 12:20
  • No idea. Perhaps you could share some Traffic Manager logs for further troubleshooting, or open a support ticket. – Nancy May 03 '19 at 07:02