
Our issue is that Azure App Service (S3 x 5 Instances) is not evenly distributing requests across the 5 instances. The result is that one instance is getting swamped with requests and our overall P50 & P95 response time SLA for that app service is being breached.
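(For reference, a per-instance latency breakdown can be pulled from the same App Insights requests table to confirm that the swamped instance is what drags the P50/P95. This is only a rough sketch; the 7-day window and the redacted URL filter are placeholders.)

requests
| where timestamp > ago(7d)                    // arbitrary window; match it to the SLA reporting period
| where url contains "**omitted**"
| summarize p50_ms = percentile(duration, 50),
            p95_ms = percentile(duration, 95),
            request_count = sum(itemCount)
            by cloud_RoleInstance
| order by p95_ms desc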

I've confirmed that the App Service has ARR Affinity turned off. It's a completely stateless web API so there's nothing inherently sticky about it.

Tech details are below, but the question is essentially this:

Why isn't Azure evenly distributing/round-robining my traffic across all 5 instances?

As it stands, scaling up or out doesn't seem to make sense here because I just end up with additional expensive instances sitting idle while 1 instance gets swamped.

Technical Details

The following 2 charts from App Insights, from June 1st and June 25th, show the issue.

requests
| where timestamp > datetime("2020-06-25 00:00:00")
| where timestamp < datetime("2020-06-25 08:00:00")
// comparison between 00:00-08:00 on June 1st vs. today
| where url contains "**omitted**"
| project cloud_RoleInstance, itemCount, timestamp = bin(timestamp, 1h)
| evaluate pivot(cloud_RoleInstance, sum(itemCount))
| render timechart

This first image below shows the traffic distribution on June 1st. It's not perfectly distributed, but it's close: the 3rd instance is taking on about 50% more traffic than the 5th.

Instance    1         2         3         4         5
Requests    34,708    26,436    38,313    30,617    24,355
Share       22%       17%       25%       20%       16%

June 1st

This next image below shows the traffic distribution for the same time frame this morning... The 4th instance is handling roughly 2.5x the traffic of the next closest instance and over 7x the traffic of instance 1.

Instance    1         2         3         4         5
Requests    11,980    21,671    34,180    85,041    24,508
Share       7%        12%       19%       48%       14%

June 25th
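A variant of the query above (a sketch only; it assumes the same requests table and the same redacted URL filter) turns the raw counts into each instance's daily share of traffic, which makes the skew easier to track over a longer window:

requests
| where timestamp > ago(30d)
| where url contains "**omitted**"
| summarize instance_requests = sum(itemCount) by cloud_RoleInstance, day = bin(timestamp, 1d)
| join kind=inner (
    requests
    | where timestamp > ago(30d)
    | where url contains "**omitted**"
    | summarize day_total = sum(itemCount) by day = bin(timestamp, 1d)
) on day
| extend share_pct = round(100.0 * instance_requests / day_total, 1)
| project day, cloud_RoleInstance, share_pct
| evaluate pivot(cloud_RoleInstance, sum(share_pct))
| render timechart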

Eoin Campbell

1 Answer


Unfortunately you do not have any power over the load balancer used when you scale out your applications. It is not configurable and, as far as I know, is supposed to send requests to instances randomly.

That said, judging by the attached graphs, your distribution is quite balanced in the first one. The second day you presented clearly shows an issue, but I can imagine that this could be only temporary.

Randomness is a matter of statistics, and statistically it is possible for more requests to go to one of your instances in small time windows (limited sampling).

I would suggest that you collect more samples regarding the load balancing, because only two days is not enough. I am pretty sure that the more data you collect, the more you will see the curves converge.
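As a rough illustration (purely simulated data, not your actual traffic), uniformly random assignment across 5 instances converges to roughly 20% per instance once the sample is large enough:

range i from 1 to 1000000 step 1
| extend instance = toint(rand(5))             // uniform random assignment to one of 5 simulated instances (0-4)
| summarize simulated_requests = count() by instance
| extend share_pct = round(100.0 * simulated_requests / 1000000, 2)
| order by instance asc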

I can understand that the SLA is a problem, and for that I would suggest upgrading to another tier so your requests are served faster.

Stelios Giakoumidis
  • I don't believe this is statistics or randomness, and I have plenty of other data points to back this up. I can run the above query for every day at any granularity over a 30-day period. Over literally hundreds of millions of requests there is no discernible reason that this level of disparity should exist. It should converge; it does not. It remains divergent. – Eoin Campbell Jun 25 '20 at 14:14
  • @EoinCampbell I think that you misunderstood me. What I said is that the load balancer is distributing the requests in a random manner. Since randomness in computer science does not exist (and there are no specific details in the documentation on how exactly the traffic is managed), I would expect disparity to exist over some time periods. But it is hard for me to believe that a request distribution graph with such a disparity would persist if you plot requests over the last month or two, for example (a longer scale). – Stelios Giakoumidis Jun 26 '20 at 06:42
  • :facepalm: I'm aware of the concepts of randomness vs. pseudo-randomness in computer science; this has nothing to do with that. The Azure load balancer "product" lets you choose between various modes, e.g. hash mode vs. source IP affinity mode. See here: https://learn.microsoft.com/bs-latn-ba/azure/load-balancer/load-balancer-distribution-mode. My only assumption here is that the LB tech in front of App Services is the same as the LB tech that's offered as a product. (... 1/3) – Eoin Campbell Jun 26 '20 at 09:14
  • For the volume of traffic we are getting from a wide variety of end-users all over the world, even with session-affinity LB'ing there is no reason the distribution should look like that. Re: your comment on having more "samples", I have tonnes of data. Here's what it looks like over 90 days at 1-day granularity, across half a billion requests. See how it works sometimes and doesn't work other times; nothing eventually "converges": https://imgur.com/a/Ucb2cAJ (... 2/3) – Eoin Campbell Jun 26 '20 at 09:15
  • As for "scaling up to another tier": with the information presented, that would be a massive waste of money. Right now we have 5 machines where 1 is really busy and 4 are not. Why would I spend more money scaling all instances to a more expensive tier only to have them sit idle at that more expensive scale? (3/3) – Eoin Campbell Jun 26 '20 at 09:17
  • 1) I urge you to read the answer carefully before replying. I said "Unfortunately you do not have any power over the load balancer used when you scale out your applications." because you cannot configure the load balancer in scaled-out applications! I am fully aware of the load balancer "product" and that it is manageable in several ways. But when you say "Azure App Service (S3 x 5 Instances)" it implies that you have a scaled-out app, not 5 instances behind a load balancer "product". – Stelios Giakoumidis Jun 26 '20 at 10:17
  • 2) If you scale up your application you will NOT need 5 instances; you will need fewer, of course! So it is not a waste of money, it is a matter of balancing your load onto fewer, better-resourced servers. – Stelios Giakoumidis Jun 26 '20 at 10:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216712/discussion-between-stelios-giakoumidis-and-eoin-campbell). – Stelios Giakoumidis Jun 26 '20 at 10:25
  • Was there ever a solution to this? I can't see how the discussion that moved out of the post finished... I seem to be getting a similar issue: 6 instances, and it was 10 minutes under high load before a second instance started working, then a further 25 minutes before a third instance started working... 3 don't appear to have been used at all... We're using service endpoints and a vnet... ARR affinity is switched off... – Daniel Jun 05 '22 at 16:36