
I'm load testing my auto-scaling AWS ECS Fargate stack, which consists of:

  • Application Load Balancer (ALB) with a target group pointing to ECS,
  • ECS Cluster, Service, Task, ApplicationAutoScaling::ScalableTarget, and ApplicationAutoScaling::ScalingPolicy,
  • the Application Auto Scaling policy defines a target tracking policy (see the sketch after this list):
    • type: TargetTrackingScaling,
    • PredefinedMetricType: ALBRequestCountPerTarget,
    • target value (threshold) = 1000 requests per target,
    • the alarm triggers when 1 datapoint breaches the threshold within a 1-minute evaluation period.
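
For reference, the relevant scaling configuration looks roughly like this. It's a minimal boto3 sketch equivalent to my CloudFormation resources, not the actual template (that one is in the linked question); the cluster/service names and the ALB ResourceLabel are placeholders:

```python
import boto3

# boto3 equivalent of the ApplicationAutoScaling::ScalableTarget and
# ApplicationAutoScaling::ScalingPolicy resources in my CloudFormation stack.
# Cluster/service names and the ALB ResourceLabel are placeholders.
client = boto3.client("application-autoscaling")

client.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-ecs-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)

client.put_scaling_policy(
    PolicyName="alb-requests-per-target-per-minute",
    ServiceNamespace="ecs",
    ResourceId="service/my-ecs-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # 1000 requests per target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # format: app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id> (placeholder)
            "ResourceLabel": "app/my-alb/1234567890abcdef/targetgroup/my-tg/abcdef1234567890",
        },
    },
)
```

The AlarmHigh alarm that shows up later in the timeline (TargetTracking-service/...-AlarmHigh-...) is created automatically by this target tracking policy; I don't control its period or evaluation settings directly.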

This all works fine. The alarms do get triggered and I see the scale out actions happening. But it feels slow to detect the "threshold breach". This is the timing of my load test and AWS events (collated here from different places in the JMeter logs and the AWS console):

10:44:32 start load test (this is the first request timestamp entry in JMeter logs)
10:44:36 4 seconds later (in the JMeter logs), we see that the load test reaches its 1000th request to the ALB. At this point in time, we're above the threshold and waiting for AWS to detect that...
10:46:10 1m34s later, I can finally see the spike show up in the alarm graph on the CloudWatch alarm detail page, BUT the alarm is still in the OK state!
    NOTE: notice the 1m34s delay in detecting the spike; if it gets a datapoint every 60 seconds, it should take at most 60 seconds to detect it: my load test blasts out 1000 requests every 4 seconds!!
10:46:50 the alarm finally goes from OK to ALARM state
    NOTE: at this point, we're 2m14s past the moment when requests started pounding the server at a rate of 1000 requests every 4 seconds!
    NOTE: 3 seconds later, after the alarm finally went off, the "scale out" action gets called (awesome, that part is quick):
10:46:53 Action: Successfully executed action arn:aws:autoscaling:us-east-1:MYACCOUNTID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster/my-service:policyName/alb-requests-per-target-per-minute:createdBy/ffacb0ac-2456-4751-b9c0-b909c66e9868
After that, I follow the actions in the ECS "events tab":
10:46:53 Message: Successfully set desired count to 6. Waiting for change to be fulfilled by ecs. Cause: monitor alarm TargetTracking-service/my-ecs-cluster-cce/my-service-AlarmHigh-fae560cc-e2ee-4c6b-8551-9129d3b5a6d3 in state ALARM triggered policy alb-requests-per-target-per-minute
10:47:08 service my-service has started 5 tasks: task 7e9612fa981c4936bd0f33c52cbded72 task e6cd126f265842c1b35c0186c8f9b9a6 task ba4ffa97ceeb49e29780f25fe6c87640 task 36f9689711254f0e9d933890a06a9f45 task f5dd3dad76924f9f8f68e0d725a770c0.
10:47:41 service my-service registered 3 targets in target-group my-tg
10:47:52 service my-service registered 2 targets in target-group my-tg
10:49:05 service my-service has reached a steady state.
    NOTE: starting the tasks took 33 seconds, which is very acceptable because I set HealthCheckGracePeriodSeconds to 30 seconds and the health check interval is 30 seconds as well.
    NOTE: 3m09s between the time the load started pounding the server and the time the first new ECS tasks are up
    NOTE: most of this time (3m09s) is spent waiting for the alarm to go off (2m20s)!! The rest is normal: waiting for the new tasks to start.

Q1: Is there a way to make the alarm trigger faster and/or as soon as the threshold is breached? To me, this is taking 1m20s too long. It should really scale up in around 1m30s (1 minute max to detect the AlarmHigh state + 30 seconds to start the tasks)...

Note: I documented my CloudFormation stack in this other question I opened today: Cloudformation ECS Fargate autoscaling target tracking: 1 custom alarm in 1 minute: Failed to execute action

Pierre

1 Answer


You can't do much about it. ALB sends metrics to CloudWatch at 1-minute intervals. Also, these metrics are not real-time anyway, so delays are expected, even up to a few minutes, as explained by AWS support and reported in the comments here:

Some delay in metrics is expected, which is inherent for any monitoring system, as they depend on several variables such as delay with the service publishing the metric, propagation delays and ingestion delay within CloudWatch, to name a few. I do understand that a consistent 3 or 4 minute delay for ALB metrics is on the higher side.

You would either have to over-provision your ECS service to sustain the increased load until the alarm fires and the scale-out executes, or reduce your thresholds.
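
If you go the over-provisioning route, the simplest lever is the minimum capacity of the scalable target, so enough tasks are already running before the alarm ever fires. A rough boto3 sketch (cluster/service names and capacities are placeholders; pick numbers that match your traffic):

```python
import boto3

# Keep a higher baseline of tasks running so the service can absorb the
# spike during the few minutes it takes the ALB metric and alarm to react.
# Cluster/service names and capacities below are placeholders.
client = boto3.client("application-autoscaling")

client.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-ecs-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=3,   # over-provisioned baseline instead of 1
    MaxCapacity=10,
)
```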

Alternatively, you can publish your own custom metrics, e.g. from your app. These can be high-resolution metrics with intervals as short as 1 second. Your app could also "manually" trigger the alarm. This would allow you to reduce the delay you observe.
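
A sketch of what that could look like with boto3 (the namespace, metric name and alarm name below are made up; you would point a scaling policy or alarm at this metric instead of the ALB one):

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a high-resolution (1-second) custom metric from the app.
# The "MyApp" namespace and "RequestCount" metric name are made-up examples.
def publish_request_count(count: int) -> None:
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[{
            "MetricName": "RequestCount",
            "Timestamp": datetime.datetime.now(datetime.timezone.utc),
            "Value": float(count),
            "Unit": "Count",
            "StorageResolution": 1,  # high-resolution metric
        }],
    )

# Or "manually" flip an alarm that your scaling policy listens to.
# "my-scale-out-alarm" is a placeholder alarm name.
def force_scale_out() -> None:
    cloudwatch.set_alarm_state(
        AlarmName="my-scale-out-alarm",
        StateValue="ALARM",
        StateReason="Request rate exceeded threshold (set by the application)",
    )
```

A high-resolution alarm on such a metric can use 10- or 30-second periods, which is where most of the delay reduction comes from. Note that set_alarm_state only nudges the alarm; CloudWatch will re-evaluate it on its next cycle.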

Marcin
  • critical part: the comment in the linked S-O question: "ALB metric delay is due to an Ingestion delay time of 3 minutes and this delay cannot be reduced at this stage" + "AWS is working on it". THAT is definitely what I'm experiencing: almost a 3 minutes delay. Which is bad because many people will code this themselves: poll the ALB to try and get a "number of requests in past minute". Or code that inside their apps (count requests), which is something that should DEFINITELY be AWS-provided at the ALB level (access logs/metrics), not coded inside our apps... Great links! Thanks! – Pierre May 03 '21 at 20:02
  • For the record, in the Azure docs (to compare with AWS), they mention: "For Web Apps, the averaging period is much shorter, allowing new instances to be available in about five minutes after a change to the average trigger measure". Feels like AWS could win this. Feels like most of us could code it to scale out faster (if we could parse "fresh" access logs). Come on AWS! :) https://learn.microsoft.com/en-us/azure/architecture/best-practices/auto-scaling – Pierre May 03 '21 at 20:03
  • yes, of course, I just want to leave it open for a couple more days to incite more responses. Maybe from AWS who could maybe confirm that they are working on it and maybe get an ETA so that we don't all have to code this ourselves (they already provide ALB metrics, already push it to CloudWatch, and already have Cloudwatch alarms triggering ECS actions: they just need to make it higher resolution to avoid 1min wait for ALB metric to reach CW + another 1min wait for CW to trigger the alarm). – Pierre May 04 '21 at 14:19
  • @Pierre Yes, no problem. Thanks. – Marcin May 04 '21 at 23:20