Cloudformation ECS Fargate autoscaling target tracking: 1 custom alarm in 1 minute: Failed to execute action

Question

I can set up the following through the console, and I have it as a CloudFormation template as well:

a scalable target associated to my ALB,
a CPU target tracking scaling policy,
an ALBRequestCountPerTarget target tracking policy.

This all works really well. Creating the policies in my CloudFormation template also takes care of creating the associated scale-out and scale-in alarms.

The problem: the auto-created alarms only trigger after the metric has breached the threshold in 3 consecutive periods of 60 seconds. So, if a sudden load comes in, it takes 3 minutes for the ECS service to scale out. This is much too long for me; I want it to scale out as fast as possible. And, from the documentation, it seems that the smallest period for the ALB RequestCountPerTarget metric is 60 seconds:

Only a period greater than 60s is supported for metrics in the "AWS/" namespaces

Manual solution: right now, I can go to the CloudWatch console, find the HIGH and LOW alarms created for me, and edit the HIGH alarm (the one that triggers a scale out). I set the alarm's "Period" to 60 seconds, "DatapointsToAlarm" to 1 (trigger the scale-out action as soon as one datapoint breaches), "EvaluationPeriods" to 1 (take into account the previous 60-second period only), and "Threshold" to 500 (if there are more than 500 requests per target in the past 60 seconds, add capacity, i.e. scale out).

To test, I use JMeter and send a ton of requests and I can see that the alarm goes off in a minute or so and my ECS service adjusts the desired count of running tasks. This all works very well.

But now, we're all supposed to write infrastructure as code (IaC), right? So, I want the above console tweaks to be included in my CloudFormation templates. And that's where the problems occur. What I've done:

I added two new alarms to my CloudFormation template: one for HIGH (scale out) and one for LOW (scale in),
the two new alarms point to the existing scaling policy.

What happens:

the alarms DO go into ALARM state within 60 seconds,
the policy tries to run the action (scale out) but there is an error:

Failed to execute action
arn:aws:autoscaling:us-east-1:MY-ACCOUNT-ID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster/my-ecs-service:policyName/alb-requests-per-target-per-minute. Received error: ""

I have no idea what that means other than to tell me: alarm was raised correctly, we tried to act on it and scale out, but it failed (and no error provided!).

I compared this error with the successful actions (the ones from the alarms that AWS creates automatically with the autoscaling policy, which trigger at t=3 minutes). The only difference I see is that the action ARN in the error message is missing the "createdBy" suffix that is appended to the action ARNs when the "default" alarms (the ones provisioned automatically when CloudFormation creates the autoscaling policy) trigger the action:

Successfully executed action arn:aws:autoscaling:us-east-1:MY-ACCOUNT-ID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster-cce/my-ecs-service:policyName/tpg-cce-cpu-target-tracking-scaling-policy:createdBy/59b3e5ac-81ae-490f-8ecb-00241506a15e

Successfully executed action arn:aws:autoscaling:us-east-1:MY-ACCOUNT-ID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster-cce/my-ecs-service:policyName/alb-requests-per-target-per-minute:createdBy/ffacb0ac-2456-4751-b9c0-b909c66e9868

Note the difference above: the "createdBy" suffix is missing from the policy ARN when the action is triggered by my custom alarms. I have no idea how to obtain it because, in CloudFormation, I use Ref to get the policy ARN, and there is no mention of appending any "createdBy" to the end of it. This might not be the issue at all; it is simply the only difference I have found so far, so it may be a red herring.
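One way to check the "createdBy" theory is to expose the exact value that Ref returns for the policy and compare it with the action ARNs listed above. The fragment below is a minimal sketch, assuming an Outputs section is added to the template in this question; the output name ALBScalingPolicyArn is hypothetical, and ServiceScalingPolicyALB is the policy's logical ID from the template.

```yaml
# Hypothetical Outputs section; not part of the original template.
Outputs:
  ALBScalingPolicyArn:
    Description: ARN returned by Ref for the ALB target tracking scaling policy
    # Ref on an AWS::ApplicationAutoScaling::ScalingPolicy returns the policy ARN.
    Value: !Ref ServiceScalingPolicyALB
```

After a stack update, the value appears in the stack's Outputs tab and can be compared character by character with the ARN shown in the "Failed to execute action" message.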

Another clue, maybe, is what happens when I go to the CloudWatch console to look at the alarms:

I CAN edit the CloudWatch HIGH alarm that AWS created automatically when I create my policy,
I CANNOT edit my custom "scale out" alarm (the one at the bottom of my CloudFormation template below). The error in the console when I tried to edit my custom alarm is:

Cannot edit myCloudFormationStack-ALBRequestsScaleOutAlarm-E4VY9ZOJ5DOF as it is an Auto Scaling alarm with target tracking scaling policies.

Yet another difference between my alarm and the one automatically created with the policy: when I look in the AWS console, in CloudWatch->Alarms, and I look at the alarms details, the "Actions" section looks different. For the automatically provisioned alarm, I see this:

When in alarm, execute this action arn:aws:autoscaling:us-east-1:MYACCOUNTID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster-cce/my-service:policyName/alb-requests-per-target-per-minute:createdBy/ffacb0ac-2456-4751-b9c0-b909c66e9868

But for my own alarm (defined in the CloudFormation template below), I see this in the alarm details (Actions):

When in alarm, use policy alb-requests-per-target-per-minute (Maintain the metric ALBRequestCountPerTarget at the target value of 1000.)

Here is my full CloudFormation template:

AWSTemplateFormatVersion: '2010-09-09'
Description: ECS task definition, service, and hooks it up to the ALB via a Target Group

# IMPORTANT: this needs the first Cloudformation layers in place (see the imports below)

Parameters:
  ContainerImageIdParam:
    Description: The ECR container image ID and tag to deploy
    Type: String
    Default: MYACCOUNTID.dkr.ecr.us-east-1.amazonaws.com/myapp:v10

  JDBCUrlParam:
    Description: The JDBC URL to the RDS database (use the Route53 DNS entry to your database, and NOT the AWS URL)
    Type: String
    Default: jdbc-secretsmanager:mysql://mysql.MYPIERREDNS.org:3306/MYDATABASE?useSSL=false&allowPublicKeyRetrieval=true&serverTimezone=UTC

Resources:
  Task:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: myapp
      Cpu: 512
      Memory: 1024
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !ImportValue ECSTaskExecutionRole
      TaskRoleArn: !ImportValue ECSTaskRole
      ContainerDefinitions:
        - Name: myapp-container
          Image: !Ref ContainerImageIdParam
          Cpu: 512
          Memory: 1024
          Environment: # CloudFormation property names are case-sensitive (Environment/Name/Value)
            - Name: JDBC_DB_URL
              Value: !Ref JDBCUrlParam
            - Name: JDBC_DB_DRIVER_CLASS
              Value: com.amazonaws.secretsmanager.sql.AWSSecretsManagerMySQLDriver
            - Name: JDBC_DB_USERNAME
              Value: dev/myapp/mysql
            - Name: JDBC_DB_PASSWORD
              Value: notUsedButECSWillErrorIfMissing
            - Name: DB_NUM_THREADS
              Value: '10' # environment values are strings
          PortMappings:
            - ContainerPort: 9000
              Protocol: tcp
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: /ecs/myapp
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: myapp-app

  Service:
    Type: AWS::ECS::Service
    DependsOn: ListenerRule
    Properties:
      ServiceName: myapp-service # todo: if we remove the name, one will be generated automatically
      TaskDefinition: !Ref Task
      Cluster: !ImportValue ECSCluster
      LaunchType: FARGATE
      DesiredCount: 1 # set this to 0 if CloudFormation has issues creating this stack (otherwise it takes 3 hours and then fails/times out)
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 0 
      HealthCheckGracePeriodSeconds: 30
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          Subnets:
            - !ImportValue PrivateSubnet1
            - !ImportValue PrivateSubnet2
          SecurityGroups:
            - !ImportValue ECSServiceSecurityGroup
      LoadBalancers:
        - ContainerName: myapp-container
          ContainerPort: 9000
          TargetGroupArn: !Ref TargetGroup

  TargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: myapp-tg
      VpcId: !ImportValue VPC
      Port: 9000
      Protocol: HTTP
      Matcher:
        HttpCode: 200-299
      HealthCheckIntervalSeconds: 30
      HealthCheckPath: /myapp-svc/index.html
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 10
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 6
      TargetType: ip
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: 30

  ListenerRule:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    Properties:
      ListenerArn: !ImportValue LoadBalancerListenerHTTPS
      Priority: 20
      Conditions:
        - Field: path-pattern
          Values:
            - /myapp-svc/*
      Actions:
        - TargetGroupArn: !Ref TargetGroup
          Type: forward

  ECSAutoScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 6
      MinCapacity: 1
      ResourceId: !Join ["/", [service, !ImportValue ECSCluster, !GetAtt Service.Name]] # service/clusterName/serviceName = service/ecs-cluster-myapp/myapp-service
      RoleARN: !Sub 'arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService'
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs

  CPUUtilizationAutoScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: myapp-cpu-target-tracking-scaling-policy
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref ECSAutoScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        DisableScaleIn: true # disable scale in for this policy to give ALBRequestPolicy the priority on scale in decisions
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        ScaleInCooldown: 300
        ScaleOutCooldown: 30
        TargetValue: 50 # Average 50% CPU utilization


  ServiceScalingPolicyALB:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: alb-requests-per-target-per-minute
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref ECSAutoScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 1000
        ScaleInCooldown: 300
        ScaleOutCooldown: 30
        PredefinedMetricSpecification:
          PredefinedMetricType: ALBRequestCountPerTarget
          ResourceLabel: !Join
            - '/'
            - - !ImportValue ECSLoadBalancerFullName
              - !GetAtt TargetGroup.TargetGroupFullName

  # NOTE: the ALB RequestCountPerTarget metric alarms are automatically
  # created when we use that policy. But if we want a different evaluation period,
  # we need to define our own alarms. So, the new scale IN/OUT alarms are included below.

  # SCALE OUT ALARM: if the total (SUM) of ALB requests per target is ABOVE the
  # threshold a certain number of times in the past period, THEN send "scale out" alarm.
  ALBRequestsScaleOutAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      MetricName: RequestCountPerTarget
      Namespace: AWS/ApplicationELB # Only a period greater than 60s is supported for metrics in the "AWS/" namespaces
      ActionsEnabled: true
      AlarmActions:
        - !Ref ServiceScalingPolicyALB
      # OKActions: []
      # InsufficientDataActions: []
      Statistic: Sum
      Dimensions:
        - Name: LoadBalancer
          Value: !ImportValue ECSLoadBalancerFullName
        - Name: TargetGroup
          Value: !GetAtt TargetGroup.TargetGroupFullName
      Period: 60     # evaluation period (in seconds) = 1 datapoint
      EvaluationPeriods: 1 # number of previous periods to take into account
      DatapointsToAlarm: 1 # number of datapoints above threshold needed to generate an alarm
      Threshold: 1000  # alarm threshold: more than 1000 requests
      Unit: None
      ComparisonOperator: GreaterThanThreshold

  # SCALE IN ALARM: if the total (SUM) of ALB requests per target is BELOW the
  # threshold a certain number of times in the past period, THEN send "scale in" alarm.
  ALBRequestsScaleInAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      MetricName: RequestCountPerTarget
      Namespace: AWS/ApplicationELB
      Statistic: Sum
      Period: 60     # evaluation period (in seconds) = 1 datapoint
      EvaluationPeriods: 5 # number of previous periods to take into account
      DatapointsToAlarm: 1 # number of datapoints below threshold needed to generate an alarm
      Threshold: 500  # alarm threshold: less than 500 requests
      Unit: None
      AlarmActions:
        - !Ref ServiceScalingPolicyALB
      OKActions:
        - !Ref ServiceScalingPolicyALB
      Dimensions:
        - Name: LoadBalancer
          Value: !ImportValue ECSLoadBalancerFullName
        - Name: TargetGroup
          Value: !GetAtt TargetGroup.TargetGroupFullName
      ComparisonOperator: LessThanThreshold

Q1: What am I doing wrong? How do I specify an ACTION in CloudFormation so that my alarm triggers the same action as the one triggered by the auto-provisioned AWS alarms (the ones that get created automatically when I create my policy)?

Q2: Is there a way to look at the ACTIONs in the AWS console? I'm guessing that AWS hides these things under the covers (maybe Lambda or something else?).

Q3: Does anyone have another way to do this? Maybe with step scaling? I'm willing to trigger in less than 60 seconds as well, so maybe I should move away from target tracking?
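For Q3, a minimal sketch of what a step scaling variant might look like, assuming it is added under Resources in the template above. The resource names ALBRequestsStepScaleOutPolicy and ALBRequestsStepScaleOutAlarm, the policy name, and the step size of one task are hypothetical; ECSAutoScalingTarget, TargetGroup and the ECSLoadBalancerFullName import come from the template. This is not a verified fix for the error above, only an illustration of an alarm that points at a StepScaling policy instead of a TargetTrackingScaling one.

```yaml
  # Hypothetical step scaling policy: adds one task each time its alarm fires.
  ALBRequestsStepScaleOutPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: alb-requests-step-scale-out
      PolicyType: StepScaling
      ScalingTargetId: !Ref ECSAutoScalingTarget
      StepScalingPolicyConfiguration:
        AdjustmentType: ChangeInCapacity
        Cooldown: 30
        MetricAggregationType: Average
        StepAdjustments:
          - MetricIntervalLowerBound: 0 # any breach above the alarm threshold
            ScalingAdjustment: 1        # assumed step size: add one task

  # Hypothetical custom alarm: 1 breaching datapoint in 1 period of 60 seconds.
  ALBRequestsStepScaleOutAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: AWS/ApplicationELB
      MetricName: RequestCountPerTarget
      Statistic: Sum
      Dimensions:
        - Name: LoadBalancer
          Value: !ImportValue ECSLoadBalancerFullName
        - Name: TargetGroup
          Value: !GetAtt TargetGroup.TargetGroupFullName
      Period: 60
      EvaluationPeriods: 1
      DatapointsToAlarm: 1
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref ALBRequestsStepScaleOutPolicy # Ref returns the policy ARN
```

A matching scale-in policy would use a negative ScalingAdjustment with MetricIntervalUpperBound: 0 and a LessThanThreshold alarm.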

If anyone has a working sample of a CloudFormation template that triggers in one minute or less based on the number of requests to the ALB, that would definitely be AWESOME :) I put the part about triggering in less than one minute in a separate question: "ECS Fargate autoscaling more rapidly?"
