
AWS ECS cluster services do not start new tasks.

Already checked:

  • The ECS EC2 instances are registered and active, with full CPU and memory available, and the ECS agent is connected.
  • There are no events in the ECS service "Events" tab: nothing about registering, starting, or stopping, and no errors. It is simply empty.
  • The registered EC2 instances are set up correctly; the same AMI works perfectly in another cluster.
  • The task definition is correct; it worked the day before and has not changed since.
  • The service role contains all relevant policies.

Querying ECS with the AWS CLI (`aws ecs describe-services --services my-service --cluster my-cluster`) shows that the deployment rollout stays IN_PROGRESS indefinitely. The full response with the configuration follows (real names and IDs are substituted):

{
        "serviceArn": "arn:aws:ecs:eu-central-1:my-account-id:service/my-cluster/my-service",
        "serviceName": "my-service",
        "clusterArn": "arn:aws:ecs:eu-central-1:my-account-id:cluster/my-cluster",
        "loadBalancers": [
            {
                "targetGroupArn": "arn:aws:elasticloadbalancing:eu-central-1:my-account-id:targetgroup/my-service-lb/load-balancer-id",
                "containerName": "my-service",
                "containerPort": 8065
            }
        ],
        "serviceRegistries": [
            {
                "registryArn": "arn:aws:servicediscovery:eu-central-1:my-account-id:service/srv-srv_id",
                "containerName": "my-service",
                "containerPort": 8065
            }
        ],
        "status": "ACTIVE",
        "desiredCount": 1,
        "runningCount": 0,
        "pendingCount": 0,
        "launchType": "EC2",
        "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
        "deploymentConfiguration": {
            "deploymentCircuitBreaker": {
                "enable": false,
                "rollback": false
            },
            "maximumPercent": 200,
            "minimumHealthyPercent": 100
        },
        "deployments": [
            {
                "id": "ecs-svc/deployment_id",
                "status": "PRIMARY",
                "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
                "desiredCount": 1,
                "pendingCount": 0,
                "runningCount": 0,
                "failedTasks": 0,
                "createdAt": "2022-06-28T09:15:08.241000+02:00",
                "updatedAt": "2022-06-28T09:15:08.241000+02:00",
                "launchType": "EC2",
                "rolloutState": "IN_PROGRESS",
                "rolloutStateReason": "ECS deployment ecs-svc/deployment_id in progress."
            }
        ],
        "roleArn": "arn:aws:iam::my-account-id:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS",
        "events": [],
        "createdAt": "2022-06-28T09:15:08.241000+02:00",
        "placementConstraints": [],
        "placementStrategy": [
            {
                "type": "spread",
                "field": "attribute:ecs.availability-zone"
            }
        ],
        "healthCheckGracePeriodSeconds": 120,
        "schedulingStrategy": "REPLICA",
        "createdBy": "arn:aws:iam::my-account-id:role/my-role",
        "enableECSManagedTags": false,
        "propagateTags": "NONE",
        "enableExecuteCommand": false
    }
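
The symptom in this response (tasks desired, nothing running, pending, or failed, and an empty event list) can be checked mechanically. Below is a minimal Python sketch that takes one service object from the parsed `describe-services` output and reports whether it matches that pattern. The function name `is_silently_stalled` and the exact criteria are assumptions chosen for illustration, not an official ECS health definition.

```python
def is_silently_stalled(service: dict) -> bool:
    """Return True when the service wants tasks but shows no scheduler activity.

    `service` is one service object as printed by `aws ecs describe-services`.
    The criteria are an illustrative assumption, not an ECS-defined state.
    """
    # Find the PRIMARY deployment, if there is one.
    primary = next(
        (d for d in service.get("deployments", []) if d.get("status") == "PRIMARY"),
        None,
    )
    if primary is None:
        return False

    wants_tasks = primary.get("desiredCount", 0) > 0
    no_tasks = (
        primary.get("runningCount", 0) == 0
        and primary.get("pendingCount", 0) == 0
        and primary.get("failedTasks", 0) == 0
    )
    in_progress = primary.get("rolloutState") == "IN_PROGRESS"
    no_events = not service.get("events")
    return wants_tasks and no_tasks and in_progress and no_events


# Reduced copy of the response shown above (only the fields the check reads).
sample = {
    "serviceName": "my-service",
    "events": [],
    "deployments": [
        {
            "status": "PRIMARY",
            "desiredCount": 1,
            "pendingCount": 0,
            "runningCount": 0,
            "failedTasks": 0,
            "rolloutState": "IN_PROGRESS",
        }
    ],
}
print(sample["serviceName"], is_silently_stalled(sample))
```

A check like this only confirms the symptom; it does not say why the scheduler never places a task.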

The ECS service and the service discovery entry are created with Terraform. The definitions are:

resource "aws_service_discovery_service" "ecs_discovery_service" {
  name = var.service_name

  dns_config {
    namespace_id = var.service_discovery_hosted_zone_id

    dns_records {
      ttl  = 10
      type = "SRV"
    }
  }

  health_check_custom_config {
    failure_threshold = 1
  }
}

resource "aws_ecs_service" "ecs_service" {
    name                               = var.service_name
    cluster                            = var.ecs_cluster_id
    task_definition                    = var.task_definition_arn
    desired_count                      = var.desired_count
    deployment_minimum_healthy_percent = 100
    deployment_maximum_percent         = 200
    health_check_grace_period_seconds  = var.health_check_grace_period_seconds
    
    # These arguments are only valid inside a load_balancer block.
    load_balancer {
        target_group_arn = aws_lb_target_group.target_group.arn
        container_name   = var.service_name
        container_port   = var.service_container_port
    }

    
    ordered_placement_strategy {
        type  = "spread"
        field = "attribute:ecs.availability-zone"
    }

    service_registries {
        registry_arn   = aws_service_discovery_service.ecs_discovery_service.arn
        container_name = var.service_name
        container_port = var.service_container_port
    }
}

This code used to work fine. Without any changes to the infrastructure code, after destroying and re-applying it, ECS no longer starts any new tasks.

I could narrow the problem down to service discovery: if I remove the service_registries section, the tasks start as normal. Removing service discovery works around the issue, but it is not a proper solution, and I don't understand the cause of the problem.

Again, the service role has the permissions for service discovery:

"servicediscovery:DeregisterInstance",
"servicediscovery:Get*",
"servicediscovery:List*",
"servicediscovery:RegisterInstance",
"servicediscovery:UpdateInstanceCustomHealthStatus"

I can't find any way to trace this strange behaviour and would like to ask for help:

  • Could you give me any hints on what or where to check? I've gone through multiple troubleshooting guides, but all of them rely on events in the ECS service, and I have none. Everything else I could think of has been checked.
  • Do you know why service discovery might block ECS from starting new tasks? I thought ECS adds an SRV record to the registry once the container has started and is healthy, but I could not see that any containers were started at all.

I would be very thankful for any hints. Let me know if you need more details. Have a nice day and best regards.

Eugene
  • It looks like your service discovery container isn't able to fully start -- maybe there was a networking change that prevents it from connecting somewhere? – Parsifal Jun 28 '22 at 12:32
  • For debugging, I would turn to CloudTrail to verify that ECS is attempting to start the service (look for a `RunTask` event). You could also look at the "Stopped" tasks tab in the console to see if there's any information there. – Parsifal Jun 28 '22 at 12:34
  • Thanks for the ideas, @Parsifal. There is nothing in the stopped tasks either, and no changes in the network configuration. I can only guess the issue is specific to my AWS account, so I contacted support. – Eugene Jun 28 '22 at 12:55
  • Also, @Parsifal, thanks for pointing me to CloudTrail. I can see some RunTask events; I'm not sure I can find the issue there, but I will keep checking. – Eugene Jun 28 '22 at 12:57
  • @Eugene Could you find out what it is? I have a similar problem here, a task which worked fine before suddenly just stopped. No changes in config. Deleting the service and re-creating it, it just sits there dormant without any movement. No events in the events tab, no tasks, just nothing at all. – tom_w Sep 22 '22 at 07:44
  • Just answering my own comment from above. After trying to recreate the same task manually, I spotted an empty environment variable. It had a key, but no value. Removing it resolved this issue. – tom_w Sep 22 '22 at 09:05
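
The last comment points at an environment variable that had a key but no value. A minimal Python sketch of that check is shown below. It assumes the task definition has been exported as JSON (for example with `aws ecs describe-task-definition`) and that the `taskDefinition` object is passed in. The function name `find_empty_env_vars` and the sample variable names are hypothetical, and whether this cause applies to the original question is not confirmed in the thread.

```python
from typing import List, Tuple


def find_empty_env_vars(task_definition: dict) -> List[Tuple[str, str]]:
    """Return (container name, variable name) pairs with a missing or empty value.

    `task_definition` is expected to look like the "taskDefinition" object
    from `aws ecs describe-task-definition` (an assumption of this sketch).
    """
    empty = []
    for container in task_definition.get("containerDefinitions", []):
        for var in container.get("environment", []):
            # A key without a value, or with an empty string, is flagged.
            if not var.get("value"):
                empty.append((container.get("name", "?"), var.get("name", "?")))
    return empty


# Hypothetical task definition fragment; the variable names are made up.
sample_task_definition = {
    "containerDefinitions": [
        {
            "name": "my-service",
            "environment": [
                {"name": "LOG_LEVEL", "value": "info"},
                {"name": "API_KEY", "value": ""},
            ],
        }
    ]
}
print(find_empty_env_vars(sample_task_definition))
```

Any pair the function returns is a candidate to remove or fill in before redeploying the service.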
