AWS ECS cluster services do not start new tasks.
Already checked:
- The ECS EC2 container instances are registered and active, with full CPU and memory available, and the ECS agent is connected.
- There are no events in the ECS service's "Events" tab: nothing about registering, starting, or stopping tasks, no errors; it is simply empty.
- The registered EC2 instances are set up correctly; in another cluster the same AMI works perfectly.
- The task definition is correct; it still worked a day earlier, and nothing has changed since.
- The service role contains all relevant policies (see the list further below).
Querying ECS with the AWS CLI, aws ecs describe-services --services my-service --cluster my-cluster, shows that the deployment rollout is permanently IN_PROGRESS and never changes.
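For reference, the relevant fields can be extracted directly with a JMESPath --query filter (cluster and service names are the placeholders used throughout this post):

aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].deployments[0].[rolloutState,rolloutStateReason]' \
  --output text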
The full response with the configuration is below (I've substituted the real names and IDs):
{
    "serviceArn": "arn:aws:ecs:eu-central-1:my-account-id:service/my-cluster/my-service",
    "serviceName": "my-service",
    "clusterArn": "arn:aws:ecs:eu-central-1:my-account-id:cluster/my-cluster",
    "loadBalancers": [
        {
            "targetGroupArn": "arn:aws:elasticloadbalancing:eu-central-1:my-account-id:targetgroup/my-service-lb/load-balancer-id",
            "containerName": "my-service",
            "containerPort": 8065
        }
    ],
    "serviceRegistries": [
        {
            "registryArn": "arn:aws:servicediscovery:eu-central-1:my-account-id:service/srv-srv_id",
            "containerName": "my-service",
            "containerPort": 8065
        }
    ],
    "status": "ACTIVE",
    "desiredCount": 1,
    "runningCount": 0,
    "pendingCount": 0,
    "launchType": "EC2",
    "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
    "deploymentConfiguration": {
        "deploymentCircuitBreaker": {
            "enable": false,
            "rollback": false
        },
        "maximumPercent": 200,
        "minimumHealthyPercent": 100
    },
    "deployments": [
        {
            "id": "ecs-svc/deployment_id",
            "status": "PRIMARY",
            "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
            "desiredCount": 1,
            "pendingCount": 0,
            "runningCount": 0,
            "failedTasks": 0,
            "createdAt": "2022-06-28T09:15:08.241000+02:00",
            "updatedAt": "2022-06-28T09:15:08.241000+02:00",
            "launchType": "EC2",
            "rolloutState": "IN_PROGRESS",
            "rolloutStateReason": "ECS deployment ecs-svc/deployment_id in progress."
        }
    ],
    "roleArn": "arn:aws:iam::my-account-id:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS",
    "events": [],
    "createdAt": "2022-06-28T09:15:08.241000+02:00",
    "placementConstraints": [],
    "placementStrategy": [
        {
            "type": "spread",
            "field": "attribute:ecs.availability-zone"
        }
    ],
    "healthCheckGracePeriodSeconds": 120,
    "schedulingStrategy": "REPLICA",
    "createdBy": "arn:aws:iam::my-account-id:role/my-role",
    "enableECSManagedTags": false,
    "propagateTags": "NONE",
    "enableExecuteCommand": false
}
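Note that failedTasks is 0 and the deployment never seems to produce a single task. If any tasks had been started and stopped, they (and their stoppedReason) would show up like this (<task-arn> being a task ARN returned by the first command):

aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> --query 'tasks[].stoppedReason'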
The ECS service and the service discovery entry are created with Terraform; the service definition is:
resource "aws_service_discovery_service" "ecs_discovery_service" {
name = var.service_name
dns_config {
namespace_id = var.service_discovery_hosted_zone_id
dns_records {
ttl = 10
type = "SRV"
}
}
health_check_custom_config {
failure_threshold = 1
}
}
resource "aws_ecs_service" "ecs_service" {
name = var.service_name
cluster = var.ecs_cluster_id
task_definition = var.task_definition_arn
desired_count = var.desired_count
deployment_minimum_healthy_percent = 100
deployment_maximum_percent = 200
health_check_grace_period_seconds = var.health_check_grace_period_seconds
target_group_arn = aws_lb_target_group.target_group.arn
container_name = var.service_name
container_port = var.service_container_port
ordered_placement_strategy {
type = "spread"
field = "attribute:ecs.availability-zone"
}
service_registries {
registry_arn = aws_service_discovery_service.ecs_discovery_service.arn
container_name = var.service_name
container_port = var.service_container_port
}
}
This code used to work fine. Without any changes to the infrastructure code, after destroying and re-applying it, ECS no longer starts any tasks.
I was able to narrow the problem down to service discovery: if I remove the service_registries section, tasks start as normal. Removing service discovery works around the issue, but it is not a proper solution, and I don't understand the root cause.
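The Cloud Map side can also be inspected directly from the CLI; srv-srv_id here is the placeholder service ID from the describe-services output above:

aws servicediscovery get-service --id srv-srv_id
aws servicediscovery list-instances --service-id srv-srv_id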
Again, the service role has the permissions for service discovery:
"servicediscovery:DeregisterInstance",
"servicediscovery:Get*",
"servicediscovery:List*",
"servicediscovery:RegisterInstance",
"servicediscovery:UpdateInstanceCustomHealthStatus"
I can't find any way to trace this strange behaviour, so I want to ask you guys for help:
- Could you give me any hints on what or where I could check? I've gone through multiple troubleshooting guides, but all of them rely on the events in the ECS service, and I don't have any; everything else I had in mind has been checked.
- Maybe you know why service discovery could block ECS from starting new tasks? My understanding is that ECS adds an SRV record to the registry once it has started the container and the container is healthy, but I could not see that any containers were started at all (see the record check below).
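To see whether any SRV records were ever written, the Route 53 zone behind the namespace can be queried; my-namespace-id and my-hosted-zone-id are placeholders (the namespace is the one passed in as var.service_discovery_hosted_zone_id):

aws servicediscovery get-namespace --id my-namespace-id   # Properties.DnsProperties.HostedZoneId is the Route 53 zone
aws route53 list-resource-record-sets \
  --hosted-zone-id my-hosted-zone-id \
  --query "ResourceRecordSets[?Type=='SRV']"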
I would be very thankful for any hints; let me know if you need any more details. Have a nice day and best regards.