AWS ECS cluster services do not start new tasks.
Already checked:
- The ECS EC2 container instances are registered and active, with full CPU and memory available, and the ECS agent is connected.
- There are no events in the ECS service's "Events" tab: nothing about registering, starting, or stopping tasks, no errors; it is simply empty.
- The registered EC2 instances are set up correctly; in another cluster the same AMI works perfectly.
- The task definition is correct; it still worked a day earlier, and nothing has changed since.
- The service role contains all relevant policies (see the list further below).
Querying ECS with the AWS CLI, aws ecs describe-services --services my-service --cluster my-cluster, shows that the deployment rollout is permanently IN_PROGRESS and never changes.
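For reference, the relevant fields can be extracted directly with a JMESPath --query filter (cluster and service names are the placeholders used throughout this post):

aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].deployments[0].[rolloutState,rolloutStateReason]' \
  --output text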
The full response with the configuration is below (I've substituted the real names and IDs):
{
    "serviceArn": "arn:aws:ecs:eu-central-1:my-account-id:service/my-cluster/my-service",
    "serviceName": "my-service",
    "clusterArn": "arn:aws:ecs:eu-central-1:my-account-id:cluster/my-cluster",
    "loadBalancers": [
        {
            "targetGroupArn": "arn:aws:elasticloadbalancing:eu-central-1:my-account-id:targetgroup/my-service-lb/load-balancer-id",
            "containerName": "my-service",
            "containerPort": 8065
        }
    ],
    "serviceRegistries": [
        {
            "registryArn": "arn:aws:servicediscovery:eu-central-1:my-account-id:service/srv-srv_id",
            "containerName": "my-service",
            "containerPort": 8065
        }
    ],
    "status": "ACTIVE",
    "desiredCount": 1,
    "runningCount": 0,
    "pendingCount": 0,
    "launchType": "EC2",
    "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
    "deploymentConfiguration": {
        "deploymentCircuitBreaker": {
            "enable": false,
            "rollback": false
        },
        "maximumPercent": 200,
        "minimumHealthyPercent": 100
    },
    "deployments": [
        {
            "id": "ecs-svc/deployment_id",
            "status": "PRIMARY",
            "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
            "desiredCount": 1,
            "pendingCount": 0,
            "runningCount": 0,
            "failedTasks": 0,
            "createdAt": "2022-06-28T09:15:08.241000+02:00",
            "updatedAt": "2022-06-28T09:15:08.241000+02:00",
            "launchType": "EC2",
            "rolloutState": "IN_PROGRESS",
            "rolloutStateReason": "ECS deployment ecs-svc/deployment_id in progress."
        }
    ],
    "roleArn": "arn:aws:iam::my-account-id:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS",
    "events": [],
    "createdAt": "2022-06-28T09:15:08.241000+02:00",
    "placementConstraints": [],
    "placementStrategy": [
        {
            "type": "spread",
            "field": "attribute:ecs.availability-zone"
        }
    ],
    "healthCheckGracePeriodSeconds": 120,
    "schedulingStrategy": "REPLICA",
    "createdBy": "arn:aws:iam::my-account-id:role/my-role",
    "enableECSManagedTags": false,
    "propagateTags": "NONE",
    "enableExecuteCommand": false
}
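Note that failedTasks is 0 and the deployment never seems to produce a single task. If any tasks had been started and stopped, they (and their stoppedReason) would show up like this (<task-arn> being a task ARN returned by the first command):

aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> --query 'tasks[].stoppedReason'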
The ECS service and the service discovery entry are created with Terraform; the service definition is:
resource "aws_service_discovery_service" "ecs_discovery_service" {
name = var.service_name
dns_config {
namespace_id = var.service_discovery_hosted_zone_id
dns_records {
ttl = 10
type = "SRV"
}
}
health_check_custom_config {
failure_threshold = 1
}
}
resource "aws_ecs_service" "ecs_service" {
name = var.service_name
cluster = var.ecs_cluster_id
task_definition = var.task_definition_arn
desired_count = var.desired_count
deployment_minimum_healthy_percent = 100
deployment_maximum_percent = 200
health_check_grace_period_seconds = var.health_check_grace_period_seconds
target_group_arn = aws_lb_target_group.target_group.arn
container_name = var.service_name
container_port = var.service_container_port
ordered_placement_strategy {
type = "spread"
field = "attribute:ecs.availability-zone"
}
service_registries {
registry_arn = aws_service_discovery_service.ecs_discovery_service.arn
container_name = var.service_name
container_port = var.service_container_port
}
}
This code used to work fine. Without any changes to the infrastructure code, after destroying and re-applying it, ECS no longer starts any tasks.
I was able to narrow the problem down to service discovery: if I remove the service_registries section, tasks start as normal. Removing service discovery works around the issue, but it is not a proper solution, and I don't understand the root cause.
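The Cloud Map side can also be inspected directly from the CLI; srv-srv_id here is the placeholder service ID from the describe-services output above:

aws servicediscovery get-service --id srv-srv_id
aws servicediscovery list-instances --service-id srv-srv_id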
Again, the service role has the permissions for service discovery:
"servicediscovery:DeregisterInstance",
"servicediscovery:Get*",
"servicediscovery:List*",
"servicediscovery:RegisterInstance",
"servicediscovery:UpdateInstanceCustomHealthStatus"
I can't find any way to trace this strange behaviour, so I want to ask you guys for help:
- Could you give me any hints on what or where I could check? I've gone through multiple troubleshooting guides, but all of them rely on the events in the ECS service, and I don't have any; everything else I had in mind has been checked.
- Maybe you know why service discovery could block ECS from starting new tasks? My understanding is that ECS adds an SRV record to the registry once it has started the container and the container is healthy, but I could not see that any containers were started at all (see the record check below).
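To see whether any SRV records were ever written, the Route 53 zone behind the namespace can be queried; my-namespace-id and my-hosted-zone-id are placeholders (the namespace is the one passed in as var.service_discovery_hosted_zone_id):

aws servicediscovery get-namespace --id my-namespace-id   # Properties.DnsProperties.HostedZoneId is the Route 53 zone
aws route53 list-resource-record-sets \
  --hosted-zone-id my-hosted-zone-id \
  --query "ResourceRecordSets[?Type=='SRV']"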
I would be very thankful for any hints; let me know if you need any more details. Have a nice day and best regards.