
I am using Terraform to set up a small Fargate cluster of three Apache server tasks. The tasks hang in the PENDING state, then the cluster stops them and creates new pending tasks, and the cycle continues.

The AWS docs say it could be because:

  • The Docker daemon is unresponsive

The docs say to set up CloudWatch to see CPU usage and increase the container size if needed. I have raised the CPU and memory to 1024 and 2048, which didn't fix the problem.

  • The Docker image is large

Unlikely? The image is nothing but httpd:2.4

  • The ECS container agent lost connectivity with the Amazon ECS service in the middle of a task launch

The docs provide some commands to run in the container instance. To do this it looks like I have to either set up AWS Systems Manager or SSH in directly. I will take this route if I can't find any problems with my Terraform config.

  • The ECS container agent takes a long time to stop an existing task

Unlikely because I am launching a completely new ECS cluster


Below are the ECS and IAM sections of my Terraform file. Why might my Fargate tasks be stuck on pending?

```hcl
#
# ECS
#
resource "aws_ecs_cluster" "main" {
  name = "main-ecs-cluster"
}

resource "aws_ecs_task_definition" "app" {
  family                   = "app"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = aws_iam_role.task_execution.arn
  task_role_arn            = aws_iam_role.task_execution.arn
  container_definitions = <<DEFINITION
  [
    {
      "image": "httpd:2.4",
      "cpu": 256,
      "memory": 512,
      "name": "app",
      "networkMode": "awsvpc",
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80,
          "protocol": "tcp"
        }
      ]
    }
  ]
  DEFINITION
}

resource "aws_ecs_service" "main" {
  name            = "tf-ecs-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    security_groups = [aws_security_group.main.id]
    subnets         = [
      aws_subnet.public1.id,
      aws_subnet.public2.id,
    ]
  }
}

#
# IAM
#
resource "aws_iam_role" "task_execution" {
  name               = "my-first-service-task-execution-role"
  assume_role_policy = data.aws_iam_policy_document.task_execution.json
}

data "aws_iam_policy_document" "task_execution" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role_policy_attachment" "task_execution" {
  role       = aws_iam_role.task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```

brietsparks
  • There could be many reasons: wrong credentials, or no connection to the container registry to pull the image (e.g. ECR). If you go to the ECS console and open the task or service, there should be a message explaining why it fails to launch. Have you checked the ECS console and tasks for any messages? – Marcin May 23 '20 at 04:45
  • In the ECS console I see `Stopped reason: Task failed to start` – brietsparks May 23 '20 at 05:02
  • But if you go to the details, like in this [screenshot](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/images/stopped_container_status_reason.png), there is usually more info. Is there nothing more in your case? – Marcin May 23 '20 at 05:15
  • Since you use `awsvpc`, check whether you enabled a public IP for the tasks (assuming you run your ECS service in a public subnet). Also, which container instances are you referring to? Fargate does not give you instances to log in to or run commands on. – Marcin May 23 '20 at 05:29
  • I got `CannotPullContainerError: Error response from daem`. I don't have an ECR instance. I'm guessing that could be why – brietsparks May 23 '20 at 05:37
  • `httpd` is pulled from Docker Hub. It seems the task can't connect to the hub. Do you have internet access set up, such as a public IP for the task? – Marcin May 23 '20 at 05:38
  • No, the tasks are in a private subnet. It looks like my options are a public subnet, a NAT gateway, or pulling an image that exists in ECR. [source](https://aws.amazon.com/blogs/compute/task-networking-in-aws-fargate/) – brietsparks May 23 '20 at 05:55
  • So this is the reason (at least the apparent one) why it does not work. And yes, you are correct. These are the options. If you don't mind I will provide the answer to clarify the issue is lack of internet access. – Marcin May 23 '20 at 05:57
  • That's fine, and thank you very much for the help. I'll give this a shot and accept the answer if it works, or else comment with an update if otherwise – brietsparks May 23 '20 at 06:01
  • No problem and thanks as well. – Marcin May 23 '20 at 06:03

2 Answers


Based on the discussion in the comments, the issue is that the Fargate tasks have no internet access.

The tasks run in a private subnet, while the task definition uses the `httpd` image from Docker Hub. Pulling images from Docker Hub requires internet access.

Possible solutions are a NAT gateway or NAT instance, running the tasks in a public subnet, or hosting a custom image in ECR.
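
As a rough illustration of the public subnet option, the service from the question could request a public IP for each task. This is only a sketch: it reuses the resource names from the question and assumes that `aws_subnet.public1` and `aws_subnet.public2` have a route to an internet gateway and that `aws_security_group.main` allows outbound traffic.

```hcl
resource "aws_ecs_service" "main" {
  name            = "tf-ecs-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    security_groups = [aws_security_group.main.id]
    subnets = [
      aws_subnet.public1.id,
      aws_subnet.public2.id,
    ]

    # Defaults to false. Without a public IP (or a NAT route), the task
    # has no path to Docker Hub to pull httpd:2.4.
    assign_public_ip = true
  }
}
```

A public IP alone is not sufficient if the subnet's route table has no default route to an internet gateway, so that is worth checking first.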

Marcin
  • public subnet + [`assign_public_ip`](https://www.terraform.io/docs/providers/aws/r/ecs_service.html#network_configuration-1) fixed the issue... thanks again! – brietsparks May 23 '20 at 06:19

A public subnet with a public IP may not be the right solution, for security reasons.

Consider placing your tasks in private subnets instead.

  1. You can still pull images if you give the private subnets a connection to the internet through a NAT gateway. [Diagram: pulling an image from ECR through a NAT gateway]

Or you can use a better solution:

  2. ECS Fargate tasks can pull images from ECR even when they run in a private subnet with no connection to the internet, by using AWS PrivateLink (VPC endpoints) for ECR. [Diagram: pulling an image using PrivateLink VPC endpoints]
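
The PrivateLink option could look roughly like the sketch below. Every name in it is hypothetical and not from the question: `var.region`, `aws_vpc.main`, `aws_subnet.private1`, `aws_subnet.private2`, `aws_route_table.private`, and `aws_security_group.endpoints` are assumptions. It also assumes the image has been pushed to ECR, since these endpoints do not provide access to Docker Hub.

```hcl
# Assumed input; replace with the region you deploy to.
variable "region" {
  type    = string
  default = "us-east-1"
}

# Interface endpoints for the ECR API and the Docker registry API.
resource "aws_vpc_endpoint" "ecr" {
  for_each = toset(["ecr.api", "ecr.dkr"])

  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private1.id, aws_subnet.private2.id]
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

# ECR stores image layers in S3, so the private route table also
# needs a gateway endpoint for S3.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}
```

For this to work, the endpoint security group has to allow inbound HTTPS (port 443) from the tasks, and the VPC needs DNS support and DNS hostnames enabled so that private DNS resolves to the endpoints.
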
szyjek