I refactored my CDK deployment code to make it more modular, moving the task definition and FargateService code into a separate class, EcsService. Since then, the stack deployment gets stuck on the ECS service because the task cannot pull its image, apparently due to a permission or network issue. The error is shown below, followed by my old and new code.

Error

Task stopped at: 2023-08-31T05:55:55.882Z
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-east-1.amazonaws.com/": dial tcp 44.213.79.50:443: i/o timeout. Please check your task network configuration.

Old Code

securityGroup.addIngressRule(ec2.Peer.anyIpv4(), ec2.Port.tcp(3000));

// Validation
if (!envJSON.ssdDockerImageTag ) {
  throw new Error('Missing ssd-fe image tag.');
}
const cluster = new ecs.Cluster(this, "ssdCluster", { vpc });

// Define the task definition with a container using an image from ECR
const taskDefinition = new ecs.FargateTaskDefinition(this, 'ssdTaskDef');
const container = taskDefinition.addContainer('ssdContainer', {
  image: ecs.ContainerImage.fromEcrRepository(
    ecr.Repository.fromRepositoryName(this, 'ssdRepo', 'ssd-fe'),
    envJSON.ssdDockerImageTag),
  memoryLimitMiB: 512,
  cpu: 256,
  portMappings: [{
    containerPort: 3000
  }],
  environment: {
    NODE_ENV: "production",
    API_BASE_URL: api.url
  }
});

// Create the Fargate Service
const service = new ecs.FargateService(this, 'ssdService', {
  cluster,
  taskDefinition,
  desiredCount: 1,
  vpcSubnets: {
    subnetType: ec2.SubnetType.PUBLIC,
  },
  securityGroups: [securityGroup],
  assignPublicIp: true,
});

LoadBalancer.getInstance(this, 'LoadBalancer', {
  vpc,
  ecsService: service,
});

New Code

securityGroup.addIngressRule(ec2.Peer.anyIpv4(), ec2.Port.tcp(3000));

// Validation
if (!envJSON.ssdDockerImageTag ) {
  throw new Error('Missing ssd-fe image tag.');
}
const cluster = new ecs.Cluster(this, "ssdCluster", { vpc });

// Create ECS Service
const ecsService = new EcsService(this, 'ssdService', {
  vpc,
  securityGroup: securityGroup,
  cluster: cluster,
  repoName: 'ssd-fe',
  imageTag: envJSON.ssdDockerImageTag,
  environment: {
    NODE_ENV: "production",
    API_BASE_URL: api.url
  }
});

LoadBalancer.getInstance(this, 'LoadBalancer', {
  vpc,
  ecsService: ecsService.service,
});

// EcsService class
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecr from 'aws-cdk-lib/aws-ecr';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

interface EcsServiceProps {
  vpc: ec2.IVpc;
  securityGroup: ec2.ISecurityGroup;
  cluster: ecs.ICluster;
  repoName: string;
  imageTag: string;
  environment?: { [key: string]: string };
}

export class EcsService extends Construct {
  public readonly service: ecs.FargateService;

  constructor(scope: Construct, id: string, props: EcsServiceProps) {
    super(scope, id);
    
    const ecrRepository = ecr.Repository.fromRepositoryName(this, `${id}Repo`, props.repoName);
    
    const taskDefinition = new ecs.FargateTaskDefinition(this, `${id}TaskDef`);
    taskDefinition.addContainer(`${id}Container`, {
      image: ecs.ContainerImage.fromEcrRepository(ecrRepository, props.imageTag),
      memoryLimitMiB: 512,
      cpu: 256,
      portMappings: [{ containerPort: 3000 }],
      environment: props.environment,
    });

    this.service = new ecs.FargateService(this, id, {
      cluster: props.cluster,
      taskDefinition,
      desiredCount: 1,
      vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
      securityGroups: [props.securityGroup],
    });
  }
}

IAM Statement Changes

┌───┬───────────────────────────────────────┬────────┬───────────────────────────────────────┬───────────────────────────────────────┬───────────┐
│   │ Resource                              │ Effect │ Action                                │ Principal                             │ Condition │
├───┼───────────────────────────────────────┼────────┼───────────────────────────────────────┼───────────────────────────────────────┼───────────┤
│ - │ *                                     │ Allow  │ ecr:GetAuthorizationToken             │ AWS:${ssdTaskDefExecutionRole469C7625 │           │
│   │                                       │        │                                       │ }                                     │           │
├───┼───────────────────────────────────────┼────────┼───────────────────────────────────────┼───────────────────────────────────────┼───────────┤
│ - │ arn:aws:ecr:us-east-1:533732470418:re │ Allow  │ ecr:BatchCheckLayerAvailability       │ AWS:${ssdTaskDefExecutionRole469C7625 │           │
│   │ pository/ssd-fe                       │        │ ecr:BatchGetImage                     │ }                                     │           │
│   │                                       │        │ ecr:GetDownloadUrlForLayer            │                                       │           │
├───┼───────────────────────────────────────┼────────┼───────────────────────────────────────┼───────────────────────────────────────┼───────────┤
│ + │ ${ssdService/ssdServiceTaskDef/Execut │ Allow  │ sts:AssumeRole                        │ Service:ecs-tasks.amazonaws.com       │           │
│   │ ionRole.Arn}                          │        │                                       │                                       │           │
├───┼───────────────────────────────────────┼────────┼───────────────────────────────────────┼───────────────────────────────────────┼───────────┤
│ + │ ${ssdService/ssdServiceTaskDef/TaskRo │ Allow  │ sts:AssumeRole                        │ Service:ecs-tasks.amazonaws.com       │           │
│   │ le.Arn}                               │        │                                       │                                       │           │
├───┼───────────────────────────────────────┼────────┼───────────────────────────────────────┼───────────────────────────────────────┼───────────┤
│ + │ *                                     │ Allow  │ ecr:GetAuthorizationToken             │ AWS:${ssdService/ssdServiceTaskDef/Ex │           │
│   │                                       │        │                                       │ ecutionRole}                          │           │
├───┼───────────────────────────────────────┼────────┼───────────────────────────────────────┼───────────────────────────────────────┼───────────┤
│ + │ arn:aws:ecr:us-east-1:533732470418:re │ Allow  │ ecr:BatchCheckLayerAvailability       │ AWS:${ssdService/ssdServiceTaskDef/Ex │           │
│   │ pository/ssd-fe                       │        │ ecr:BatchGetImage                     │ ecutionRole}                          │           │
│   │                                       │        │ ecr:GetDownloadUrlForLayer            │                                       │           │
└───┴───────────────────────────────────────┴────────┴───────────────────────────────────────┴───────────────────────────────────────┴───────────┘

ChatGPT suggested that I explicitly add permissions in the EcsService class, so I made the following changes. Even after these changes, the error remains the same.

// Create an execution role
const executionRole = new iam.Role(this, 'ExecutionRole', {
  assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
});

// Grant permissions to the execution role to pull from ECR
executionRole.addToPolicy(new iam.PolicyStatement({
  actions: [
    'ecr:GetAuthorizationToken'
  ],
  resources: ['*'],
}));

const ecrRepository = ecr.Repository.fromRepositoryName(this, `${id}Repo`, props.repoName);

const taskDefinition = new ecs.FargateTaskDefinition(this, `${id}TaskDef`, {
  executionRole: executionRole
});

How can I fix this issue?

Pankaj Jangid
  • The usual cause of this is placing your service in an isolated subnet, but you're not doing that. Are you sure the repository name and tag are correct? Try removing the security group altogether and use the `connections` prop of the service instead. – gshpychka Aug 31 '23 at 08:54
  • @gshpychka I removed the `securityGroups` prop from the FargateService instance. But what do you mean by the `connections` prop? I don't see one here - https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_ecs.FargateServiceProps.html – Pankaj Jangid Aug 31 '23 at 09:21
  • it's a property of the service: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_ecs.FargateService.html#connections – gshpychka Aug 31 '23 at 09:29
  • @gshpychka it is a read-only property. Maybe it only picks up values from `securityGroups`. BTW, I tried deploying with both `securityGroups` and `connections`. This time the diff showed the security group changes as expected; it implicitly picked up the required rules. But it still did not work. The error is the same. – Pankaj Jangid Aug 31 '23 at 10:09
  • Yes, it's a readonly property - you cannot overwrite it. Check the documentation for the type of the property to see examples. But yeah, it is not the cause of your issues if removing the security group didn't change anything. Do not include the changes suggested by ChatGPT - they are not necessary and might make the problem harder to track down. – gshpychka Aug 31 '23 at 10:25
  • @gshpychka thanks for sharing that. I read multiple answers and concluded that the ECS task must have access to the internet to fetch the Docker image. This can be done in multiple ways: a NAT, or assigning a public IP to the service. The latter is cheaper. So I just passed `assignPublicIp: true` in the service constructor props, and this solved the issue. It was my fault: when I refactored, I removed this prop thinking it was unnecessary. – Pankaj Jangid Aug 31 '23 at 11:16

1 Answer

With the help of @gshpychka, I could resolve this issue. Here is how I did it.

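The issue was the ECS task's inability to reach the internet to pull the Docker image from ECR. When I refactored the code, I inadvertently removed the assignPublicIp: true property from the Fargate service constructor, thinking it was unnecessary. Without it, assignPublicIp defaults to false, so the task in the public subnet had no public IP and its request to the ECR endpoint timed out.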

To resolve the issue, I added back the assignPublicIp: true inside the Fargate service constructor like so:

this.service = new ecs.FargateService(this, id, {
  cluster: props.cluster,
  taskDefinition,
  desiredCount: 1,
  assignPublicIp: true  // This line solved the issue
});

Adding this property gives the task a public IP, so it can reach ECR over the internet and pull the Docker image successfully.
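For anyone debugging the same timeout, the reasoning can be written down as a small check. This is only a sketch under stated assumptions: `SubnetKind`, `TaskNetworking`, and `canReachEcr` are hypothetical names made up for illustration, not part of aws-cdk-lib, and the model covers only the two options discussed in the comments (a NAT, or a public IP in a public subnet).

```typescript
// Hypothetical model of the networking facts that decide whether a Fargate
// task can reach the ECR API. None of these names come from aws-cdk-lib.
type SubnetKind = 'public' | 'privateWithNat' | 'isolated';

interface TaskNetworking {
  subnet: SubnetKind;
  assignPublicIp: boolean;
}

// Returns true when the task has an outbound path to the ECR endpoint.
// Simplification: VPC endpoints are ignored, since the thread does not
// discuss them.
function canReachEcr(n: TaskNetworking): boolean {
  if (n.subnet === 'privateWithNat') {
    // Outbound traffic leaves through the NAT; no public IP is needed.
    return true;
  }
  if (n.subnet === 'public') {
    // The internet gateway only serves tasks that have a public IP.
    return n.assignPublicIp;
  }
  // Isolated subnet: no route to the internet at all.
  return false;
}

// The refactored service: public subnet, but assignPublicIp was dropped.
const refactored = canReachEcr({ subnet: 'public', assignPublicIp: false });
// The fixed service: public subnet with assignPublicIp: true.
const fixed = canReachEcr({ subnet: 'public', assignPublicIp: true });
console.log({ refactored, fixed });
```

The first case matches the refactored code in the question and the second matches the fix above.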

PS: I also removed the vpcSubnets and securityGroups props; they were unnecessary. When assignPublicIp is set, CDK places the service in public subnets by default.

Pankaj Jangid