AWS ECS Fargate ResourceInitializationError: unable to pull secrets or registry auth

Question

I am trying to run a container from a private repository on the aws-ecs-fargate-1.4.0 platform.

For private repository authentication, I followed the docs and it was working well.

However, after updating the existing service many times, the task now fails to run and reports an error like:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to get registry auth from asm: service call has been retried 1 time(s): asm fetching secret from the service for <secretname>: RequestError: ...

I haven't changed the ecsTaskExecutionRole, and it contains all the policies required to fetch the secret value:

AmazonECSTaskExecutionRolePolicy
CloudWatchFullAccess
AmazonECSTaskExecutionRolePolicy
GetSecretValue
GetSSMParamters

This is probably related to the security group of your ECS service. Make sure your inbound rules are correct (protocol, port, ...) and that the outbound rules allow all traffic out (I got the error above because my outbound rule was set to a specific port) — Shams Larbi, Jun 07 '21 at 05:21
Are you trying to run a task in one of the default subnets (auto-created by AWS in the default VPC)? I suggest you state that explicitly in your question. This is a likely case of a subnet without a NAT gateway configured; the default subnets do not have one. — Anton Bryzgalov, Nov 16 '22 at 07:31
I am getting the same error, but randomly! I start an AWS Batch job with 78 array size, of which approx. ~30 jobs run through, and ~45 fail to even start. Increased no. of retries to 10, but still no luck. I'm completely perplexed how (a) AWS Batch does not work out of the box (despite Public IP), and (b) the non-reproducibility of this. Any hints? — KingOtto, May 16 '23 at 13:52

nathanpeck · Answer 1 · score 223 · 2021-03-25T15:51:27.307

AWS employee here.

What you are seeing is due to a change in how networking works between Fargate platform version 1.3.0, and Fargate platform version 1.4.0. As part of the change from using Docker to using containerd we also made some changes to how networking works. In version 1.3.0 and below each Fargate task got two network interfaces:

One network interface was used for the application traffic from your application container(s), as well as for logs and container image layer pulls.
A secondary network interface was used by the Fargate platform itself, to get ECR authentication credentials, and fetch secrets.

This secondary network interface had some downsides though. This secondary traffic did not show up in your VPC flow logs. Also while most traffic stayed in the customer VPC, the secondary network interface was sending traffic outside of your VPC. A number of customers complained that they did not have the ability to specify network level controls on this secondary network interface and what it was able to connect to.

To make the networking model less confusing and give customers more control, in Fargate platform version 1.4.0 we switched to a single network interface and now keep all traffic inside your VPC, even the Fargate platform traffic. The Fargate platform traffic for fetching ECR authentication and task secrets now uses the same task network interface as the rest of your task traffic, so you can observe this traffic in VPC flow logs and control it using the routing table in your own AWS VPC.

However, with this increased ability to observe and control the Fargate platform networking, you also become responsible for ensuring that there is actually a network path configured in your VPC that allows the task to communicate with ECR and AWS Secrets Manager.

There are a few ways to solve this:

Launch tasks into a public subnet, with a public IP address, so that they can communicate to ECR and other backing services using an internet gateway
Launch tasks in a private subnet that has a VPC routing table configured to route outbound traffic via a NAT gateway in a public subnet. This way the NAT gateway can open a connection to ECR on behalf of the task.
Launch tasks in a private subnet and make sure you have AWS PrivateLink endpoints configured in your VPC, for the services you need (ECR for image pull authentication, S3 for image layers, and AWS Secrets Manager for secrets).
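
As a rough illustration of the three options above, here is a minimal TypeScript sketch that models whether a task has some network path to ECR and Secrets Manager. The `TaskNetwork` type and `hasPathToAwsServices` function are hypothetical names made up for this example (they are not part of any AWS SDK), and the logic only encodes the three cases listed in this answer.

```typescript
// Hypothetical model of a Fargate task's network placement (not an AWS SDK type).
interface TaskNetwork {
  subnetRoutesToInternetGateway: boolean; // public subnet: default route points at an internet gateway
  assignPublicIp: boolean;                // the task ENI is given a public IP
  subnetRoutesToNatGateway: boolean;      // private subnet: default route points at a NAT gateway
  hasRequiredVpcEndpoints: boolean;       // PrivateLink endpoints for ECR, S3 and Secrets Manager exist
}

// Returns the option that gives the task a path out, or null if none applies.
function hasPathToAwsServices(n: TaskNetwork): string | null {
  if (n.subnetRoutesToInternetGateway && n.assignPublicIp) {
    return "public subnet + public IP";
  }
  if (n.subnetRoutesToNatGateway) {
    return "private subnet + NAT gateway";
  }
  if (n.hasRequiredVpcEndpoints) {
    return "private subnet + VPC endpoints";
  }
  // No path: on platform version 1.4.0 this is where the error would be expected.
  return null;
}
```

Note that this sketch deliberately ignores security group rules and DNS resolution, both of which can also block the path even when the routing is correct.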

You can read more about this change in this official blogpost, under the section "Task elastic network interface (ENI) now runs additional traffic flows"

https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/

edited Mar 25 '21 at 15:51

answered Mar 25 '21 at 15:44

nathanpeck

Thank you for the detailed explanation @nathanpeck, However we are facing the same issue in us-west-1 region today. We have verified that the task is running in public subnet, with public ip address. – aashitvyas Apr 09 '21 at 19:28
I have a NAT on my private subnet...no dice :( – Myles McDonnell Apr 13 '21 at 04:57
I ran into a similar error on a private subnet + NAT. In addition to making sure the NAT is setup correctly, you also need to make sure the role for the task can pull the secrets. These errors really need to show the full messages, otherwise it's hard to find the root cause (https://github.com/aws/containers-roadmap/issues/1133) – tonyc May 06 '21 at 01:09
It was missing NAT GW in my case - using private subnet. With internet GW alone it didn't work. Thanks for the hint. – Paul Lysak May 12 '21 at 10:00
@nathanpeck which option will cost less? – ian Jul 09 '21 at 00:34
@nathanpeck a) could the error message only come from the problem you described? b) do you have a guide regarding the PrivateLink setup in this regard? – lony Jul 23 '21 at 17:50
I am also getting the same error but it's saying about ACM ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried 1 time(s): failed to fetch secret arn:aws:secretsmanager:eu-west-1:xxxxxx...Note it was working before furthermore, I checked ACM secrets status is success and seeing no issue there – Ashish Karpe Aug 05 '21 at 04:23
I followed Nathan's instructions and assigned public IP's to my resources. Initially didn't help but turns out I needed to update the VPC routing table and add a final route that redirects "0.0.0.0/0" traffic to internet gateway. Now works fine. – e-mre Aug 23 '21 at 09:14
Heard of semantic versioning, AWS? This would make it version 2.0.0. What would be really nice would be for this to be configurable, for those weird snowflakes that think Fargate is exfiltrating data over DNS or some crap. – Phil Sep 09 '21 at 09:41
Hey @Phil, we considered whether to do a major version bump. However, the fundamentals of Fargate have not changed enough in this case to count as a breaking change. And in fact the vast majority of VPC configurations will work the same from 1.3.0 to 1.4.0. Most customer VPC's already have a route to ECR, S3, and Secrets Manager. This issue should only appear in specific locked down VPC configurations that limit access to only a specific set of whitelisted AWS services, rather than allowing general internet access (as the vast majority of VPC's do). – nathanpeck Sep 09 '21 at 16:26
"it only breaks a few instances" is most certainly a breaking change. It's not shades of grey. – Phil Sep 10 '21 at 01:12
I think the first option "Launch tasks into a public subnet, with a public IP address, so that they can communicate to ECR and other backing services using an internet gateway" is not really the best practice here, as it exposes the running tasks directly to public internet which means other hosts on the Internet can open connections to these tasks directly? Also we usually need put some network load balancer and firewall before these tasks? – Yang Liu Sep 22 '21 at 02:06
Fargate tasks that have public IP addresses can still have a security group that denies inbound traffic. Just because the task has a public IP address does not necessarily mean that the public will be able to initiate inbound communications to it, but it does allow the task to initiate outbound communications. Hope this addresses your concerns @YangLiu – nathanpeck Sep 22 '21 at 19:59
With this change, won't we be getting charged for network traffic related to the Fargate platform doing stuff? If so, why should I pay for network traffic related to code I don't control? – Gregory Ledray Sep 22 '21 at 20:42
Also I'd like to chip in and say this change has caused me a lot of grief and dozens of hours of time. I'm trying to run a secure, locked down system and the need to expose additional ports in my private subnets for traffic which only occurs at system startup bothers me a LOT. If something goes wrong and my instances are compromised I'll be on the hook and that really bothers me.I also find it shocking that AWS would allow a change which breaks existing networking setups. AWS networking is already a huge pain in the ass compared to GCP and this only makes it worse. – Gregory Ledray Sep 22 '21 at 20:49
This change does not require exposing any ports. You only need to allow outbound communications, not inbound communications. The security group can still block all inbound traffic on all ports if you wish. The only communications are outbound requests to AWS services that support Fargate. The network route to these services stays inside the AWS network. Hopefully this addresses the security concern @GregoryLedray – nathanpeck Oct 11 '21 at 20:17
"With this change, won't we be getting charged for network traffic related to the Fargate platform doing stuff? If so, why should I pay for network traffic related to code I don't control?" The most significant network charge you will see during Fargate task startup is the cost of downloading the container image. This is unchanged from 1.3.0 to 1.4.0 as that traffic still uses the same network interface it did before. The change is only for the auth token request and secrets, which are extremely small requests compared to Docker container images, which are often >100mb in size – nathanpeck Oct 11 '21 at 20:31
Thanks for responding @nathanpeck. I do understand now that these platform management requests should not have significant costs. After giving up and implementing directly on top of EC2 I got [a different error message](https://techsparx.com/software-development/aws/ec2/use-aws-cli.html) which led me to find out my [VPC's DNS has always been broken](https://serverfault.com/a/634556) and finally I realized that the reason I was having so much trouble was DNS was *sometimes* working. I.e. my grief was misdirected at Fargate when it really should have been directed at AWS VPC. Cheers. – Gregory Ledray Oct 12 '21 at 13:58
The issue may only be affecting "old" VPCs, e.g., those created before 2021 - my guess. I set up a new VPC and multiple Fargate tasks with no issues. Then, in an identical "old" VPC, a Fargate task required enabling public IP address to pull its image from the same ECR. – SVUser Mar 15 '22 at 15:39
Any chance someone has recommendation for how this should work in a cross account fetch? ( ECS in accountA wanting to pull ECR in accountB ). Of course the aws blog chose to launch into public IP subnet. https://aws.amazon.com/blogs/containers/sharing-amazon-ecr-repositories-with-multiple-accounts-using-aws-organizations/ – Quincy Mar 21 '22 at 14:07
I think I have to go with this direction... "Launch tasks in a private subnet that has a VPC routing table configured to route outbound traffic via a NAT gateway in a public subnet." ...but not sure how to approach with multiple accounts. vpc endpoints don't support cross account (https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html) – Quincy Mar 21 '22 at 15:20
i have a very strange issue, my tasks won't spin up the second time i try to start them. https://stackoverflow.com/questions/71619910/aws-fargate-tasks-wont-start-reliably if you have any insights on what's going on i would love to hear it! – F.H. Mar 25 '22 at 16:09
SO: Option 1: Insecure. Option 2: Costs 38$/month. Option 3: Wont work for multiple account setups. – Gal Silberman Feb 27 '23 at 15:35
The communication and rollout of this change, as well as the provided solutions, are outrageous. It is a pure reflection of why no one wants to work at Amazon. The only learning from this is to never set a version tag to `LATEST` - always use a specific version to prevent AWS from breaking your setup. – bkr879 Aug 01 '23 at 00:32

score 41 · Answer 2 · answered Apr 30 '20 at 18:24

I'm not completely sure about your setup but after I disabled the NAT-Gateways to save some $, I had a very similar error message on the aws-ecs-fargate-1.4.0 platform:

Stopped reason: ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://api.ecr....

It turned out that I had to create VPC endpoints for these service names:

com.amazonaws.REGION.s3
com.amazonaws.REGION.ecr.dkr
com.amazonaws.REGION.ecr.api
com.amazonaws.REGION.logs
com.amazonaws.REGION.ssm

And I had to downgrade to the aws-ecs-fargate-1.3.0 platform. After the downgrade the Docker images could be pulled from ECR and the deployments succeeded again.

If you are using Secrets Manager without a NAT gateway, you may also have to create a VPC endpoint for com.amazonaws.REGION.secretsmanager.
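
To avoid typos when substituting REGION, the service names above can be generated. This is just a small sketch: the `ENDPOINT_SERVICES` constant and `endpointServiceNames` function are hypothetical helpers, and the list simply mirrors the services named in this answer plus Secrets Manager.

```typescript
// Service suffixes taken from the list in this answer, plus Secrets Manager.
const ENDPOINT_SERVICES: string[] = ["s3", "ecr.dkr", "ecr.api", "logs", "ssm", "secretsmanager"];

// Builds the VPC endpoint service names for a region such as "eu-west-1".
function endpointServiceNames(region: string): string[] {
  return ENDPOINT_SERVICES.map((svc) => `com.amazonaws.${region}.${svc}`);
}

// Print one service name per line for the chosen region.
console.log(endpointServiceNames("eu-west-1").join("\n"));
```

Which of these endpoints you actually need depends on what your task uses, so treat the list as a starting point rather than a requirement.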

Robert Reiz

for me it was enough to add an endpoint for ecr.api – dimisjim Sep 24 '20 at 07:06
I wouldn't consider a downgrade to a previous platform reasonable advice to getting this working. I feel this answer is also not clearly distinguishing the access to the secrets API and other issues. In my case granting the IAM privilege `secretsmanager:GetSecretValue`, along with opening up network access, especially as the ECR I'm trying to reach is in another account, were the keys to solving the issue. – JinnKo Nov 03 '20 at 12:10
Well, sometimes the newest Platform version is a bit buggy. Using the second latest version of something, many times makes sense because the second latest is more mature. In the meanwhile, I upgraded to version 1.4.0 and it works fine. – Robert Reiz Nov 06 '20 at 07:20
I cannot agree with your position @RobertReiz. Sure, sometimes versions contain bugs; but they are extensively tested, and the release date also gives you an indication about the maturity -- not just the position in the version history. Also, if it is working for you with 1.4.0 now this rules out that the platform version was the issue for you in the beginning; because if it was due to a bug, you would now be using 1.4.1 or higher as I am sure they are using semantic versioning. I think you should remove that aspect from your answer, as it appears to be not relevant. – Richard Kiefer Mar 17 '21 at 09:21
Downgrading is one way. AWS seem to insist on not using semantic versioning, and 1.4.0 is actually a breaking change over 1.3.0 in that in 1.4.0 a whole pile of service traffic also goes over the ENI and out your VPC, while in 1.3.0 that went out somewhere in AWSland where connectivity is managed for you. See my answer for more details. – Phil Sep 09 '21 at 09:50
this worked for me (Fargate 1.0.0 for Windows). Using Terraform I provisioned an "aws_vpc_endpoint" for each, and attached them to my private subnets. – MrJedi2U Jul 05 '22 at 05:53

score 30 · Answer 3 · answered Mar 09 '21 at 11:21

If you are using a public subnet and select "Don't assign public address", this error can happen.

The same is applicable if you have a private subnet and do not have an internet gateway or NAT gateway in your VPC. It needs a route to the internet.

This behaviour is the same across the AWS ecosystem. It would be great if AWS displayed a large warning banner in such cases.

Sairam

Bear in mind that this leaves your container open to access from places other what you might have intended ( your load balancer for example). My logs would show requests by random IPs across the globe that seem like they're automated bots looking for vulnerabilities. – Ash Jul 03 '23 at 05:34
Yes, agreed, a public subnet is not good practice. The answer points out what someone is doing wrong, but it is definitely not an indication of good practice. A public subnet and public network are not recommended for anyone, unless someone wants to host a honeypot. – Sairam Jul 05 '23 at 03:42

score 20 · Answer 4 · answered Apr 18 '20 at 17:51

Ensure internet connectivity via either an internet gateway (IGW) or a NAT gateway. If you are using an IGW, make sure the public IP is enabled in the Fargate task/service network configuration:

{
  "awsvpcConfiguration": {
    "subnets": ["string", ...],
    "securityGroups": ["string", ...],
    "assignPublicIp": "ENABLED"|"DISABLED"
  }
}

Mangal

Even though it doesn't complain if securityGroups is empty, I had to add one in order to resolve this error. – r590 Mar 17 '21 at 18:53
That was the answer. Using a non-public service will not be able to reach the image. – mmoreram Apr 11 '21 at 15:10

score 12 · Answer 5 · answered Apr 18 '20 at 16:58

This error occurs when the Fargate agent fails to create or bootstrap the resources required to start the container or the task it belongs to. It only occurs on platform version 1.4 or later, most likely because version 1.4 uses the task ENI (which is in your VPC) instead of the Fargate ENI (which is in AWS's VPC). I think it might be caused by extra IAM permissions being needed to pull the image from ECR. Are you using any PrivateLink endpoints? If so, you might want to take a look at the policies for the ECR endpoint.

I'll try to replicate it, but I'd suggest opening a support ticket with AWS if you can, so they can take a closer look at your resources and give better suggestions.

score 10 · Answer 6 · answered Jul 21 '20 at 14:09

Since the ECS agent in Fargate platform version 1.4.0 uses the task ENI to retrieve information, the request to Secrets Manager goes through this ENI.

You must ensure that traffic to the Secrets Manager API (secretsmanager.{region}.amazonaws.com) is allowed:

If your task is private, you must have either a VPC endpoint (com.amazonaws.{region}.secretsmanager) or a NAT gateway, and the task ENI's security group must allow outbound HTTPS traffic to it.
If your task is public, the security group must allow outbound HTTPS traffic to the outside (or to the AWS public CIDRs).
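
A quick way to reason about the "allow HTTPS outbound" requirement is to check a list of egress rules for anything that covers TCP port 443. The sketch below uses a hypothetical, simplified `EgressRule` shape (not an AWS SDK type) and, as a stated simplification, ignores destination CIDRs.

```typescript
// Hypothetical, simplified egress rule shape (loosely mirrors security group rule fields).
interface EgressRule {
  protocol: string; // "-1" means all protocols
  fromPort: number;
  toPort: number;
}

// True if at least one rule permits outbound TCP 443 (HTTPS).
// Destination CIDRs are ignored here to keep the sketch short.
function allowsHttpsOut(rules: EgressRule[]): boolean {
  return rules.some(
    (r) => r.protocol === "-1" || (r.protocol === "tcp" && r.fromPort <= 443 && 443 <= r.toPort)
  );
}
```

For example, a group whose only egress rule is TCP 5000 would fail this check, which matches the situation several commenters describe where outbound traffic was limited to an application port.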

You'll also need to make sure that ENI is allowed to do DNS to resolve the endpoint - if you're using AmazonProvidedDNS then this will be fine, but if you're using your own, then you need to adjust your security group rule accordingly. See my answer for further details. — Phil, Sep 09 '21 at 09:48

score 9 · Answer 7 · answered Nov 10 '21 at 14:30

I got this problem after translating my CloudFormation file to a Terraform file.

After struggling, I found out that I was missing an outbound rule in my Fargate security group. AWS automatically creates an "ALLOW ALL" egress rule, but Terraform removes it. You need to add one to your aws_security_group:

resource "aws_security_group" "example" {
  # ... other configuration ...

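  # Re-create the default "allow all" egress rule that Terraform removes.
  # Block syntax is used here; the attribute form (egress = [...]) requires
  # every rule attribute to be set explicitly.
  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }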
}

You can check the doc here.

score 8 · Answer 8 · answered Mar 16 '22 at 18:51

In my case, I tried all of the above solutions and none seemed to work. It was a very simple mistake, but one that others might find useful if none of the other answers work for you.

The valueFrom in the containerDefinitions portion of the task definition JSON file needs :: at the end of the value.

I.e., in my case the incorrect format was:

{
  "containerDefinitions": [{
    "secrets": [{
      "name": "MY_SECRET",
      "valueFrom": "arn:aws:secretsmanager:<region>:<aws_account_id>:secret:<sm_resource_name>:MY_SECRET"
    }]
  }]
}

The correct format was:

{
  "containerDefinitions": [{
    "secrets": [{
      "name": "MY_SECRET",
      "valueFrom": "arn:aws:secretsmanager:<region>:<aws_account_id>:secret:<sm_resource_name>:MY_SECRET::"
    }]
  }]
}

Note the extra :: at the end of the correct valueFrom.
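
For what it's worth, my understanding is that a valueFrom that names a JSON key must carry all of the trailing fields (json-key, version-stage, version-id), even when the last two are empty, which is where the :: comes from. Here is a small sketch that pads a value accordingly. `normalizeValueFrom` is a hypothetical helper, and the 10-field layout is my reading of the ECS documentation, so double-check it against the docs for your case.

```typescript
// Assumed full layout (10 colon-separated fields):
// arn:aws:secretsmanager:<region>:<account>:secret:<name>:<json-key>:<version-stage>:<version-id>
const FULL_FIELD_COUNT = 10;
// A plain secret ARN without a JSON key has 7 fields and is left alone.
const PLAIN_ARN_FIELD_COUNT = 7;

function normalizeValueFrom(valueFrom: string): string {
  const fields = valueFrom.split(":");
  // Plain ARNs (or anything shorter) and already-complete values are returned unchanged.
  if (fields.length <= PLAIN_ARN_FIELD_COUNT || fields.length >= FULL_FIELD_COUNT) {
    return valueFrom;
  }
  // A JSON key was given without the trailing fields: pad with empty ones.
  return valueFrom + ":".repeat(FULL_FIELD_COUNT - fields.length);
}
```

This does not validate that the value is a real ARN (a hardcoded string would pass through untouched); it only fixes the specific missing-colons mistake described in this answer.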

This can happen if you tried to deploy some temporary stack and tried to hardcode one of the secrets. In that case you would have `"valueFrom": "TEST123"` instead of an ARN of a secret. — mvd, Sep 10 '22 at 00:01

score 6 · Answer 9 · answered Nov 13 '21 at 15:56

Go to Task Definitions > Update Task Definition. In the Task Role dropdown select ecsTaskExecutionRole.

You need to modify this ecsTaskExecutionRole in IAM settings to include the following permissions:

SecretsManagerReadWrite
CloudWatchFullAccess
AmazonSSMFullAccess
AmazonECSTaskExecutionRolePolicy

Then create your new task definition and it should work.

Reference: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/specifying-sensitive-data-parameters.html

You saved the day, after hours of searching – Christian Apr 20 '23 at 10:50

Zags · Answer 10 · score 5 · 2022-03-28T20:30:08.703

The service's security group needs outbound access on port 443 (outbound access on all ports will work for this). Without this, it can't access Secrets Manager.

edited Mar 28 '22 at 20:30

answered Aug 20 '21 at 19:08

Zags

The strange thing is it only works when I allow INBOUND access on port 443!!! I'm using NAT Gateway to allow internet access, but why would it needs inbound 443 port access where I serve the app over port 5000 ! – Amer Sawan Dec 22 '21 at 12:53

score 5 · Answer 11 · answered Dec 15 '21 at 00:46

I had to auto-assign public IP.

To do so from the console, when running the task, I had to select "ENABLED" for "Auto-assign public IP".

Joshua Wolff

This solved my issue! As of Jan 2023, the new ECS UI has Public IP disabled by default. You have to override it under Networking tab when manually triggering a task. – S.S. Jan 23 '23 at 21:17
How safe of an option is this? Does providing a public IP for a cluster a bad idea, security wise? – dingo Jan 31 '23 at 14:20
Bear in mind that this leaves your container open to access from places other what you might have intended ( your load balancer for example). My logs would show requests by random IPs across the globe that seem from automated bots looking for vulnerabilities. – Ash Jul 03 '23 at 05:37

score 5 · Answer 12 · answered Sep 09 '22 at 08:12

Your problem may be that you didn't assign a public IP to your cluster.

Enable it while creating a task on the cluster:

Auto-assign public IP = TRUE

Rajitha Bhanuka

I have to set `assign_public_ip = true`, My Fargate is running on default VPC – vanduc1102 Sep 23 '22 at 16:18

Phil · Answer 13 · 2021-09-09T09:43:36.860

This has burned me sufficiently well today that I figured I'd share my experience, since it differs from most all the above (AWS Employee's answer covers it technically, but doesn't spell the problem out).

If all the following are true:

You're running platform 1.4.0 (or, newer presumably - at the time of writing, 1.4.0 is the latest)
You're in a VPC environment
Your VPC, for "reasons", runs its own DNS (i.e. not at VPC_BASE+2)
For "reasons", you don't allow all outbound traffic, so you're setting egress rules on your task security group

And consequently, you have endpoints for all the things, then the following must also be true:

Your homegrown DNS will need to be able to correctly resolve the private addresses of the endpoints (for instance, using VPC_BASE+2, but how doesn't matter)
You will also need to make sure your task security group has rules allowing DNS traffic to your DNS server(s) <-- This one burned me.

To add insult to the injury, what little error information you get out of Fargate doesn't really indicate that you have a DNS issue, and naturally your CloudTrails won't show a damn thing either, since nothing ends up hitting the API to start with.
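
In case "VPC_BASE+2" is unfamiliar: it is the base address of the VPC's IPv4 range plus two, which is where the Amazon-provided DNS resolver listens. A small sketch of the arithmetic follows; `vpcBasePlusTwo` is a hypothetical helper, and it assumes the CIDR you pass in starts at the network's base address.

```typescript
// Computes the "VPC base + 2" address for an IPv4 CIDR such as "10.0.0.0/16".
// Assumes the address part of the CIDR is the network's base address.
function vpcBasePlusTwo(cidr: string): string {
  const [base] = cidr.split("/");
  const octets = base.split(".").map(Number);
  // Pack the four octets into one unsigned 32-bit integer, then add 2.
  const asInt = ((octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]) >>> 0;
  const plusTwo = (asInt + 2) >>> 0;
  // Unpack back into dotted-quad notation.
  return [24, 16, 8, 0].map((shift) => (plusTwo >>> shift) & 255).join(".");
}
```

If you forward your own DNS to that address, the task security group still needs egress rules for DNS to whichever server the task actually queries (typically UDP and TCP port 53).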

omg you saved me, thank you! – Ievgen Goichuk Jun 07 '23 at 17:36

score 4 · Answer 14 · answered Apr 25 '22 at 13:42

This is most likely due to an outbound restriction in your security groups (in the case of a public subnet).

Opening the required outbound TCP port (443 for HTTPS, per the other answers here) should resolve the error:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth

Liam · Answer 15 · score 4 · 2023-03-10T16:53:52.653

The new VPC connection map helps a lot with this issue. Ensure that your public subnets have a route to the internet gateway, with at least one subnet able to connect to it.

If they don't, you will need to change or add a routing table.

You need this to ensure that ECS can pull the image from the public URL.

edited Mar 10 '23 at 16:53

answered Mar 10 '23 at 16:10

Liam

score 3 · Answer 16 · answered Jul 14 '20 at 14:28

I was having the exact same issue using Fargate as the launch type with platform version 1.4.0. In the end, since I was using public subnets, all I needed to do was enable the assignment of a public IP to the tasks, so that they had outbound network access to pull the image.

I got the hint to solve it when I tried to create the service using platform version 1.3.0 and the task creation failed with a similar, but fortunately documented, error.

score 3 · Answer 17 · answered Jan 26 '22 at 01:57

How to do "Launch tasks in a private subnet that has a VPC routing table configured to route outbound traffic via a NAT gateway in a public subnet. This way the NAT gateway can open a connection to ECR on behalf of the task" :

Assumptions of this solution:

You have docker image in ECR repository
You have an IAM role with the permissions, AmazonECSTaskExecutionRolePolicy
You also want your task to use the same IP address. I have marked this optional if you do not need this part.

Solution:

1. Create new cluster
- AWS > ECS > Clusters > Create cluster > Networking only > check box to create VPC > Create
2. Create new task definition
- AWS > ECS > Task Definitions > Create new task definition > Fargate
  - Add container > Image* field should contain Image URI from ECR
3. Create Elastic IP address (OPTIONAL, ONLY IF YOU WANT CONSISTENT IP OUTPUT, LIKE IF USING PROXY SERVICE)
- AWS > VPC > Elastic IPs > Allocate Elastic IP address > Create
- Whitelist this IP on whatever service Fargate is going to try and access
4. Create NAT gateway
- AWS > VPC > NAT Gateways > Create NAT gateway
  - Choose auto-created subnet
  - Connectivity type: Public
  - ^Since you made it public on a subnet this is what is meant by "NAT gateway in a public subnet"
  - (OPTIONAL) Select Elastic IP from dropdown
5. Route public subnets to use internet gateway
- AWS > VPC > Route tables > find one w/ public subnets auto-created in step 1 > click on Route table ID > Edit routes > Add route > Destination is 0.0.0.0/0, Target is igw-{internet-gateway-autocreated-in-step-1}
- ^This is what allows the VPC to actually access the internet at all
6. Create subnet
- AWS > VPC > Subnets > Create subnet > select auto-created VPC in step 1, for IPv4 if you're confused just put 10.0.0.0/24 > Add new subnet
7. Route newly created subnet (in step 6) to use NAT
- AWS > VPC > Route tables > find one w/ subnet created in step 6 > click on Route table ID > Edit routes > Add route > Destination: 0.0.0.0/0, Target: nat-{nat-gateway-created-in-step-4}
- ^This is what is meant by "private subnet that has a VPC routing table configured to route outbound traffic via a NAT gateway"
8. Run the Fargate task
- AWS > ECS > Clusters > your cluster > Run new Task
- Launch type: Fargate
- Task definition: your task
- Cluster: your cluster
- Cluster VPC: your VPC
- Subnet: subnet you created, NOT the auto-created ones
- Auto-assign public IP: this depends on if you are using an Elastic IP. If you did do that, then this should be disabled. If you did not allocate an Elastic IP address, then this should be enabled.
- Run task
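
To sanity-check the two routing steps above, it can help to look at where each subnet's default route points. The sketch below classifies a subnet from its routes; the `Route` shape, the `SubnetKind` labels and `classifySubnet` are hypothetical names for this illustration, and the only assumption taken from the steps is that internet gateway targets start with igw- and NAT gateway targets start with nat-.

```typescript
// Hypothetical minimal route shape: destination CIDR and target resource ID.
interface Route {
  destination: string;
  target: string; // e.g. "igw-...", "nat-..." or "local"
}

type SubnetKind = "public" | "private-with-nat" | "isolated";

// Classifies a subnet by where its default route (0.0.0.0/0) points.
function classifySubnet(routes: Route[]): SubnetKind {
  const defaultRoute = routes.find((r) => r.destination === "0.0.0.0/0");
  if (defaultRoute === undefined) return "isolated";
  if (defaultRoute.target.startsWith("igw-")) return "public";
  if (defaultRoute.target.startsWith("nat-")) return "private-with-nat";
  return "isolated";
}
```

With this setup, the subnet the task runs in should come out as "private-with-nat", and the subnet holding the NAT gateway should come out as "public".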

score 3 · Answer 18 · answered Oct 24 '22 at 15:44

I solved this by setting "Assign public IP" = ENABLED in my job definition.

Ref: AWS Batch Timeout connecting to ECR

Kevin Liu

score 2 · Answer 19 · answered May 06 '20 at 07:41

I resolved a similar problem by updating the rules in the ECS service's security group. The rules configuration is below.

Inbound Rules:
* HTTP          TCP   80    0.0.0.0/0
Outbound Rules:
* All traffic   All   All   0.0.0.0/0

torm

This one solved my issue. I am not sure why my outbound rule was gone though. Maybe I have deleted it by accident. Thanks man! – Mike Rayco Apr 12 '22 at 06:27
Setting inbound rules to allow all is probably a bad idea – Ramon G. Feb 15 '23 at 19:09

Yang Liu · Answer 20 · 2021-10-15T01:27:01.363

I had this issue, and eventually sorted it out.

My solution below is to:

Set up the ECS in private subnet
Add AWS PrivateLink endpoints in VPC

I'm posting my CDK code here for reference. I pasted some documentation links in the function comments to help you better understand their purpose.

This is the EcsStack:

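// Imports were not shown in the original answer; these assume CDK v1 module paths
// (for CDK v2, import from 'aws-cdk-lib' and its submodules instead).
import * as cdk from '@aws-cdk/core';
import { Stack, StackProps } from '@aws-cdk/core';
import * as ec2 from '@aws-cdk/aws-ec2';
import * as ecs from '@aws-cdk/aws-ecs';
import * as ecs_patterns from '@aws-cdk/aws-ecs-patterns';
import * as iam from '@aws-cdk/aws-iam';

// Assumed shape of the props; the original answer did not include this definition.
interface EcsStackProps extends StackProps {
    vpc: ec2.IVpc;
}

export class EcsStack extends Stack {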

    constructor(scope: cdk.App, id: string, props: EcsStackProps) {
        super(scope, id, props);
        this.createOrderServiceCluster(props.vpc);
    }

    private createOrderServiceCluster(serviceVpc:ec2.IVpc) {
        const ecsClusterName = "EcsClusterOfOrderService";

        const OrderServiceCluster = new ecs.Cluster(this, ecsClusterName, {
          vpc: serviceVpc,
          clusterName: ecsClusterName
        });

        // ApplicationLoadBalancedFargateService currently just picks a random private subnet.
        // https://github.com/aws/aws-cdk/issues/8621
        new ecs_patterns.ApplicationLoadBalancedFargateService(this, "FargateOfOrderService", {
          cluster: OrderServiceCluster, // Required
          cpu: 512, // Default is 256
          desiredCount: 1, // Default is 1
          taskImageOptions: { 
            image: ecs.ContainerImage.fromRegistry("12345.dkr.ecr.us-east-1.amazonaws.com/comics:user-service"),
            taskRole: this.createEcsTaskRole(),
            executionRole: this.createEcsExecutionRole(),
            containerPort: 8080
          },
          memoryLimitMiB: 2048, // Default is 512
          // creates a public-facing load balancer that we will be able to call 
          // from curl or our web browser. This load balancer will forward calls 
          // to our container on port 8080 running inside of our ECS service.
          publicLoadBalancer: true // Default is false
        });
    }

    /**
     * This IAM role is the set of permissions provided to the ECS Service Team to execute ECS Tasks on your behalf.
     * It is NOT the permissions your application will have while executing.
     * https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html
     * @private
     */
    private createEcsExecutionRole() : iam.IRole {
        const ecsExecutionRole = new iam.Role(this, 'EcsExecutionRole', {
            //assumedBy: new iam.ServicePrincipal(ecsTasksServicePrincipal),
            assumedBy: new iam.ServicePrincipal("ecs-tasks.amazonaws.com"),
            roleName: "EcsExecutionRole",
        });
        ecsExecutionRole.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryReadOnly'));
        ecsExecutionRole.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('CloudWatchLogsFullAccess'));
        return ecsExecutionRole;
    }


    /**
     * Creates the IAM role (with all the required permissions) which will be used by the ECS tasks.
     * https://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html
     * @private
     */
    private createEcsTaskRole(): iam.IRole {
        const ecsTaskRole = new iam.Role(this, 'OrderServiceEcsTaskRole', {
            //assumedBy: new iam.ServicePrincipal(ecsTasksServicePrincipal),
            assumedBy: new iam.ServicePrincipal("ecs-tasks.amazonaws.com"),
            roleName: "OrderServiceEcsTaskRole",
        });

        ecsTaskRole.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryReadOnly'));
        ecsTaskRole.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('CloudWatchLogsFullAccess'));
        ecsTaskRole.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonS3ReadOnlyAccess'));

        return ecsTaskRole;
    }

}

This is the code snippet of the VpcStack:

export class VpcStack extends Stack {
    readonly coreVpc : ec2.Vpc;
    constructor(scope: cdk.App, id: string) {
        super(scope, id);

        this.coreVpc = new ec2.Vpc(this, "CoreVpc", {
            cidr: '10.0.0.0/16',
            natGateways: 1,
            enableDnsHostnames: true,
            enableDnsSupport: true,
            maxAzs: 3,
            subnetConfiguration: [
            {
              cidrMask: 28,
              name: 'Public',
              subnetType: ec2.SubnetType.PUBLIC,
            },
            {
              cidrMask: 24,
              name: 'Private',
              subnetType: ec2.SubnetType.PRIVATE,
            }
          ]
        });
   
        this.setupInterfaceVpcEndpoints();
    }



    /**
     * Builds VPC endpoints to access AWS services without using NAT Gateway.
     * @private
     */
    private setupInterfaceVpcEndpoints(): void {
        // Allow ECS to pull Docker images without using NAT Gateway
        // https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html
        this.addInterfaceEndpoint("ECRDockerEndpoint", ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER);
        this.addInterfaceEndpoint("ECREndpoint", ec2.InterfaceVpcEndpointAwsService.ECR);
        this.addInterfaceEndpoint("SecretManagerEndpoint", ec2.InterfaceVpcEndpointAwsService.SECRETS_MANAGER);
        this.addInterfaceEndpoint("CloudWatchEndpoint", ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH);
        this.addInterfaceEndpoint("CloudWatchLogsEndpoint", ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH_LOGS);
        this.addInterfaceEndpoint("CloudWatchEventsEndpoint", ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH_EVENTS);
        this.addInterfaceEndpoint("SSMEndpoint", ec2.InterfaceVpcEndpointAwsService.SSM);
    }

    private addInterfaceEndpoint(name: string, awsService: ec2.InterfaceVpcEndpointAwsService): void {
        const endpoint: ec2.InterfaceVpcEndpoint = this.coreVpc.addInterfaceEndpoint(`${name}`, {
            service: awsService
        });

        endpoint.connections.allowFrom(ec2.Peer.ipv4(this.coreVpc.vpcCidrBlock), endpoint.connections.defaultPort!);
    }
}

score 2 · Answer 21 · answered Mar 20 '23 at 13:56

In my case, I assigned a public IP to the job definition as follows:
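
A minimal sketch of what such a setting can look like, assuming an AWS Batch job definition that runs on Fargate. This is an illustration rather than the answerer's own snippet: the interfaces below are hypothetical, trimmed-down types (not imported from any SDK), and the name, image, and role ARN are placeholders. A real job definition also needs fields omitted here, such as resource requirements.

```typescript
// Hypothetical, trimmed-down types modelling only the job definition fields
// that matter for this error; they are not imported from any SDK.
type PublicIpSetting = "ENABLED" | "DISABLED";

interface FargateContainerProperties {
  image: string;
  executionRoleArn: string;
  networkConfiguration: { assignPublicIp: PublicIpSetting };
  fargatePlatformConfiguration: { platformVersion: string };
}

interface FargateJobDefinition {
  jobDefinitionName: string;
  type: "container";
  platformCapabilities: ["FARGATE"];
  containerProperties: FargateContainerProperties;
}

const jobDefinition: FargateJobDefinition = {
  jobDefinitionName: "example-job", // placeholder
  type: "container",
  platformCapabilities: ["FARGATE"],
  containerProperties: {
    image: "<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>", // placeholder
    executionRoleArn: "arn:aws:iam::<account>:role/ecsTaskExecutionRole", // placeholder
    // Without a NAT gateway or VPC endpoints, the task needs a public IP
    // to reach ECR and Secrets Manager from a public subnet.
    networkConfiguration: { assignPublicIp: "ENABLED" },
    fargatePlatformConfiguration: { platformVersion: "1.4.0" },
  },
};
```

The public IP only helps when the subnet routes to an internet gateway; in a private subnet a NAT gateway or VPC endpoints are needed instead.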

Ilyas

score 1 · Answer 22 · answered Nov 24 '21 at 18:43

If you are placing the tasks in a private subnet, you might need to add inbound and outbound rules to the associated network ACL to allow the traffic.

Juan Cruz

score 1 · Answer 23 · answered Aug 07 '23 at 07:18

I had the same issue because I hadn't turned on the public IP. After I turned it on, my service deployed smoothly.
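
For an ECS service or a standalone task, the setting lives in the `awsvpcConfiguration` part of the network configuration. The sketch below builds that object; `buildNetworkConfiguration` is a hypothetical helper and the subnet and security group IDs are placeholders.

```typescript
// Shape of the awsvpc network configuration accepted by the ECS
// RunTask / CreateService APIs (only the fields used here).
interface AwsvpcConfiguration {
  subnets: string[];
  securityGroups: string[];
  assignPublicIp: "ENABLED" | "DISABLED";
}

// Hypothetical helper: turns a boolean into the ENABLED/DISABLED value.
function buildNetworkConfiguration(
  subnets: string[],
  securityGroups: string[],
  publicIp: boolean,
): { awsvpcConfiguration: AwsvpcConfiguration } {
  return {
    awsvpcConfiguration: {
      subnets,
      securityGroups,
      assignPublicIp: publicIp ? "ENABLED" : "DISABLED",
    },
  };
}

const networkConfiguration = buildNetworkConfiguration(
  ["subnet-0123456789abcdef0"], // placeholder public subnet
  ["sg-0123456789abcdef0"], // placeholder security group
  true,
);
```

In CDK the corresponding switch is the `assignPublicIp` property of the Fargate service, and the console exposes it as a public IP option in the networking section when you create the service or run the task.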

Bigto

Thank you for your interest in contributing to the Stack Overflow community. This question already has quite a few answers—including one that has been extensively validated by the community. Are you certain your approach hasn’t been given previously? **If so, it would be useful to explain how your approach is different, under what circumstances your approach might be preferred, and/or why you think the previous answers aren’t sufficient.** Can you kindly [edit] your answer to offer an explanation? – Jeremy Caney Aug 10 '23 at 03:32
@Bigto, may I ask where this toggle is located? – Master Aug 22 '23 at 15:23

score 0 · Answer 24 · answered Sep 03 '21 at 06:46

For me it was a combination of not having the SecretsManagerReadWrite policy attached to my IAM role (thanks Jinkko) and not having a public IP enabled on the compute instance (needed to reach the ECR repo).

Ben

score 0 · Answer 25 · answered Nov 28 '21 at 17:38

In the ecsTaskExecutionRole => ECS-SecretsManager-Permission policy, make sure your region-specific secret is added with the correct access level. In a multi-region setup where the secret was created in one region and then cloned to another, you still have to add the cloned secret to ecsTaskExecutionRole => ECS-SecretsManager-Permission to make it accessible to ECS in that region.
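
As an illustration of the multi-region point, the sketch below builds one policy statement that lists the secret's ARN pattern for every region it lives in. The helper names, the account ID, and the secret name are hypothetical, and the `aws` partition is assumed. Secrets Manager appends a hyphen and six random characters to the secret name in the ARN, which is why the pattern ends in `-??????`.

```typescript
// Minimal shape of an IAM policy statement (only the fields used here).
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string[];
}

// Hypothetical helper: ARN pattern for a secret in one region.
function secretArnPattern(region: string, accountId: string, secretName: string): string {
  return `arn:aws:secretsmanager:${region}:${accountId}:secret:${secretName}-??????`;
}

// Hypothetical helper: one statement covering the secret in every listed region.
function secretsStatement(regions: string[], accountId: string, secretName: string): PolicyStatement {
  return {
    Effect: "Allow",
    Action: ["secretsmanager:GetSecretValue"],
    Resource: regions.map((region) => secretArnPattern(region, accountId, secretName)),
  };
}

const statement = secretsStatement(
  ["us-east-1", "eu-west-1"], // regions where the secret exists
  "123456789012", // placeholder account ID
  "registry-credentials", // placeholder secret name
);
```

If only the first region were listed, a task started in `eu-west-1` would be denied access to the copy in its own region.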

score 0 · Answer 26 · answered Mar 01 '22 at 18:43

I have a VPC with public and private subnets, with a NAT gateway serving the private subnets. For the task to access secrets, the service had to be launched in the private subnets. In my setup, secret retrieval did not work in the public subnets unless VPC endpoints were set up. It works fine in the private subnets on Fargate platform version 1.4.
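
Taken together with the other answers, the networking rule can be sketched as a small decision function. This is a simplified model, not an AWS API: `TaskNetworking` and `canReachAwsApis` are hypothetical names, the NAT gateway is assumed to be a public one, and security groups and network ACLs are assumed to allow HTTPS.

```typescript
// Hypothetical model of the networking a Fargate task is launched with.
interface TaskNetworking {
  // Where the subnet's default route points.
  subnetRoutesTo: "internet-gateway" | "nat-gateway" | "none";
  assignPublicIp: boolean;
  // True when the VPC endpoints the task needs (ECR, S3, Secrets Manager, ...) exist.
  hasRequiredVpcEndpoints: boolean;
}

// Returns true when the task can reach ECR and Secrets Manager at startup.
function canReachAwsApis(networking: TaskNetworking): boolean {
  if (networking.hasRequiredVpcEndpoints) {
    return true; // traffic stays inside the VPC
  }
  if (networking.subnetRoutesTo === "internet-gateway") {
    // Public subnet: the internet gateway only serves interfaces with a public IP.
    return networking.assignPublicIp;
  }
  if (networking.subnetRoutesTo === "nat-gateway") {
    return true; // private subnet with outbound access through NAT
  }
  return false; // isolated subnet without endpoints
}
```

Under this model, a task in a public subnet without a public IP and without endpoints fails, which matches the behaviour described above, while the same task in a private subnet behind a NAT gateway succeeds.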

score 0 · Answer 27 · answered Mar 04 '22 at 00:57

My problem was that the NAT gateway I had set up for my private subnet was incorrectly configured as a private NAT gateway. Oops. Changing it to a public NAT gateway and updating the route tables resolved my problem.

Kurru

score 0 · Answer 28 · answered Apr 21 '22 at 07:01

After checking everything on this AWS support page: https://aws.amazon.com/premiumsupport/knowledge-center/ecs-unable-to-pull-secrets/ and the other popular answers here, one more thing to check is that your secret that is being retrieved actually has a value set.

When using Secrets Manager, if your ECS Task is attempting to retrieve a secret that has been created but does not have a value set, then you will also receive this kind of error.

Setting a value for the secret will resolve this particular problem.

score -1 · Answer 29 · answered Mar 10 '21 at 10:30

If your Fargate task runs in a private subnet with no internet access, your VPC should already have the ECR dkr VPC endpoint in place so that Fargate (version 1.3 and below) can reach it and spin up the container. For Fargate version 1.4, you additionally need the ECR api endpoint.

https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/
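
The difference between the platform versions can be written down as a small lookup. This is a sketch with hypothetical helper names (`requiredEndpoints`, `isPlatform14OrLater`, `TaskFeatures`), not an AWS API; the S3 gateway endpoint is included because ECR serves image layers from S3, and `LATEST` is assumed to resolve to 1.4.0 or newer.

```typescript
type EndpointType = "Interface" | "Gateway";

interface RequiredEndpoint {
  serviceName: string;
  type: EndpointType;
}

// Optional features of the task that need extra endpoints.
interface TaskFeatures {
  usesSecretsManager?: boolean;
  usesSsmParameters?: boolean;
  usesAwslogsDriver?: boolean;
}

function isPlatform14OrLater(platformVersion: string): boolean {
  // Assumption: "LATEST" resolves to 1.4.0 or newer.
  if (platformVersion === "LATEST") return true;
  const parts = platformVersion.split(".").map(Number);
  const major = parts[0] ?? 0;
  const minor = parts[1] ?? 0;
  return major > 1 || (major === 1 && minor >= 4);
}

// Lists the VPC endpoints a task needs in a subnet without internet access.
function requiredEndpoints(
  region: string,
  platformVersion: string,
  features: TaskFeatures = {},
): RequiredEndpoint[] {
  const prefix = `com.amazonaws.${region}`;
  const endpoints: RequiredEndpoint[] = [
    { serviceName: `${prefix}.ecr.dkr`, type: "Interface" },
    { serviceName: `${prefix}.s3`, type: "Gateway" },
  ];
  if (isPlatform14OrLater(platformVersion)) {
    // From 1.4.0 on, the ECR API calls also go through the task's own network interface.
    endpoints.push({ serviceName: `${prefix}.ecr.api`, type: "Interface" });
  }
  if (features.usesSecretsManager) {
    endpoints.push({ serviceName: `${prefix}.secretsmanager`, type: "Interface" });
  }
  if (features.usesSsmParameters) {
    endpoints.push({ serviceName: `${prefix}.ssm`, type: "Interface" });
  }
  if (features.usesAwslogsDriver) {
    endpoints.push({ serviceName: `${prefix}.logs`, type: "Interface" });
  }
  return endpoints;
}
```

For the error in the question, `requiredEndpoints("us-east-1", "1.4.0", { usesSecretsManager: true })` includes the Secrets Manager endpoint, which is the one the registry credentials are fetched through.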

score -1 · Answer 30 · answered May 08 '21 at 20:06

I just had this issue, and the reason was that I forgot to add inbound and outbound rules to the security group associated with my service. (I added inbound from my ALB and outbound *.)
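
A quick way to reason about the outbound side: ECR and Secrets Manager are called over HTTPS, so at least one egress rule has to cover TCP port 443. The check below is a simplified sketch with a hypothetical `EgressRule` type; it looks only at protocol and port range and does not inspect rule destinations.

```typescript
// Hypothetical, simplified view of a security group egress rule.
interface EgressRule {
  protocol: "tcp" | "udp" | "-1"; // "-1" means all protocols
  fromPort: number;
  toPort: number;
}

// Returns true when some rule lets TCP traffic out on port 443.
function allowsHttpsEgress(rules: EgressRule[]): boolean {
  return rules.some(
    (rule) =>
      rule.protocol === "-1" ||
      (rule.protocol === "tcp" && rule.fromPort <= 443 && 443 <= rule.toPort),
  );
}

// An outbound rule limited to the application port blocks the image and secret pulls.
const onlyAppPort: EgressRule[] = [{ protocol: "tcp", fromPort: 8080, toPort: 8080 }];
// "Outbound *" from the answer above.
const allTraffic: EgressRule[] = [{ protocol: "-1", fromPort: 0, toPort: 0 }];
```

Here `allowsHttpsEgress(onlyAppPort)` is false and `allowsHttpsEgress(allTraffic)` is true.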

MillerC

score -1 · Answer 31 · edited Dec 04 '21 at 12:55

For me it was incorrect secret ARNs referenced in my task role.
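
A mistyped ARN is easy to miss by eye, so a small check before deploying can help. The sketch below uses hypothetical helpers (`parseSecretArn`, `checkSecretArn`) and a placeholder account ID; it validates only the ARN's shape, region, and account, not whether the secret actually exists.

```typescript
interface SecretArnParts {
  partition: string;
  region: string;
  accountId: string;
  secretId: string; // secret name plus the random suffix
}

// Splits a Secrets Manager secret ARN into its parts, or returns null if it is malformed.
function parseSecretArn(arn: string): SecretArnParts | null {
  const match = /^arn:(aws[a-z-]*):secretsmanager:([a-z0-9-]+):(\d{12}):secret:(.+)$/.exec(arn);
  if (match === null) return null;
  const [, partition = "", region = "", accountId = "", secretId = ""] = match;
  return { partition, region, accountId, secretId };
}

// Returns a list of human-readable problems; an empty list means the ARN looks consistent.
function checkSecretArn(arn: string, expectedRegion: string, expectedAccountId: string): string[] {
  const parts = parseSecretArn(arn);
  if (parts === null) {
    return [`not a Secrets Manager secret ARN: ${arn}`];
  }
  const problems: string[] = [];
  if (parts.region !== expectedRegion) {
    problems.push(`secret is in ${parts.region}, task runs in ${expectedRegion}`);
  }
  if (parts.accountId !== expectedAccountId) {
    problems.push(`secret belongs to account ${parts.accountId}, expected ${expectedAccountId}`);
  }
  return problems;
}

const problems = checkSecretArn(
  "arn:aws:secretsmanager:eu-west-1:123456789012:secret:registry-credentials-AbCdEf",
  "us-east-1",
  "123456789012",
);
```

Note that for registry credentials and for `secrets` in the container definition, ECS fetches the value with the task execution role, so that is the role whose policy needs the correct ARNs.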

edited Dec 04 '21 at 12:55

lkatiforis

answered Dec 03 '21 at 18:05

cam
