50

Slightly tearing my hair out with this one... I am trying to run a Docker image on Fargate in a VPC in a Public subnet. When I run this as a Task I get:

ResourceInitializationError: unable to pull secrets or registry auth: pull
command failed: : signal: killed

If I run the Task in a Private subnet, through a NAT, it works. It also works if I run it in a Public subnet of the default VPC.

I have checked through the advice here:

Aws ecs fargate ResourceInitializationError: unable to pull secrets or registry auth

In particular, I have security groups set up to allow all traffic, and the Network ACL is also set up to allow all traffic. I have even been quite liberal with the IAM permissions, to try to eliminate that as a possibility:

The task execution role has:

{
    "Action": [
        "kms:*",
        "secretsmanager:*",
        "ssm:*",
        "s3:*",
        "ecr:*",
        "ecs:*",
        "ec2:*"
    ],
    "Resource": "*",
    "Effect": "Allow"
}

With trust relationship to allow ecs-tasks to assume this role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The security group is:

sg-093e79ca793d923ab All traffic All traffic All 0.0.0.0/0

And the Network ACL is:

Inbound
Rule number  Type         Protocol  Port range  Source       Allow/Deny
100          All traffic  All       All         0.0.0.0/0    Allow
*            All traffic  All       All         0.0.0.0/0    Deny

Outbound
Rule number  Type         Protocol  Port range  Destination  Allow/Deny
100          All traffic  All       All         0.0.0.0/0    Allow
*            All traffic  All       All         0.0.0.0/0    Deny

I set up flow logs on the subnet, and I can see that traffic is Accept Ok in both directions.

I do not have any Interface Endpoints set up to reach AWS services without going through the Internet Gateway.

I also have a public IP address assigned to the Fargate instance upon creation.

This should work, since the Public subnet should have access to all needed services through the Internet Gateway. It also works in the default VPC or a Private subnet.

Can anyone suggest what else I should check to debug this?

user2800708

8 Answers

49

One potential cause of ResourceInitializationError: unable to pull secrets or registry auth: pull command failed: : signal: killed is that Auto-assign public IP is disabled. After I enabled it (recreating the service from scratch), the task ran properly without issues.


valdem
  • Hi valdem, where should I enable the Auto-assign public IP? – Chez May 11 '21 at 06:18
  • 1
    Hi Chez. I updated the answer, adding the screenshot where you can configure Auto-assign public IP – valdem May 11 '21 at 12:54
  • 25
    But what if you don't want the task to have a public IP? – TheRennen Jun 05 '21 at 21:11
  • This solves the issue, but if you want Fargate in a private subnet then it still does not reach ECR (in my case not even with DNS on and a PrivateLink endpoint to ECR) – santamanno Jun 10 '21 at 20:06
  • This solved my issue for a public subnet. Private subnet is a different beast. – Ryan Walls Jun 11 '21 at 04:56
  • 2
    For private subnets you will likely need to have a NAT gateway. That will also allow you to have tasks without a public IP. Note that NAT gateways are pretty expensive. You are often better off with a public IP and a locked down security group. – morras Jul 07 '21 at 08:13
  • @valdem - many thanks, you save my day! (BTW, this issue seems really weird - AFAIK, we should be able to run instances w/o public IP into public subnet) – Vitaly Karasik DevOps Dec 29 '21 at 10:09
  • 1
    Without a public IP, your instance can't communicate with the internet (or in this case the ECR registry, which is outside of the VPC), because the receiving end does not know where to send the packets back. In the case of a private subnet, the NAT gateway has the public IP (and it can route the packets back to the original instance, because the NAT is inside the VPC). – Kicsi Jan 19 '22 at 09:31
15

Edited answer based on feedback from @nathan and @howard-swope

Checklist:

  • The VPC has "DNS hostnames" and "DNS resolution" enabled.
  • The "Task execution role" has access to ECR, e.g. it has the managed policy AmazonECSTaskExecutionRolePolicy attached.

If the task is running in a PUBLIC subnet:

  • The subnets have internet access, i.e. their route table has a route to an internet gateway.

  • "Assign public IP" is enabled when creating the task.

If the task is running in a PRIVATE subnet:

  • The subnets have internet access, i.e. their route table sends outbound traffic through a NAT gateway. ... The NAT gateway resides in a public subnet.
Koroslak
  • 1
    This is a good checklist for people running a container which will fire tasks in a private subnet with a VPC routing table configured to route outbound traffic via a NAT gateway which resides in a public subnet. – Nathan Dec 22 '21 at 18:58
  • 2
    @Nathan I am not sure this is accurate. If you are speaking of ECS tasks, I don't believe they are fired from a container; it is the other way around. The task pulls and launches the container. And if your containers are running in a private subnet they should not have public IPs. That is the point of the private subnet, is it not? – Howard Swope Feb 11 '22 at 16:33
  • 1
    You are correct about the tasks pulling the containers. The ECR is not located in the private subnet and __something__ needs to handle that. For tasks that run in a private subnet, either a NAT gateway handles the packet resolution OR a public IP address needs to be assigned. – Nathan Feb 13 '22 at 14:07
  • 1
    @HowardSwope you're correct. My original post assumes that the task is in a PUBLIC subnet. I'll edit my answer. THANKS FOR THE FEEDBACK! :) – Koroslak Feb 14 '22 at 08:54
14

For those unlucky souls, there is one more thing to check.

I already had an internet gateway in my VPC, DNS was enabled for that VPC, all containers were getting public IPs and the execution role already had access to ECR. But even so, I was still getting the same error.

It turns out the problem was the route table. The route table of my VPC had no route directing outbound traffic to the internet gateway, so my subnet had no internet access.

Adding a second entry to the route table that sends 0.0.0.0/0 traffic to the internet gateway solved the issue.
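To check for this condition from a script, here is a minimal sketch. It assumes the input is one entry of the RouteTables list returned by boto3's ec2.describe_route_tables(); the function name and the sample IDs are hypothetical, made up for illustration.

```python
from typing import Any, Dict


def has_default_igw_route(route_table: Dict[str, Any]) -> bool:
    """Return True if the table has an active 0.0.0.0/0 route to an internet gateway."""
    for route in route_table.get("Routes", []):
        if (
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and route.get("GatewayId", "").startswith("igw-")
            and route.get("State", "active") == "active"
        ):
            return True
    return False


# Hypothetical sample data, shaped like describe_route_tables() output.
local_only = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    ]
}
with_igw = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123456789abcdef0", "State": "active"},
    ]
}

print(has_default_igw_route(local_only))
print(has_default_igw_route(with_igw))
```

A table with only the local route is the situation described above: the subnet looks public but has no path to the internet.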


e-mre
12

I was facing the same issue. In my case, I was triggering the Fargate container from a Lambda function using the RunTask operation, and I was not passing the parameter below to RunTask:

assignPublicIp: ENABLED

After adding this, the container started without any issues.
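For reference, a rough sketch of where that parameter sits in a boto3 run_task call. The helper names and every ID in the example are hypothetical placeholders; only the request structure (networkConfiguration > awsvpcConfiguration > assignPublicIp) is the part that matters.

```python
from typing import Any, Dict, List


def build_run_task_params(
    cluster: str,
    task_definition: str,
    subnets: List[str],
    security_groups: List[str],
    assign_public_ip: bool = True,
) -> Dict[str, Any]:
    """Build keyword arguments for the ECS RunTask call on Fargate."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                # The setting this answer is about.
                "assignPublicIp": "ENABLED" if assign_public_ip else "DISABLED",
            }
        },
    }


def run_fargate_task(params: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("ecs").run_task(**params)


# Hypothetical identifiers, for illustration only.
params = build_run_task_params(
    cluster="my-cluster",
    task_definition="my-task:1",
    subnets=["subnet-0123456789abcdef0"],
    security_groups=["sg-0123456789abcdef0"],
)
print(params["networkConfiguration"]["awsvpcConfiguration"]["assignPublicIp"])
```

Passing the security group explicitly also avoids silently falling back to the VPC's default security group.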

Gurudeepak
6

It turns out that I did not have DNS support enabled for the VPC. Once this is enabled, it works.

I did not see DNS support explicitly mentioned in any docs for Fargate. I guess it's pretty obvious, since how else would it look up the various AWS services it needs? But I thought it worth noting in an answer against this error message.
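If you want to check this without clicking through the console, something like the following should work. It is a sketch that assumes a boto3 EC2 client is passed in; the function name vpc_dns_status is made up for illustration.

```python
from typing import Any, Dict


def vpc_dns_status(ec2_client: Any, vpc_id: str) -> Dict[str, bool]:
    """Read the two DNS attributes of a VPC via describe_vpc_attribute."""
    status = {}
    # describe_vpc_attribute accepts one attribute per call; the response key
    # is the attribute name with a capitalised first letter.
    for attribute, response_key in (
        ("enableDnsSupport", "EnableDnsSupport"),
        ("enableDnsHostnames", "EnableDnsHostnames"),
    ):
        response = ec2_client.describe_vpc_attribute(VpcId=vpc_id, Attribute=attribute)
        status[attribute] = response[response_key]["Value"]
    return status


# Example usage (needs boto3 and credentials; the VPC ID is a placeholder):
#   import boto3
#   print(vpc_dns_status(boto3.client("ec2"), "vpc-0123456789abcdef0"))
```

If enableDnsSupport comes back False, it can be switched on with ec2.modify_vpc_attribute(VpcId=..., EnableDnsSupport={"Value": True}), which likewise takes one attribute per call.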

user2800708
  • Did you have to add VPC endpoints as well for each service the container uses? – santamanno Jun 10 '21 at 20:07
  • @santamanno Yes, you need to create a VPC endpoint for each service. – Irtiza Aug 13 '21 at 06:38
  • Yes, thank you. It must be either on a public subnet, a private subnet with NAT or private VPC endpoints to the required services. In any case, as the OP points out, DNS resolution must be enabled in my experience. – santamanno Aug 14 '21 at 11:42
4

For AWS Batch using Fargate, this error was triggered by the 'Assign public IP' setting being disabled.

This setting is configurable during the Job Definition step. However, it is not configurable in the UI after the Job Definition has been created.
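When the job definition is created from code instead of the UI, the setting lives under containerProperties > networkConfiguration. A minimal sketch of the arguments for boto3's batch.register_job_definition follows; the helper name, the image, the role ARN, and the CPU/memory values are placeholders chosen for illustration.

```python
from typing import Any, Dict


def build_fargate_job_definition(name: str, image: str, execution_role_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for a Fargate job definition with a public IP."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "platformCapabilities": ["FARGATE"],
        "containerProperties": {
            "image": image,
            "executionRoleArn": execution_role_arn,
            # Placeholder sizing; pick values valid for your workload.
            "resourceRequirements": [
                {"type": "VCPU", "value": "0.25"},
                {"type": "MEMORY", "value": "512"},
            ],
            # The setting this answer is about.
            "networkConfiguration": {"assignPublicIp": "ENABLED"},
        },
    }


# Hypothetical values, for illustration only.
job_def = build_fargate_job_definition(
    name="my-job",
    image="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-image:latest",
    execution_role_arn="arn:aws:iam::123456789012:role/my-execution-role",
)
print(job_def["containerProperties"]["networkConfiguration"])

# To register it (needs boto3 and credentials):
#   import boto3
#   boto3.client("batch").register_job_definition(**job_def)
```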


Kermit
  • This is helpful, the main answers did not specify where to enable this parameter and I did not face the "Create Service" interface because I'm creating my job definitions with CDK. – Outpox Nov 26 '21 at 11:04
  • For boto3 it took me a bit to find it. For the JobDefinition it is under ContainerProperties > NetworkConfiguration > AssignPublicIp: ENABLED. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-batch-jobdefinition-containerproperties-networkconfiguration.html – Moemars Apr 30 '22 at 21:15
  • This is helpful, this can be done while creating a new revision of existing "Job definition". – isudarsan Jun 26 '22 at 14:37
3

The AWS container runner needs access to the container repositories and to AWS services.

If you're on a public subnet, the easiest option is to enable "Auto-assign public IP" so that your containers have internet access, even if your app does not need egress access to the internet.

Otherwise, if you're using only AWS services (ECR, and no images pulled from docker.io), you could use VPC endpoints to access ECR/S3/CloudWatch, and enable the DNS options on your VPC.

For a private subnet, it's the same.

If you're using docker.io images, then you need egress access to the internet from your subnet anyway.
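As a rough guide to the endpoints route, here is a small sketch listing the endpoint service names typically involved in pulling an ECR image and writing logs without internet access. The function name is hypothetical, and the exact set is an assumption that depends on your task: for example, a task definition that references Secrets Manager or SSM parameters would need endpoints for those services too.

```python
from typing import Dict


def ecr_pull_endpoint_services(region: str) -> Dict[str, str]:
    """Map VPC endpoint service names to endpoint types for a given region."""
    return {
        f"com.amazonaws.{region}.ecr.api": "Interface",  # ECR API calls (auth token)
        f"com.amazonaws.{region}.ecr.dkr": "Interface",  # Docker registry endpoint
        f"com.amazonaws.{region}.s3": "Gateway",         # image layers are stored in S3
        f"com.amazonaws.{region}.logs": "Interface",     # CloudWatch Logs (awslogs driver)
    }


for service, endpoint_type in ecr_pull_endpoint_services("eu-west-1").items():
    print(f"{endpoint_type:<10}{service}")
```

Interface endpoints rely on private DNS, which is why the VPC DNS options need to be enabled for this setup to resolve the service hostnames.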

FredG
  • 1
    Sorry, does not work. I have a private subnet with NAT (and tried without) and all the endpoint added to the VPC, still unreachable... – santamanno Jun 10 '21 at 20:21
  • 1
    If you have a NAT gateway for egress traffic on a private network (without an internet gateway/public IP on instances/tasks), it's not even necessary to use VPC endpoints. I'd recommend you launch an EC2 instance in your subnet, ssh to it, and test your connectivity there. AWS network setup can be quite frustrating to get right. – FredG Jun 11 '21 at 12:39
1

In my case, I hit the above error while running the run-task command (yes, not via the Service route) because I was not specifying the security group in the aws ecs run-task --network-configuration. This resulted in the default SG of the task's VPC being picked up. My default SG in that VPC had no inbound/outbound rules defined. I added ONLY an outbound rule allowing all traffic to everywhere and the error went away.

My setup is that the ECS/Fargate task will run in a private subnet with ECR connectivity via VPC Interface endpoints. I had the checklist, mentioned above, checked and in addition added the SG rule.
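A quick way to spot a security group with no usable outbound rule is sketched below. It assumes the input is one entry of the SecurityGroups list returned by boto3's ec2.describe_security_groups(); the function name and sample data are hypothetical. It only looks for the broad "all traffic to 0.0.0.0/0" rule used in this answer, so a group with narrower but sufficient rules (for example, HTTPS only) would be reported as False.

```python
from typing import Any, Dict


def allows_all_egress(security_group: Dict[str, Any]) -> bool:
    """Return True if the group has an egress rule for all protocols to 0.0.0.0/0."""
    for permission in security_group.get("IpPermissionsEgress", []):
        if permission.get("IpProtocol") != "-1":  # "-1" means all protocols
            continue
        for ip_range in permission.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                return True
    return False


# Hypothetical sample data, shaped like describe_security_groups() output.
empty_sg = {"GroupId": "sg-00000000000000000", "IpPermissionsEgress": []}
open_sg = {
    "GroupId": "sg-11111111111111111",
    "IpPermissionsEgress": [
        {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}

print(allows_all_egress(empty_sg))
print(allows_all_egress(open_sg))
```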

user2622263