26

I have an EKS cluster setup in a VPC. The worker nodes are launched in private subnets. I can successfully deploy pods and services.

However, I'm not able to perform DNS resolution from within the pods. (It works fine on the worker nodes, outside the container.)

Troubleshooting using https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/ results in the following from nslookup (timeout after a minute or so):

Server:    172.20.0.10
Address 1: 172.20.0.10

nslookup: can't resolve 'kubernetes.default'
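
For reference, the check from that page boils down to something like the following (a busybox-based sketch; the exact image/manifest the guide recommends may differ):

    # Run a throwaway pod and try to resolve the in-cluster service name
    kubectl run -it --rm --restart=Never dns-test --image=busybox:1.28 -- nslookup kubernetes.default
    # Also check what the pod's resolver configuration looks like
    kubectl run -it --rm --restart=Never dns-test --image=busybox:1.28 -- cat /etc/resolv.conf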

When I launch the cluster in an all-public VPC, I don't have this problem. Am I missing any necessary steps for DNS resolution from within a private subnet?

Many thanks, Daniel

Daniel
  • is `kube-dns` or `core-dns` up? what does it say when you type `kubectl get pods -n kube-system`? check the `/etc/resolv.conf` in the container in the pod, it should point to the `kube-dns`/`core-dns` internal IP address – Rico Sep 11 '18 at 13:52
  • Rico, kube-dns is up and running. Not sure how I find the internal IP of kube-dns, but the resolv.conf in the container looks like this: nameserver 10.100.0.10 search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal us-west-2.compute.internal options ndots:5 – Daniel Sep 11 '18 at 14:25
  • Found the IP of the kube-dns service, and it's 10.100.0.10, i.e. the same as specified in /etc/resolv.conf in my container. – Daniel Sep 11 '18 at 14:50
  • So I believe `kubernetes.default.svc.cluster.local` should resolve. You can try in your container `dig @10.100.0.10 kubernetes.default` see if you have connectivity to your `kube-dns` – Rico Sep 11 '18 at 15:03
  • You're right! The problem was the network ACLs in our custom VPC. Had to open up UDP traffic for kube-dns to work properly. Haven't been able to figure out which ports yet; it seems like multiple ports (including 53) are required. Thanks for helping out! – Daniel Sep 17 '18 at 11:23
  • How did you launch your master controller (aka the cluster)? Is it only private subnets, or public subnets, or both? I am setting one up for myself and just wanted to understand the pros and cons of choosing subnets. Please recommend. Thanks! – Vaibhav Jain Nov 17 '18 at 12:01
  • @Daniel did you ever sort out the exact ports you needed to open in the NACL for kube-dns to work properly? I think I'm facing similar issues with kube-dns in a split public/private subnetted VPC – Tommy Adamski Nov 21 '18 at 21:21
  • @TommyAdamski simply allowing outbound UDP traffic on port 53 in my ACL worked for me; give it a few seconds to update before trying – apdm Dec 07 '18 at 07:37
  • @apdm Thanks! When I opened up port 53, dns was finally reliable! – Tommy Adamski Dec 08 '18 at 09:06

8 Answers

23

I feel like I have to give this a proper answer because coming upon this question was the answer to 10 straight hours of debugging for me. As @Daniel said in his comment, the issue I found was my ACL blocking outbound traffic on UDP port 53, which Kubernetes apparently uses to resolve DNS records.

The process was especially confusing for me because one of my pods actually worked the entire time since (I think?) it happened to be in the same zone as the Kubernetes DNS resolver.
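
If it saves anyone some clicking, the fix amounts to an outbound allow rule for UDP 53 on the subnets' network ACL. A rough AWS CLI sketch (the ACL ID, rule number and CIDR below are placeholders for your own values):

    # Allow outbound DNS (UDP 53) from the worker-node subnets
    aws ec2 create-network-acl-entry \
        --network-acl-id acl-0123456789abcdef0 \
        --egress \
        --rule-number 100 \
        --protocol udp \
        --port-range From=53,To=53 \
        --cidr-block 10.0.0.0/16 \
        --rule-action allow

Keep in mind that NACLs are stateless, so the DNS responses also need matching inbound rules; see the next answer about ephemeral ports.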

apdm
  • Hey, I am facing the exact same issue. My pods can't resolve DNS or reach the internet. I tried fixing the NACL but it's not working. I am using the default VPC; the worker nodes are in a private subnet, while the subnets chosen in EKS are also private. Can you help? – WickStargazer Jan 01 '19 at 19:24
  • Also just spent way too many hours debugging this. Huge help! And thanks to @mattwilber below for being more explicit about what changes to make. – David Aug 09 '19 at 20:59
  • I was going through the exact same scenario and yes I too had some pods working just fine which I suspect to be in the same region as the DNS resolver. The problem turned out to be having UDP completely denied in the NACLs. Opening up UDP to the VPC solved the problem. – Roshan Amadoru Dec 25 '20 at 17:47
  • Thanks, it solved my problem; I was scratching my head. I finally opened up TCP (53) in my ACLs for traditional telnet/connectivity checks and UDP (53) for DNS queries, and also opened up the ephemeral ports (e.g. 1025-65535). Previously, whenever my pods were in the same AZ they were able to communicate, but cross-AZ `nslookup` was giving connection timeouts. When I opened up TCP/UDP (53) in the ACLs, telnet started working, but the nslookup queries were still failing with a timeout; then I opened the ephemeral ports as suggested by @Matt Wilber and my `nslookup` queries started working. – Shivam Som Jul 04 '23 at 18:13
17

To elaborate on the comment from @Daniel, you need:

  1. an ingress rule for UDP port 53
  2. an ingress rule for UDP on ephemeral ports (e.g. 1025–65535)

I hadn't added (2) and was seeing CoreDNS receiving requests and trying to respond, but the response wasn't getting back to the requester.
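
For example, with the AWS CLI the two inbound NACL entries might look roughly like this (the ACL ID, rule numbers and CIDR are placeholders; scope the source to your VPC CIDR, as suggested in the comments below):

    # (1) inbound DNS queries to the nodes running CoreDNS
    aws ec2 create-network-acl-entry \
        --network-acl-id acl-0123456789abcdef0 \
        --ingress \
        --rule-number 100 \
        --protocol udp \
        --port-range From=53,To=53 \
        --cidr-block 10.0.0.0/16 \
        --rule-action allow

    # (2) inbound UDP responses on ephemeral ports back to the requesting pods
    aws ec2 create-network-acl-entry \
        --network-acl-id acl-0123456789abcdef0 \
        --ingress \
        --rule-number 110 \
        --protocol udp \
        --port-range From=1025,To=65535 \
        --cidr-block 10.0.0.0/16 \
        --rule-action allow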

A tip for others dealing with these kinds of issues: turn on CoreDNS logging by adding the `log` plugin to the configmap, which I was able to do with `kubectl edit configmap -n kube-system coredns`. See the CoreDNS docs on this: https://github.com/coredns/coredns/blob/master/README.md#examples. This can help you figure out whether the problem is CoreDNS not receiving the queries at all, or receiving them but failing to get the response back.
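
As a rough sketch, enabling the logging and then reading the CoreDNS output could look like this (the `log` line goes inside the `.:53 { ... }` server block of the Corefile; `k8s-app=kube-dns` is the label EKS applies to the CoreDNS pods, as far as I know):

    # Add a line containing just `log` inside the .:53 { ... } block of the Corefile
    kubectl -n kube-system edit configmap coredns
    # Then check the CoreDNS pod logs while reproducing the failing lookup
    kubectl -n kube-system logs -l k8s-app=kube-dns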

Matt Wilber
  • To elaborate even further, the ingress rule for UDP port 53 does *not* need to be open to the world; it can be restricted to IPs from the VPC CIDR block, e.g. 10.0.0.0/16 – adamkgray Sep 09 '20 at 11:42
  • Ephemeral ports likewise can be restricted to the IPs from the VPC CIDR block. Consider using 1024-65535, as that is what AWS recommends. – adamkgray Sep 11 '20 at 05:55
2

I ran into this as well. I have multiple node groups, and each one was created from a CloudFormation template. The CloudFormation template created a security group for each node group that allowed the nodes in that group to communicate with each other.

The DNS error resulted from Pods running in separate node groups from the CoreDNS Pods, so the Pods were unable to reach CoreDNS (network communications were only permitted within node groups). I will make a new CloudFormation template for the node security group so that all the node groups in my cluster can share the same security group.

I resolved the issue for now by allowing inbound UDP traffic on port 53 for each of my node group security groups.
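
If you prefer the CLI over the console, a rule of that shape could look like this (the security group IDs are placeholders; add one rule per node group SG, referencing the other node group's SG as the source, and repeat with --protocol tcp if you also want DNS over TCP):

    # Allow DNS (UDP 53) into this node group's SG from another node group's SG
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0aaaaaaaaaaaaaaaa \
        --protocol udp \
        --port 53 \
        --source-group sg-0bbbbbbbbbbbbbbbb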

1

So I have been struggling with this issue as well for a couple of hours, I think; I lost track of time.

Since I am using the default VPC but with the worker nodes inside a private subnet, it wasn't working.

I went through the amazon-vpc-cni-k8s repo and found the solution.

We have to set the environment variable AWS_VPC_K8S_CNI_EXTERNALSNAT=true on the aws-node daemonset.

You can either get the new YAML and apply it, or just fix it through the dashboard. However, for it to work you have to restart the worker node instances so the IP route tables are refreshed.
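
A quick way to set it without editing the full YAML, as a sketch:

    # Set external SNAT on the VPC CNI plugin's DaemonSet
    kubectl -n kube-system set env daemonset aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true
    # Verify the variable is now present
    kubectl -n kube-system describe daemonset aws-node | grep AWS_VPC_K8S_CNI_EXTERNALSNAT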

issue link is here

Thanks!

WickStargazer
1

Re: AWS EKS Kube cluster and internal/private Route53 queries from pods

Just wanted to post a note on what we needed to do to resolve our issues. Noting that YMMV and everyone has different environments and resolutions, etc.

Disclaimer: We're using the community Terraform EKS module to deploy/manage VPCs and the EKS clusters. We didn't need to modify any security groups. We are working with multiple clusters, regions, and VPCs.

ref: Terraform EKS module

CoreDNS Changes: We have a DNS relay for private internal domains, so we needed to modify the coredns configmap and add in the DNS relay IP address ...

ec2.internal:53 {
    errors
    cache 30
    forward . 10.1.1.245
}
foo.dev.com:53 {
    errors
    cache 30
    forward . 10.1.1.245
}
foo.stage.com:53 {
    errors
    cache 30
    forward . 10.1.1.245
}

...

VPC DHCP option sets: Update with the IP of the above relay server if applicable; this requires creating a new option set, as existing ones cannot be modified.

Our DHCP options set looks like this:

["AmazonProvidedDNS", "10.1.1.245", "169.254.169.253"]

ref: AWS DHCP Option Sets
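
For reference, recreating and attaching an option set like ours from the CLI could look roughly like this (the IDs are placeholders; the DNS server list mirrors the one above):

    # Create a new DHCP option set that includes the internal DNS relay
    aws ec2 create-dhcp-options \
        --dhcp-configurations "Key=domain-name-servers,Values=AmazonProvidedDNS,10.1.1.245,169.254.169.253"
    # Attach it to the VPC (existing option sets cannot be edited in place)
    aws ec2 associate-dhcp-options \
        --dhcp-options-id dopt-0123456789abcdef0 \
        --vpc-id vpc-0123456789abcdef0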

Route53 Updates: Associate every Route53 zone with the VPC ID it needs to be resolvable from (the VPC where our kube cluster resides and where the pods will make queries from).

There is also a Terraform resource for that: https://www.terraform.io/docs/providers/aws/r/route53_zone_association.html
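
If you are not using Terraform for that piece, the same association can be done with the AWS CLI, roughly (zone ID, region and VPC ID are placeholders):

    # Associate a private hosted zone with the VPC the cluster lives in
    aws route53 associate-vpc-with-hosted-zone \
        --hosted-zone-id Z0123456789EXAMPLE \
        --vpc VPCRegion=eu-west-1,VPCId=vpc-0123456789abcdef0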

cmcc
0

We ran into a similar issue where DNS resolution times out on some of the pods, but re-creating the pod a couple of times resolves the problem. Also, it's not every pod on a given node showing issues, only some pods.

It turned out to be due to a bug in version 1.5.4 of the Amazon VPC CNI; more details here: https://github.com/aws/amazon-vpc-cni-k8s/issues/641.

The quick solution is to revert to the recommended version 1.5.3: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
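
To check which version you are running and roll it back, something along these lines should work (the ECR registry/region in the image below is an assumption; take the exact image URI for your region from the EKS docs linked above):

    # See the currently deployed VPC CNI image
    kubectl -n kube-system describe daemonset aws-node | grep Image
    # Roll the CNI back to v1.5.3 (adjust the registry/region to match the EKS docs)
    kubectl -n kube-system set image daemonset aws-node \
        aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.3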

Vivek Thomas
0

Like many others, I've been struggling with this bug for a few hours.

In my case the issue was this bug https://github.com/awslabs/amazon-eks-ami/issues/636, which basically sets up an incorrect DNS configuration on the node when you specify the endpoint and certificate to the node bootstrap but not the cluster DNS IP.

To confirm, check:

  • That you have connectivity (NACL and security groups) allowing DNS on TCP and UDP. For me the best way was to SSH into a node and see if it resolves (nslookup). If it doesn't resolve, it is most likely either the NACL or an SG, but also check that the DNS nameserver on the node is configured correctly.
  • If you get name resolution on the node, but not inside the pod, check that the nameserver in /etc/resolv.conf points to an IP in your service network (if you see 172.20.0.10, your service network should be 172.20.0.0/24 or so). A quick way to run both checks is sketched below.
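
A minimal sketch of those two checks (172.20.0.10 is just the example service IP from the bullet above; use whatever your cluster reports, and replace <some-pod> with one of your pods):

    # On the node (via SSH): does the cluster DNS service answer directly?
    nslookup kubernetes.default.svc.cluster.local 172.20.0.10
    # From your workstation: confirm the DNS service IP and the pod's resolver config
    kubectl -n kube-system get svc kube-dns
    kubectl exec -it <some-pod> -- cat /etc/resolv.conf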
Gonfva
0

I also banged my head on this class of problem for several days.

Specifically, my problem was that I could VPN into my private subnet and resolve my management endpoint once. This is to say that on Monday (for example) from the office I could set up my VPN for the first time and use kubectl to my heart's content, but on Tuesday at home I would be able to connect to the VPN, yet when I ran any kubectl interaction I would get hit with:

Unable to connect to the server: dial tcp: lookup XXXXX.eks.amazonaws.com on 127.0.0.53:53: read udp 127.0.0.1:39491->127.0.0.53:53: i/o timeout

I think it is working again (but I've yet to see if it works tomorrow from the office). The change I made was to the Client VPN endpoint's Authorization rules: I added a rule to authorize access to the destination network containing the DNS server (VPC CIDR +2), which I had also called out explicitly as the DNS server for the Client VPN Endpoint itself.
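
For anyone scripting this, the equivalent authorization rule via the AWS CLI looks roughly like this (the endpoint ID and CIDR are placeholders; the /32 is the VPC-CIDR-plus-two resolver address mentioned above):

    # Authorize VPN clients to reach the VPC DNS resolver (VPC CIDR base + 2)
    aws ec2 authorize-client-vpn-ingress \
        --client-vpn-endpoint-id cvpn-endpoint-0123456789abcdef0 \
        --target-network-cidr 10.0.0.2/32 \
        --authorize-all-groups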

Thismatters