43

I am encountering an issue with Kubernetes where my pods cannot resolve hostnames (such as google.com or kubernetes.default).

I currently have 1 master and 1 node running on two CentOS7 instances in OpenStack. I deployed using kubeadm.

Here are the versions installed:

kubeadm-1.7.3-1.x86_64
kubectl-1.7.3-1.x86_64
kubelet-1.7.3-1.x86_64
kubernetes-cni-0.5.1-0.x86_64

The steps below outline some of the verification I have done, which may give some insight into my problem.

I define a busybox pod:

apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  containers:
  - image: busybox
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
    name: busybox
  restartPolicy: Always

And then create the pod:

$ kubectl create -f busybox.yaml

Try to perform a DNS lookup of name google.com:

$ kubectl exec -ti busybox -- nslookup google.com
Server:    10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'google.com'

Try to perform a DNS lookup of name kubernetes.default:

$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
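
As an additional sanity check (just a sketch; 8.8.8.8 is an example external resolver, not something from my setup), pointing nslookup at an outside DNS server directly would show whether plain UDP DNS to the internet works from the pod at all:

$ kubectl exec -ti busybox -- nslookup google.com 8.8.8.8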

Check if my DNS pod is running:

$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                        READY     STATUS    RESTARTS   AGE
kube-dns-2425271678-k1nft   3/3       Running   9          5d

Check if my DNS service is up:

$ kubectl get svc --namespace=kube-system
NAME       CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
kube-dns   10.96.0.10   <none>        53/UDP,53/TCP   5d

Check if DNS endpoints are exposed:

$ kubectl get ep kube-dns --namespace=kube-system
NAME       ENDPOINTS                     AGE
kube-dns   10.244.0.5:53,10.244.0.5:53   5d
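
Since the endpoint IP (10.244.0.5) is visible here, another check worth noting (a sketch; I have not captured its output) is querying the DNS pod directly, bypassing the service VIP, to separate "DNS server broken" from "service IP unreachable":

$ kubectl exec -ti busybox -- nslookup kubernetes.default 10.244.0.5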

Check the contents of /etc/resolv.conf in my container:

$ kubectl exec -ti busybox -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

If I understand correctly, the Kubernetes documentation states that my pods should inherit the DNS configuration of the node (or master?). However, even with just one line in it (nameserver 10.92.128.40), I receive the warning below when spinning up a pod:

Search Line limits were exceeded, some dns names have been omitted, the applied search line is: default.svc.cluster.local svc.cluster.local cluster.local mydomain.net anotherdomain.net yetanotherdomain.net

I understand there is a known issue where only so many items can be listed in /etc/resolv.conf. However, where are the search line and nameserver shown above in my container generated from?
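
If I have it right, with the default dnsPolicy of ClusterFirst the kubelet generates the pod's /etc/resolv.conf itself: the cluster DNS service IP (10.96.0.10) becomes the only nameserver, and the search domains from whatever file the kubelet's --resolv-conf flag points at (the node's /etc/resolv.conf by default) are appended to the cluster search domains, which would explain both the contents above and the search-line-limit warning. A rough way to check which file the kubelet is actually reading (a sketch; paths and flags may differ by install):

# on the node, not inside a pod
$ cat /etc/resolv.conf
$ ps aux | grep kubelet | grep resolv-conf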

Finally here are the logs from the kube-dns container:

$ kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c kubedns
I0817 20:54:58.445280       1 dns.go:48] version: 1.14.3-4-gee838f6
I0817 20:54:58.452551       1 server.go:70] Using configuration read from directory: /kube-dns-config with period 10s
I0817 20:54:58.452616       1 server.go:113] FLAG: --alsologtostderr="false"
I0817 20:54:58.452628       1 server.go:113] FLAG: --config-dir="/kube-dns-config"
I0817 20:54:58.452638       1 server.go:113] FLAG: --config-map=""
I0817 20:54:58.452643       1 server.go:113] FLAG: --config-map-namespace="kube-system"
I0817 20:54:58.452650       1 server.go:113] FLAG: --config-period="10s"
I0817 20:54:58.452659       1 server.go:113] FLAG: --dns-bind-address="0.0.0.0"
I0817 20:54:58.452665       1 server.go:113] FLAG: --dns-port="10053"
I0817 20:54:58.452674       1 server.go:113] FLAG: --domain="cluster.local."
I0817 20:54:58.452683       1 server.go:113] FLAG: --federations=""
I0817 20:54:58.452692       1 server.go:113] FLAG: --healthz-port="8081"
I0817 20:54:58.452698       1 server.go:113] FLAG: --initial-sync-timeout="1m0s"
I0817 20:54:58.452704       1 server.go:113] FLAG: --kube-master-url=""
I0817 20:54:58.452713       1 server.go:113] FLAG: --kubecfg-file=""
I0817 20:54:58.452718       1 server.go:113] FLAG: --log-backtrace-at=":0"
I0817 20:54:58.452727       1 server.go:113] FLAG: --log-dir=""
I0817 20:54:58.452734       1 server.go:113] FLAG: --log-flush-frequency="5s"
I0817 20:54:58.452741       1 server.go:113] FLAG: --logtostderr="true"
I0817 20:54:58.452746       1 server.go:113] FLAG: --nameservers=""
I0817 20:54:58.452752       1 server.go:113] FLAG: --stderrthreshold="2"
I0817 20:54:58.452759       1 server.go:113] FLAG: --v="2"
I0817 20:54:58.452765       1 server.go:113] FLAG: --version="false"
I0817 20:54:58.452775       1 server.go:113] FLAG: --vmodule=""
I0817 20:54:58.452856       1 server.go:176] Starting SkyDNS server (0.0.0.0:10053)
I0817 20:54:58.453680       1 server.go:198] Skydns metrics enabled (/metrics:10055)
I0817 20:54:58.453692       1 dns.go:147] Starting endpointsController
I0817 20:54:58.453699       1 dns.go:150] Starting serviceController
I0817 20:54:58.453841       1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0817 20:54:58.453852       1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0817 20:54:58.964468       1 dns.go:171] Initialized services and endpoints from apiserver
I0817 20:54:58.964523       1 server.go:129] Setting up Healthz Handler (/readiness)
I0817 20:54:58.964536       1 server.go:134] Setting up cache handler (/cache)
I0817 20:54:58.964545       1 server.go:120] Status HTTP port 8081

The dnsmasq container. Disregard that it found several more nameservers than the one I said was in my resolv.conf; I did have more in there originally and attempted to simplify things by removing the extras:

$ kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c dnsmasq
I0817 20:55:03.295826       1 main.go:76] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0817 20:55:03.298134       1 nanny.go:86] Starting dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053]
I0817 20:55:03.731577       1 nanny.go:111] 
W0817 20:55:03.731609       1 nanny.go:112] Got EOF from stdout
I0817 20:55:03.731642       1 nanny.go:108] dnsmasq[9]: started, version 2.76 cachesize 1000
I0817 20:55:03.731656       1 nanny.go:108] dnsmasq[9]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0817 20:55:03.731681       1 nanny.go:108] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain ip6.arpa 
I0817 20:55:03.731689       1 nanny.go:108] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0817 20:55:03.731695       1 nanny.go:108] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain cluster.local 
I0817 20:55:03.731704       1 nanny.go:108] dnsmasq[9]: reading /etc/resolv.conf
I0817 20:55:03.731710       1 nanny.go:108] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain ip6.arpa 
I0817 20:55:03.731717       1 nanny.go:108] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0817 20:55:03.731723       1 nanny.go:108] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain cluster.local 
I0817 20:55:03.731729       1 nanny.go:108] dnsmasq[9]: using nameserver 10.92.128.40#53
I0817 20:55:03.731735       1 nanny.go:108] dnsmasq[9]: using nameserver 10.92.128.41#53
I0817 20:55:03.731741       1 nanny.go:108] dnsmasq[9]: using nameserver 10.95.207.66#53
I0817 20:55:03.731747       1 nanny.go:108] dnsmasq[9]: read /etc/hosts - 7 addresses

And the sidecar container:

$ kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c sidecar
ERROR: logging before flag.Parse: I0817 20:55:04.488391       1 main.go:48] Version v1.14.3-4-gee838f6
ERROR: logging before flag.Parse: I0817 20:55:04.488612       1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
ERROR: logging before flag.Parse: I0817 20:55:04.488667       1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: I0817 20:55:04.488766       1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}

I have mostly been reading the documentation provided here. Any direction, insight or things to try would be much appreciated.

azurepancake

8 Answers

98

I had a similar problem. Restarting the coredns deployment solved it for me:

kubectl -n kube-system rollout restart deployment coredns
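
To confirm the restart went through and DNS actually recovered, a quick follow-up (a sketch; dnstest is an arbitrary pod name, and busybox:1.28 is used because newer busybox images ship a flaky nslookup):

kubectl -n kube-system rollout status deployment coredns
kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default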
Alejandro703
  • coredns was running, so I thought it must be fine, but restarting it worked like a charm, thanks. – Rewanth Tammana Oct 27 '20 at 15:13
  • My coredns was running fine and had no errors, but this command fixed my problem like a charm. I'm so thankful that I decided to log in and press an upvote. – Edward Zhang Nov 20 '20 at 05:53
  • This fixed my problem. Thanks for the awesome command! – adarliu Apr 08 '21 at 01:22
  • This fixed my problem. No idea why. – hyperbola Apr 19 '21 at 00:28
  • Doesn't this make kubernetes useless if it doesn't detect and do this on its own? Might as well not even have liveness or health checks if we can't know about and heal from dns failing. – Chris Godwin Aug 27 '21 at 04:16
  • @ChrisGodwin makes a good point! I think that's the reason we must add monitoring solutions to our clusters. Doing so will enable us to see how many requests are failing for CoreDNS and so forth. Although, I was wondering how to minimize such downtimes. Is there some best practice to follow regarding the number of CoreDNS pods in a cluster or something? – Abhinav Thakur Sep 21 '21 at 06:26
  • I'm really glad I found this solution. I couldn't even resolve it with the "Debugging DNS Resolution" guide. It's a shame the pod does not show any error logs. – bluelurker Mar 16 '22 at 23:08
  • I checked the old coredns pods' IP range; it was not in my range, and restarting them resolved my problem. Thanks. – seyyed sina Oct 16 '22 at 15:49
9

Check the coredns pod logs; if you see errors like:

# kubectl logs --namespace=kube-system coredns-XXX
  ...
  [ERROR] plugin/errors ... HINFO: read udp ... read: no route to host

Then make sure firewalld masquerade is enabled on the host:

# firewall-cmd --list-all
  ... 
  masquerade: yes

Enable if it's "no":
# firewall-cmd --add-masquerade --permanent
# firewall-cmd --reload

You may need to restart the network service or reboot the host after this.
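
Putting it together, roughly (a sketch; it assumes a CentOS/RHEL host with firewalld and, per the comment below, a systemd-managed network service; the final coredns restart just forces the DNS pods to be recreated once routing is fixed):

firewall-cmd --add-masquerade --permanent
firewall-cmd --reload
systemctl restart network
kubectl -n kube-system rollout restart deployment coredns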

atealxt
  • what exactly should I restart? – cryptoparty Aug 12 '20 at 05:45
  • @cryptoparty restart the network, e.g. `systemctl restart network` – atealxt Aug 12 '20 at 09:06
  • Your answer fixed the issue and saved my day. I was working on a K8S cluster and CoreDNS was erroring. – Sathish Kumar Jul 01 '21 at 17:51
  • Helped me; I was debugging for a whole day but couldn't find what could be done about it. Can you please explain how you came to this conclusion? An explanation would be really helpful. :) Thanks – maharshi Mar 04 '22 at 21:07
  • This was the one that worked for a Hashicorp Vault deployment on a bare metal RKE2 flavor of Kubernetes running on SuSE 15sp2. Took extensive searching, so I'm dropping some search terms to help others find this. Thanks! – Charlie Reitzel May 11 '22 at 10:42
8

Encountered the same issue. I followed this doc, dns-debugging-resolution, and checked the DNS-related pods, services, and endpoints; all were running without error messages. Finally, I found that my calico service was dead. After I started the calico service and waited several minutes, it worked.
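
A rough way to check whether Calico is healthy (a sketch; the pod label and the systemd unit name depend on how Calico was installed, so these may need adjusting):

# all calico-node pods should be Running and Ready
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide

# if Calico runs as a host service, check and restart it
systemctl status calico-node
systemctl restart calico-node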

gzc
5

Some ideas come to mind:

Javier Salmeron
  • Hi, kube-proxy is running: `/usr/local/bin/kube-proxy --kubeconfig=/var/lib/kube-proxy/kubeconfig.conf --cluster-cidr=10.244.0.0/16`. Per [this](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#is-kube-proxy-writing-iptables-rules), the proxy is creating the needed iptables rules on the node. One odd thing: running `curl 10.111.133.184:80` against the service sometimes returns a pod name, other times I get "no route to host". I did choose Flannel when going through `kubeadm` setup. Can you point me in the direction of how to test it for sure? – azurepancake Aug 22 '17 at 17:32
  • Ok, so if I understand this correctly, Flannel provides an overlay network so pods can communicate with each other across different nodes. My `kube-dns-2425271678-k1nft` pod has an IP of `10.244.0.5`. This is an IP that is handled by Flannel (defined in `/run/flannel/subnet.env`). This DNS pod is on node `kubemaster`. Now the pods that I create are on node `kubenode01`. I can ping `10.244.0.5` from `kubemaster`, but can't from `kubenode01`. Would I be right to assume that this is likely a problem with Flannel? I'm so wet behind the ears, so apologies for dumb questions. – azurepancake Aug 22 '17 at 20:32
  • To add a little more in regard to testing the proxies: I can create two pods running nginx, add them to a service with a `Port` of `80` and a `NodePort` of `31746`. I can then access that service externally by using that node's external IP address along with the above port. Does this prove that kube-proxy is at least functioning, as it seems to successfully be forwarding that traffic over to the pods? – azurepancake Aug 22 '17 at 21:18
  • I am almost 100% sure that this problem is with the overlay. Running `kubectl exec --namespace kube-system -it kube-dns-2425271678-k1nft --container kubedns -- ping google.com` works great, as kube-dns is running on the master. It also works on any other pod running on the master, like kube-flannel, kube-proxy, etc. However, any pods that are created on the workers cannot resolve DNS, likely because they can't reach the DNS service due to some problem with Flannel. I'll begin researching how to narrow this down, but any advice would be much appreciated! – azurepancake Aug 23 '17 at 12:21
  • Looks like the internal ClusterIPs are not working. Can you ping using the pod IP (not the service IP)? – Javier Salmeron Aug 24 '17 at 07:54
  • So I restarted the docker and kubelet services on my master and nodes, and what do you know: everything started working. I have no idea what I did, or why it wasn't working, but refreshing those two services across the board somehow resolved the problem. I wish I had done that in the first place! Thanks a bunch for giving me a helping hand here. I definitely got to learn a little more about how Kubernetes works :) – azurepancake Aug 24 '17 at 22:02
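
Gathering the checks from this comment thread into one place (a sketch; 10.244.0.5 is the kube-dns pod IP from the question, and the service restarts at the end are what ultimately fixed it for the asker):

# from a pod scheduled on the worker, test the DNS pod directly
kubectl exec -ti busybox -- ping -c 3 10.244.0.5
kubectl exec -ti busybox -- nslookup kubernetes.default 10.244.0.5

# if pods on the master can reach it but pods on the worker cannot, suspect the overlay (Flannel);
# run this on each host:
systemctl restart docker && systemctl restart kubelet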
1

I used kubectl -n kube-system rollout restart deployment coredns to fix the problem, but the remaining issue is that each time a new node is added to the cluster I have to restart coredns again.

jhonsfran
0

I'll add my solution even though the question is quite old. I had the same problem, but in my case the public DNS servers were unreachable due to network policies in the firewall. To solve that, I edited the ConfigMap used by coredns:

kubectl -n kube-system edit configmaps coredns -o yaml

Then I changed the forward option, putting in the public IP of a DNS server allowed by the firewall.
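
For reference, a sketch of what the edited part of the Corefile could look like (10.0.0.53 is a placeholder for a firewall-allowed resolver, not a value from this answer; the rest mirrors a typical kubeadm-generated Corefile and may not match every cluster):

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # was typically "forward . /etc/resolv.conf"; point it at a resolver the firewall allows
    forward . 10.0.0.53
    cache 30
    loop
    reload
    loadbalance
}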

Then I restarted the DNS service.

kubectl -n kube-system rollout restart deployment coredns
Christian Sicari
0

I faced a similar problem when I brought up a cluster on VirtualBox. It turned out that Flannel was using the interface with address 10.0.2.15.

kubectl get pod --namespace kube-system -l app=flannel

NAME                    READY   STATUS     RESTARTS   AGE
kube-flannel-ds-5dxdm   1/1     Running    0          10s
kube-flannel-ds-7z6jt   1/1     Running    0          6s
kube-flannel-ds-vqwrl   1/1     Running    0          3s

and then...

kubectl logs --namespace kube-system kube-flannel-ds-5dxdm -c kube-flannel

I0622 17:53:13.690431       1 main.go:463] Found network config - Backend type: vxlan
I0622 17:53:13.690716       1 match.go:248] Using interface with name enp0s3 and address 10.0.2.15
I0622 17:53:13.690734       1 match.go:270] Defaulting external address to interface address (10.0.2.15)

I added --iface=enp0s8 to the args:

kubectl edit DaemonSet/kube-flannel-ds --namespace kube-system


  containers:
  - name: kube-flannel
    image: quay.io/coreos/flannel:v0.10.0-amd64
    command:
    - /opt/bin/flanneld
    args:
    - --ip-masq
    - --kube-subnet-mgr
    - --iface=enp0s8

These threads helped me find a solution: "configuring flannel to use a non default interface in kubernetes" and https://github.com/flannel-io/flannel/blob/master/Documentation/troubleshooting.md
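
Depending on the DaemonSet's update strategy, the flannel pods may not be recreated automatically after the edit; a rough way to bounce them so the new --iface flag takes effect (the label is the one used in the kubectl get pod command above):

kubectl -n kube-system delete pod -l app=flannel
kubectl -n kube-system get pods -l app=flannel -o wide   # wait for the replacements to come back Running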

After that, coredns works fine:

kubectl exec -i -t dnsutils -- nslookup kubernetes.default
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1

and

kubectl logs --namespace=kube-system -l k8s-app=kube-dns
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.6
linux/amd64, go1.17.1, 13a9191
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 127.0.0.1:38619 - 51020 "HINFO IN 2350959537417504421.4590630780106405557. udp 57 false 512" NOERROR qr,rd,ra 132 0.055869098s
[INFO] 10.244.2.9:38352 - 33723 "A IN kubernetes.default.default.svc.cluster.local. udp 62 false 512" NXDOMAIN qr,aa,rd 155 0.000133217s
[INFO] 10.244.2.9:34998 - 21047 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.000088032s
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.6
0

My case was the simplest of all, I guess: the backend init container could not reach the Postgres pod by its hostname, because the pod hostname changed when I repacked it with Helm. In other words, the hostname I was looking for was wrong.

Some details:

  • I configured an initContainer within the backend pod to check DB availability before starting the backend app. That worked fine:
    ...
        initContainers:
        - name: wait-for-db
            image: postgres:13-alpine
            command: [ "sh", "-c", "until pg_isready -h db -p 5432 -U postgres:postgres; do echo 'not yet'; sleep 2; done" ]
    ...
    
  • Then I repacked my app and its DB into a Helm chart as separate pods, so the template for the backend looked like this:
    ...
        initContainers:
        - name: wait-for-db
            image: {{ $db_info.image }}:{{ $db_info.version }} 
            command: [ "sh", "-c", "until pg_isready -h db -p {{ (first $db_info.service.ports).port }} -U postgres:postgres; do echo 'not yet'; sleep 2; done" ]
    ...
    
  • The only problem was that Helm adds the chart name to the pod name, so the name of my DB pod changed from db-0 to myfancyapp-db-0, and the init container couldn't reach it.
  • The solution was to add .Release.Name to the database hostname in the template, so it would look like this:
    ...
        initContainers:
        - name: wait-for-db
            image: {{ $db_info.image }}:{{ $db_info.version }} 
            command: [ "sh", "-c", "until pg_isready -h {{ .Release.Name }}-db -p {{ (first $db_info.service.ports).port }} -U postgres:postgres; do echo 'not yet'; sleep 2; done" ]
    ...
    
    Notice the change from -h db to -h {{ .Release.Name }}-db (a quick check is sketched below).
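
A quick way to confirm that the new name actually resolves (a sketch; it assumes the chart creates a Service named myfancyapp-db in the same namespace, dnscheck is an arbitrary pod name, and busybox:1.28 is used for its working nslookup):

kubectl run -it --rm dnscheck --image=busybox:1.28 --restart=Never -- nslookup myfancyapp-db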

Thanks to the other people in the topic: they mentioned that it could be something with hostname resolution, which gave me the clue that the problem could be with the hostname itself. And the thing with Helm might not be obvious when you are taking your first steps with Kubernetes/Helm, like myself.

runout