Vagrant, VM OS: ubuntu/bionic64, swap disabled
Kubernetes version: 1.18.0
Infrastructure: 1 HAProxy node, 3 external etcd nodes, 3 Kubernetes master nodes
Attempt: I am trying to set up HA Rancher, so I am first setting up an HA Kubernetes cluster with kubeadm, following the official documentation.
Expected behavior: all Kubernetes components come up and I can open Weave Scope and see all the nodes.
Actual behavior: CoreDNS is still not ready even after installing the CNI (Weave Net), so Weave Scope (the visualization UI) does not work either, since it depends on networking (Weave Net and CoreDNS) being healthy.
# kubeadm config
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: stable
controlPlaneEndpoint: "172.16.0.30:6443"
etcd:
  external:
    caFile: /etc/rancher-certs/ca-chain.cert.pem
    keyFile: /etc/rancher-certs/etcd.key.pem
    certFile: /etc/rancher-certs/etcd.cert.pem
    endpoints:
    - https://172.16.0.20:2379
    - https://172.16.0.21:2379
    - https://172.16.0.22:2379
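For completeness, the control plane was brought up following the official kubeadm HA guide, roughly like this (a sketch only; the real token, CA cert hash and certificate key come from the kubeadm init output):

# on the first master (rancher-0), with the config above saved as kubeadm-config.yaml
sudo kubeadm init --config kubeadm-config.yaml --upload-certs

# on the other masters, run the control-plane join command printed by kubeadm init
# (the token, hash and certificate key below are placeholders)
sudo kubeadm join 172.16.0.30:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <certificate-key>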
-------------------------------------------------------------------------------
# firewall
vagrant@rancher-0:~$ sudo ufw status
Status: active
To Action From
-- ------ ----
OpenSSH ALLOW Anywhere
Anywhere ALLOW 172.16.0.0/26
OpenSSH (v6) ALLOW Anywhere (v6)
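(For context, a rule set like the above corresponds roughly to these commands; 172.16.0.0/26 is the private network shared by the VMs:)

sudo ufw allow OpenSSH
sudo ufw allow from 172.16.0.0/26
sudo ufw enable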
-------------------------------------------------------------------------------
# no swap
vagrant@rancher-0:~$ free -h
total used free shared buff/cache available
Mem: 1.9G 928M 97M 1.4M 966M 1.1G
Swap: 0B 0B 0B
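(Swap is turned off during provisioning; the usual way to do this for kubeadm is something like:)

sudo swapoff -a
# keep it off across reboots by commenting out the swap entry in /etc/fstab
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab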
k8s diagnostic output:
vagrant@rancher-0:~$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
rancher-0 Ready master 14m v1.18.0 10.0.2.15 <none> Ubuntu 18.04.4 LTS 4.15.0-99-generic docker://19.3.12
rancher-1 Ready master 9m23s v1.18.0 10.0.2.15 <none> Ubuntu 18.04.4 LTS 4.15.0-99-generic docker://19.3.12
rancher-2 Ready master 4m26s v1.18.0 10.0.2.15 <none> Ubuntu 18.04.4 LTS 4.15.0-99-generic docker://19.3.12
vagrant@rancher-0:~$ kubectl get services --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cert-manager cert-manager ClusterIP 10.106.146.236 <none> 9402/TCP 17m
cert-manager cert-manager-webhook ClusterIP 10.102.162.87 <none> 443/TCP 17m
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 18m
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 18m
weave weave-scope-app NodePort 10.96.110.153 <none> 80:30276/TCP 17m
vagrant@rancher-0:~$ kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cert-manager cert-manager-bd9d585bd-x8qpb 0/1 Pending 0 16m <none> <none> <none> <none>
cert-manager cert-manager-cainjector-76c6657c55-d8fpj 0/1 Pending 0 16m <none> <none> <none> <none>
cert-manager cert-manager-webhook-64b9b4fdfd-sspjx 0/1 Pending 0 16m <none> <none> <none> <none>
kube-system coredns-66bff467f8-9z4f8 0/1 Running 0 10m 10.32.0.2 rancher-1 <none> <none>
kube-system coredns-66bff467f8-zkk99 0/1 Running 0 16m 10.32.0.2 rancher-0 <none> <none>
kube-system kube-apiserver-rancher-0 1/1 Running 0 16m 10.0.2.15 rancher-0 <none> <none>
kube-system kube-apiserver-rancher-1 1/1 Running 0 12m 10.0.2.15 rancher-1 <none> <none>
kube-system kube-apiserver-rancher-2 1/1 Running 0 7m23s 10.0.2.15 rancher-2 <none> <none>
kube-system kube-controller-manager-rancher-0 1/1 Running 0 16m 10.0.2.15 rancher-0 <none> <none>
kube-system kube-controller-manager-rancher-1 1/1 Running 0 12m 10.0.2.15 rancher-1 <none> <none>
kube-system kube-controller-manager-rancher-2 1/1 Running 0 7m24s 10.0.2.15 rancher-2 <none> <none>
kube-system kube-proxy-grts7 1/1 Running 0 12m 10.0.2.15 rancher-1 <none> <none>
kube-system kube-proxy-jv9lm 1/1 Running 0 16m 10.0.2.15 rancher-0 <none> <none>
kube-system kube-proxy-z2lrc 1/1 Running 0 7m25s 10.0.2.15 rancher-2 <none> <none>
kube-system kube-scheduler-rancher-0 1/1 Running 0 16m 10.0.2.15 rancher-0 <none> <none>
kube-system kube-scheduler-rancher-1 1/1 Running 0 12m 10.0.2.15 rancher-1 <none> <none>
kube-system kube-scheduler-rancher-2 1/1 Running 0 7m23s 10.0.2.15 rancher-2 <none> <none>
kube-system weave-net-nnvkd 2/2 Running 0 7m25s 10.0.2.15 rancher-2 <none> <none>
kube-system weave-net-pgxnq 2/2 Running 0 12m 10.0.2.15 rancher-1 <none> <none>
kube-system weave-net-q22bh 2/2 Running 0 16m 10.0.2.15 rancher-0 <none> <none>
weave weave-scope-agent-9gwj2 1/1 Running 0 16m 10.0.2.15 rancher-0 <none> <none>
weave weave-scope-agent-mznp7 1/1 Running 0 7m25s 10.0.2.15 rancher-2 <none> <none>
weave weave-scope-agent-v7jql 1/1 Running 0 12m 10.0.2.15 rancher-1 <none> <none>
weave weave-scope-app-bc7444d59-cjpd8 0/1 Pending 0 16m <none> <none> <none> <none>
weave weave-scope-cluster-agent-5c5dcc8cb-ln4hg 0/1 Pending 0 16m <none> <none> <none> <none>
vagrant@rancher-0:~$ kubectl describe node rancher-0
Name: rancher-0
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=rancher-0
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 28 Jul 2020 09:24:17 +0000
Taints: node-role.kubernetes.io/master:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: rancher-0
AcquireTime: <unset>
RenewTime: Tue, 28 Jul 2020 09:35:33 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 28 Jul 2020 09:24:47 +0000 Tue, 28 Jul 2020 09:24:47 +0000 WeaveIsUp Weave pod has set this
MemoryPressure False Tue, 28 Jul 2020 09:35:26 +0000 Tue, 28 Jul 2020 09:24:17 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 28 Jul 2020 09:35:26 +0000 Tue, 28 Jul 2020 09:24:17 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 28 Jul 2020 09:35:26 +0000 Tue, 28 Jul 2020 09:24:17 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 28 Jul 2020 09:35:26 +0000 Tue, 28 Jul 2020 09:24:52 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.0.2.15
Hostname: rancher-0
Capacity:
cpu: 2
ephemeral-storage: 10098432Ki
hugepages-2Mi: 0
memory: 2040812Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 9306714916
hugepages-2Mi: 0
memory: 1938412Ki
pods: 110
System Info:
Machine ID: 9b1bc8a8ef2c4e5b844624a36302d877
System UUID: A282600C-28F8-4D49-A9D3-6F05CA16865E
Boot ID: 77746bf5-7941-4e72-817e-24f149172158
Kernel Version: 4.15.0-99-generic
OS Image: Ubuntu 18.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.12
Kubelet Version: v1.18.0
Kube-Proxy Version: v1.18.0
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system coredns-66bff467f8-zkk99 100m (5%) 0 (0%) 70Mi (3%) 170Mi (8%) 11m
kube-system kube-apiserver-rancher-0 250m (12%) 0 (0%) 0 (0%) 0 (0%) 11m
kube-system kube-controller-manager-rancher-0 200m (10%) 0 (0%) 0 (0%) 0 (0%) 11m
kube-system kube-proxy-jv9lm 0 (0%) 0 (0%) 0 (0%) 0 (0%) 11m
kube-system kube-scheduler-rancher-0 100m (5%) 0 (0%) 0 (0%) 0 (0%) 11m
kube-system weave-net-q22bh 20m (1%) 0 (0%) 0 (0%) 0 (0%) 11m
weave weave-scope-agent-9gwj2 100m (5%) 0 (0%) 100Mi (5%) 2000Mi (105%) 11m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 770m (38%) 0 (0%)
memory 170Mi (8%) 2170Mi (114%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 11m kubelet, rancher-0 Starting kubelet.
Warning ImageGCFailed 11m kubelet, rancher-0 failed to get imageFs info: unable to find data in memory cache
Normal NodeHasSufficientMemory 11m (x3 over 11m) kubelet, rancher-0 Node rancher-0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 11m (x3 over 11m) kubelet, rancher-0 Node rancher-0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 11m (x2 over 11m) kubelet, rancher-0 Node rancher-0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 11m kubelet, rancher-0 Updated Node Allocatable limit across pods
Normal Starting 11m kubelet, rancher-0 Starting kubelet.
Normal NodeHasSufficientMemory 11m kubelet, rancher-0 Node rancher-0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 11m kubelet, rancher-0 Node rancher-0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 11m kubelet, rancher-0 Node rancher-0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 11m kubelet, rancher-0 Updated Node Allocatable limit across pods
Normal Starting 11m kube-proxy, rancher-0 Starting kube-proxy.
Normal NodeReady 10m kubelet, rancher-0 Node rancher-0 status is now: NodeReady
vagrant@rancher-0:~$ kubectl exec -n kube-system weave-net-nnvkd -c weave -- /home/weave/weave --local status
Version: 2.6.5 (failed to check latest version - see logs; next check at 2020/07/28 15:27:34)
Service: router
Protocol: weave 1..2
Name: 5a:40:7b:be:35:1d(rancher-2)
Encryption: disabled
PeerDiscovery: enabled
Targets: 0
Connections: 0
Peers: 1
TrustedSubnets: none
Service: ipam
Status: ready
Range: 10.32.0.0/12
DefaultSubnet: 10.32.0.0/12
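"Connections: 0" and "Peers: 1" show that this node's Weave router does not see the other two peers at all. The per-peer detail can be pulled the same way, for example:

kubectl exec -n kube-system weave-net-nnvkd -c weave -- /home/weave/weave --local status peers
kubectl exec -n kube-system weave-net-nnvkd -c weave -- /home/weave/weave --local status connections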
vagrant@rancher-0:~$ kubectl logs weave-net-nnvkd -c weave -n kube-system
INFO: 2020/07/28 09:34:15.989759 Command line options: map[conn-limit:200 datapath:datapath db-prefix:/weavedb/weave-net docker-api: expect-npc:true host-root:/host http-addr:127.0.0.1:6784 ipalloc-init:consensus=0 ipalloc-range:10.32.0.0/12 metrics-addr:0.0.0.0:6782 name:5a:40:7b:be:35:1d nickname:rancher-2 no-dns:true port:6783]
INFO: 2020/07/28 09:34:15.989792 weave 2.6.5
INFO: 2020/07/28 09:34:16.178429 Bridge type is bridged_fastdp
INFO: 2020/07/28 09:34:16.178451 Communication between peers is unencrypted.
INFO: 2020/07/28 09:34:16.182442 Our name is 5a:40:7b:be:35:1d(rancher-2)
INFO: 2020/07/28 09:34:16.182499 Launch detected - using supplied peer list: []
INFO: 2020/07/28 09:34:16.196598 Checking for pre-existing addresses on weave bridge
INFO: 2020/07/28 09:34:16.204735 [allocator 5a:40:7b:be:35:1d] No valid persisted data
INFO: 2020/07/28 09:34:16.206236 [allocator 5a:40:7b:be:35:1d] Initialising via deferred consensus
INFO: 2020/07/28 09:34:16.206291 Sniffing traffic on datapath (via ODP)
INFO: 2020/07/28 09:34:16.210065 Listening for HTTP control messages on 127.0.0.1:6784
INFO: 2020/07/28 09:34:16.210471 Listening for metrics requests on 0.0.0.0:6782
INFO: 2020/07/28 09:34:16.275523 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.15.0-99-generic&flag_kubernetes-cluster-size=0&flag_kubernetes-cluster-uid=aca5a8cc-27ca-4e8f-9964-4cf3971497c6&flag_kubernetes-version=v1.18.6&os=linux&signature=7uMaGpuc3%2F8ZtHqGoHyCnJ5VfOJUmnL%2FD6UZSqWYxKA%3D&version=2.6.5: dial tcp: lookup checkpoint-api.weave.works on 10.96.0.10:53: write udp 10.0.2.15:43742->10.96.0.10:53: write: operation not permitted
INFO: 2020/07/28 09:34:17.052454 [kube-peers] Added myself to peer list &{[{96:cd:5b:7f:65:73 rancher-1} {5a:40:7b:be:35:1d rancher-2}]}
DEBU: 2020/07/28 09:34:17.065599 [kube-peers] Nodes that have disappeared: map[96:cd:5b:7f:65:73:{96:cd:5b:7f:65:73 rancher-1}]
DEBU: 2020/07/28 09:34:17.065836 [kube-peers] Preparing to remove disappeared peer 96:cd:5b:7f:65:73
DEBU: 2020/07/28 09:34:17.079511 [kube-peers] Noting I plan to remove 96:cd:5b:7f:65:73
DEBU: 2020/07/28 09:34:17.095598 weave DELETE to http://127.0.0.1:6784/peer/96:cd:5b:7f:65:73 with map[]
INFO: 2020/07/28 09:34:17.097095 [kube-peers] rmpeer of 96:cd:5b:7f:65:73: 0 IPs taken over from 96:cd:5b:7f:65:73
DEBU: 2020/07/28 09:34:17.644909 [kube-peers] Nodes that have disappeared: map[]
INFO: 2020/07/28 09:34:17.658557 Assuming quorum size of 1
10.32.0.1
DEBU: 2020/07/28 09:34:17.761697 registering for updates for node delete events
vagrant@rancher-0:~$ kubectl logs coredns-66bff467f8-9z4f8 -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.7
linux/amd64, go1.13.6, da7f65b
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
I0728 09:31:10.764496 1 trace.go:116] Trace[2019727887]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2020-07-28 09:30:40.763691008 +0000 UTC m=+0.308910646) (total time: 30.000692218s):
Trace[2019727887]: [30.000692218s] [30.000692218s] END
E0728 09:31:10.764526 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
I0728 09:31:10.764666 1 trace.go:116] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2020-07-28 09:30:40.761333538 +0000 UTC m=+0.306553222) (total time: 30.00331917s):
Trace[1427131847]: [30.00331917s] [30.00331917s] END
E0728 09:31:10.764673 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
I0728 09:31:10.767435 1 trace.go:116] Trace[939984059]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2020-07-28 09:30:40.762085835 +0000 UTC m=+0.307305485) (total time: 30.005326233s):
Trace[939984059]: [30.005326233s] [30.005326233s] END
E0728 09:31:10.767569 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
...
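In hindsight, the i/o timeouts above (together with the "write: operation not permitted" error in the Weave log) point at the host firewall rather than at CoreDNS itself. One quick way to confirm that, assuming ufw logging is on (the default), is to look for dropped packets involving the service/pod networks in the kernel log:

sudo journalctl -k --since "15 minutes ago" | grep "UFW BLOCK" | grep -E "10\.96\.|10\.32\." | tail -n 20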
vagrant@rancher-0:~$ kubectl describe pod coredns-66bff467f8-9z4f8 -n kube-system
Name: coredns-66bff467f8-9z4f8
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: rancher-1/10.0.2.15
Start Time: Tue, 28 Jul 2020 09:30:38 +0000
Labels: k8s-app=kube-dns
pod-template-hash=66bff467f8
Annotations: <none>
Status: Running
IP: 10.32.0.2
IPs:
IP: 10.32.0.2
Controlled By: ReplicaSet/coredns-66bff467f8
Containers:
coredns:
Container ID: docker://899cfd54a5281939dcb09eece96ff3024a3b4c444e982bda74b8334504a6a369
Image: k8s.gcr.io/coredns:1.6.7
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:2c8d61c46f484d881db43b34d13ca47a269336e576c81cf007ca740fa9ec0800
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Tue, 28 Jul 2020 09:30:40 +0000
Ready: False
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-znl2p (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-znl2p:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-znl2p
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 28m default-scheduler Successfully assigned kube-system/coredns-66bff467f8-9z4f8 to rancher-1
Normal Pulled 28m kubelet, rancher-1 Container image "k8s.gcr.io/coredns:1.6.7" already present on machine
Normal Created 28m kubelet, rancher-1 Created container coredns
Normal Started 28m kubelet, rancher-1 Started container coredns
Warning Unhealthy 3m35s (x151 over 28m) kubelet, rancher-1 Readiness probe failed: HTTP probe failed with statuscode: 503
Edit 0:
The issue is solved. The problem was that my ufw rules allowed traffic from the CIDR of the VM network but not from Kubernetes itself (i.e. from the Docker containers / pod network). After configuring ufw to also allow the ports documented on the Kubernetes website and the ports documented on the Weave Net website, the cluster works as expected.
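For anyone running into the same thing, the kind of ufw rules this amounts to on each node looks roughly like this (a sketch based on the required-ports lists in the kubeadm and Weave Net documentation, not necessarily the exact rule set; adjust to your topology):

# Kubernetes control-plane ports (kubeadm "check required ports" list, v1.18)
sudo ufw allow 6443/tcp         # Kubernetes API server
sudo ufw allow 2379:2380/tcp    # etcd client/peer API (on the etcd nodes)
sudo ufw allow 10250/tcp        # kubelet API
sudo ufw allow 10251/tcp        # kube-scheduler
sudo ufw allow 10252/tcp        # kube-controller-manager
sudo ufw allow 30000:32767/tcp  # NodePort services (e.g. the Weave Scope UI)

# Weave Net ports (Weave Net docs)
sudo ufw allow 6783/tcp         # Weave Net control
sudo ufw allow 6783:6784/udp    # Weave Net data (fastdp/sleeve)

sudo ufw reload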