
I tried to install Kubernetes with kubeadm on 3 virtual machines running Debian on my laptop, one as the master node and the other two as worker nodes. I followed the tutorials on kubernetes.io exactly. I initialized the cluster with the command kubeadm init --pod-network-cidr=10.244.0.0/16 and joined the workers with the corresponding kubeadm join command. I installed Flannel as the network overlay with the command kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml.
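
For reference, the full sequence looked roughly like this (the token and hash are placeholders; the real values came from the kubeadm init output):

# on the master (192.168.1.100):
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

# on each worker, using the join command printed by kubeadm init:
sudo kubeadm join 192.168.1.100:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>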

The response of the command kubectl get nodes -o wide looks fine:

NAME        STATUS   ROLES    AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
k8smaster   Ready    master   20h   v1.18.3   192.168.1.100   <none>        Debian GNU/Linux 10 (buster)   4.19.0-9-amd64   docker://19.3.9
k8snode1    Ready    <none>   20h   v1.18.3   192.168.1.101   <none>        Debian GNU/Linux 10 (buster)   4.19.0-9-amd64   docker://19.3.9
k8snode2    Ready    <none>   20h   v1.18.3   192.168.1.102   <none>        Debian GNU/Linux 10 (buster)   4.19.0-9-amd64   docker://19.3.9

The response of the command kubectl get pods --all-namespaces -o wide doesn't show any errors:

NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE    IP              NODE        NOMINATED NODE   READINESS GATES
kube-system   coredns-66bff467f8-7hlnp             1/1     Running   9          20h    10.244.0.22     k8smaster   <none>           <none>
kube-system   coredns-66bff467f8-wmvx4             1/1     Running   11         20h    10.244.0.23     k8smaster   <none>           <none>
kube-system   etcd-k8smaster                      1/1     Running   11         20h    192.168.1.100   k8smaster   <none>           <none>
kube-system   kube-apiserver-k8smaster            1/1     Running   9          20h    192.168.1.100   k8smaster   <none>           <none>
kube-system   kube-controller-manager-k8smaster   1/1     Running   11         20h    192.168.1.100   k8smaster   <none>           <none>
kube-system   kube-flannel-ds-amd64-9c5rr          1/1     Running   17         20h    192.168.1.102   k8snode2    <none>           <none>
kube-system   kube-flannel-ds-amd64-klw2p          1/1     Running   21         20h    192.168.1.101   k8snode1    <none>           <none>
kube-system   kube-flannel-ds-amd64-x7vm7          1/1     Running   11         20h    192.168.1.100   k8smaster   <none>           <none>
kube-system   kube-proxy-jdfzg                    1/1     Running   11         19h    192.168.1.101   k8snode1    <none>           <none>
kube-system   kube-proxy-lcdvb                    1/1     Running   6          19h    192.168.1.102   k8snode2    <none>           <none>
kube-system   kube-proxy-w6jmf                    1/1     Running   11         20h    192.168.1.100   k8smaster   <none>           <none>
kube-system   kube-scheduler-k8smaster            1/1     Running   10         20h    192.168.1.100   k8smaster   <none>           <none>
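
(As an extra sanity check, the flannel subnet lease on each node can be inspected directly on the host; a minimal sketch, to be run on every node:)

# flannel records the per-node subnet and MTU here once it has a lease
cat /run/flannel/subnet.env
# the cni0 bridge should carry the first address of that subnet
ip addr show cni0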

Then I tried to create a Pod with the command kubectl apply -f podexample.yml, using the following manifest:

apiVersion: v1
kind: Pod
metadata:
  name: example 
spec:
  containers:
  - name: nginx 
    image: nginx

The command kubectl get pods -o wide shows that the Pod was created on worker node1 and is in the Running state:

NAME      READY   STATUS    RESTARTS   AGE    IP            NODE       NOMINATED NODE   READINESS GATES
example   1/1     Running   0          135m   10.244.1.14   k8snode1   <none>           <none>

The thing is, when I try to connect to the Pod with the command curl -I 10.244.1.14, I get the following response on the master node:

curl: (7) Failed to connect to 10.244.1.14 port 80: Connection timed out

but the same command on worker node1 responds successfully with:

HTTP/1.1 200 OK
Server: nginx/1.17.10
Date: Sat, 23 May 2020 19:45:05 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 14 Apr 2020 14:19:26 GMT
Connection: keep-alive
ETag: "5e95c66e-264"
Accept-Ranges: bytes

I thought maybe that's because kube-proxy is somehow not running on the master node, but ps aux | grep kube-proxy shows that it is running:

root     16747  0.0  1.6 140412 33024 ?        Ssl  13:18   0:04 /usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/config.conf --hostname-override=k8smaster
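
To rule out errors in kube-proxy or flannel on the master, their logs can also be checked; a quick sketch (pod names taken from the listing above):

kubectl logs -n kube-system kube-proxy-w6jmf
kubectl logs -n kube-system kube-flannel-ds-amd64-x7vm7
journalctl -u kubelet --no-pager | tail -n 50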

Then I checked the kernel routing table with the command ip route, and it shows that packets destined for 10.244.1.0/24 get routed to flannel:

default via 192.168.1.1 dev enp0s3 onlink 
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1 
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink 
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink 
169.254.0.0/16 dev enp0s3 scope link metric 1000 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.1.0/24 dev enp0s3 proto kernel scope link src 192.168.1.100 
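
The route alone doesn't prove the VXLAN path works; a hedged sketch of digging one level deeper on the master (flannel's VXLAN backend uses UDP port 8472 by default):

# the flannel.1 device plus its FDB and ARP entries for the remote nodes
ip -d link show flannel.1
bridge fdb show dev flannel.1
ip neigh show dev flannel.1
# watch for encapsulated traffic on the uplink while curl runs from the master
sudo tcpdump -ni enp0s3 udp port 8472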

Everything looks fine to me, and I don't know what else I should check to find the problem. Am I missing something?

UPDATE1:

If I start an NGINX container on worker node1 and map its port 80 to port 80 of the worker node1 host, then I can connect to it from the master node with the command curl -I 192.168.1.101. Also, I didn't add any iptables rules and there is no firewall daemon like UFW installed on the machines, so I don't think it's a firewall issue.
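
(For reference, that test looked roughly like this; the container name is arbitrary:)

# on worker node1: publish the container's port 80 directly on the host
docker run -d --name nginx-test -p 80:80 nginx
# from the master node:
curl -I 192.168.1.101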

UPDATE2:

I recreated the cluster and used Canal instead of Flannel; still no luck.

UPDATE3:

I took a look at the Canal and Flannel logs with the following commands, and everything seems fine:

kubectl logs -n kube-system canal-c4wtk calico-node
kubectl logs -n kube-system canal-c4wtk kube-flannel
kubectl logs -n kube-system canal-b2fkh calico-node
kubectl logs -n kube-system canal-b2fkh kube-flannel 

UPDATE4:

For the sake of completeness, here are the logs of the containers mentioned above.

UPDATE5:

I tried to install specific versions of the Kubernetes components and Docker, to check whether there is an issue related to a version mismatch, using the following commands:

sudo apt-get install docker-ce=18.06.1~ce~3-0~debian
sudo apt-get install -y kubelet=1.12.2-00 kubeadm=1.12.2-00 kubectl=1.12.2-00 kubernetes-cni=0.6.0-00
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml

but nothing changed.

I even updated the file /etc/bash.bashrc on all nodes to clear any proxy settings, just to make sure it's not a proxy issue:

export HTTP_PROXY=
export http_proxy=
export NO_PROXY=127.0.0.0/8,192.168.0.0/16,172.0.0.0/8,10.0.0.0/8

and also added the following environment settings to the Docker systemd unit file /lib/systemd/system/docker.service on all nodes:

Environment="HTTP_PROXY="
Environment="NO_PROXY="
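
(A systemd drop-in is a cleaner way to apply the same override, since /lib/systemd/system/docker.service is overwritten on package upgrades; a minimal sketch, with the drop-in file name being my own choice:)

sudo mkdir -p /etc/systemd/system/docker.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/docker.service.d/no-proxy.conf
[Service]
Environment="HTTP_PROXY="
Environment="HTTPS_PROXY="
Environment="NO_PROXY="
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker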

Then I rebooted all nodes, and when I logged in, I still got curl: (7) Failed to connect to 10.244.1.12 port 80: Connection timed out.

UPDATE6:

I even tried to set up the cluster on CentOS machines, thinking maybe the issue was something Debian-specific. I also stopped and disabled firewalld to make sure the firewall isn't causing the problem, but I got the exact same result again: Failed to connect to 10.244.1.2 port 80: Connection timed out.

The only thing I'm now suspicious about is VirtualBox and the virtual machines' network configuration: the virtual machines are attached to a Bridged Adapter connected to my wireless network interface.
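
(One thing worth verifying in a bridged VirtualBox setup is which host interface flannel picks, since by default it takes the first one it finds. A hedged sketch of pinning it explicitly to enp0s3, the bridged interface seen in the routing table above:)

# in kube-flannel.yml (or via kubectl -n kube-system edit ds kube-flannel-ds-amd64),
# add --iface to the kube-flannel container's args:
      args:
      - --ip-masq
      - --kube-subnet-mgr
      - --iface=enp0s3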

UPDATE7:

I went inside the created Pod and figured out that there is no internet connectivity inside the Pod. So I created another Pod from an NGINX image that has commands like curl, wget, ping and traceroute, and tried curl https://www.google.com -I, which returned: curl: (6) Could not resolve host: www.google.com. I checked the /etc/resolv.conf file and found that the DNS server address inside the Pod is 10.96.0.10. I changed the DNS server to 8.8.8.8, but curl https://www.google.com -I still results in curl: (6) Could not resolve host: www.google.com. I tried to ping 8.8.8.8 and the result is 56 packets transmitted, 0 received, 100% packet loss, time 365ms. As a last step I tried traceroute 8.8.8.8 and got the following result:

 1  10.244.1.1 (10.244.1.1)  0.116 ms  0.056 ms  0.052 ms
 2  * * *
 3  * * *
(hops 4 through 30 also timed out: * * *)

I don't know whether the fact that there is no internet connectivity inside the Pod has anything to do with the problem that I can't connect to the Pod from cluster nodes other than the one it is deployed on.
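
(A quick way to check in-cluster DNS and outbound routing from a throwaway Pod; the pod names are arbitrary and 10.96.0.10 is the cluster DNS address found above:)

# query the cluster DNS service directly
kubectl run dnstest --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup kubernetes.default 10.96.0.10
# check plain outbound connectivity, independent of DNS
kubectl run pingtest --rm -it --restart=Never --image=busybox:1.28 -- \
  ping -c 3 8.8.8.8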

chubock
  • I remember having this issue like a year back; I fixed it by changing from flannel to canal, maybe try that – paltaa May 23 '20 at 20:18
  • Check if there is a firewall on the worker nodes, and if they accept traffic from your node. – Burak Serdar May 23 '20 at 20:18
  • @BurakSerdar if i start an nginx container by docker on worker node1 and map port 80 of the worker node1 to port 80 of the created container, i can connect to the nginx on worker node1 from master node with `curl -I 192.168.1.101`. So, i think it's not about firewall. also i didn't set any firewall rule with `iptables` and there is no `UFW` installed on the machines. – chubock May 23 '20 at 20:26
  • can you reach a pod on one worker node from the other one? – suren May 23 '20 at 21:07
  • @suren no, the pod is only accessible from worker node1; worker node2 gets the same response as the master node. – chubock May 23 '20 at 21:16
  • Where is it running? I mean on the cloud? You need to open up `ipip` protocol. – suren May 23 '20 at 22:17
  • @suren machines are running on my laptop on `VirtualBox` – chubock May 23 '20 at 22:23
  • @paltaa i recreated the cluster with canal this time, no luck. – chubock May 23 '20 at 23:07
  • any information from the flannel or canal logs ? – paltaa May 23 '20 at 23:09
  • @paltaa i don't know where canal logs are persisted. should i configure it in someway? – chubock May 23 '20 at 23:14
  • @paltaa i tried with `kubectl logs -n kube-system canal-b2fkh kube-flannel | grep ERROR` and `kubectl logs -n kube-system canal-b2fkh calico-node | grep ERROR` and `kubectl logs -n kube-system canal-c4wtk calico-node | grep ERROR` and `kubectl logs -n kube-system canal-b2fkh kube-flannel | grep ERROR` and found nothing. – chubock May 23 '20 at 23:20
  • How are you spinning up the cluster? Maybe don't grep and just post the logs and let's see what's happening; it's clear that it is a network problem – paltaa May 23 '20 at 23:30
  • @paltaa i uploaded the log files and updated the answer. as i said, there is nothing suspicious in logs. – chubock May 24 '20 at 00:00
  • Which cri are you using, docker or containerd? – Elgarni May 24 '20 at 13:21
  • @Elgarni it's `Docker` – chubock May 24 '20 at 15:39
  • Try `kubectl restart docker` on all nodes (starting from master). – Elgarni May 25 '20 at 11:21
  • If that does not work, try to watch the logs of kube proxy on the node a pod is running on, and then do any curl inside the pod, and share the logs outputted by the kube proxy – Elgarni May 25 '20 at 11:21
  • @Elgarni `kubectl restart docker` results in `Error: unknown command "restart" for "kubectl"`. I tried to execute `curl` inside the pod but it wasn't installed in the `nginx` image. I tried to use `wget` and it wasn't installed either. I tried to install it with `apt-get` and figured out that the POD can't connect to the internet. – chubock May 25 '20 at 13:32
  • My bad. I meant `systemctl`, restarting the service – Elgarni May 25 '20 at 13:57
  • @Elgarni thanks for your comment. I went inside the pod and explored a little bit, and it seems like I have DNS issues. – chubock May 25 '20 at 14:07
  • I have experienced the exact same behavior with my kubeadm cluster on GCP with the Calico CNI. What solved my issue was to set iptables to legacy (sometimes suggested for Debian): `sudo update-alternatives --set iptables /usr/sbin/iptables-legacy` and `sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy`. After that I deployed Calico again and rebooted all the servers. I still don't know why this fixed the issue. Can you try and test this? Meanwhile I will try to reproduce this error in VirtualBox. – acid_fuji May 26 '20 at 14:19
  • While debugging I traced the packets: they arrived at the other server but somehow did not go through to the pod. My suspicion is that the CNI is picking up the wrong interface. [This case](https://stackoverflow.com/questions/47845739/configuring-flannel-to-use-a-non-default-interface-in-kubernetes) explains that flannel picks the first interface on a host. Can you check which interface flannel is using, or perhaps point it at the right one as suggested in that topic? – acid_fuji May 26 '20 at 14:26
  • @acid_fuji OMG, the command `sudo update-alternatives --set iptables /usr/sbin/iptables-legacy` solved the issue. Thanks man. Please post an answer and I'll mark it as accepted. – chubock May 26 '20 at 16:22

1 Answer


Debian Buster uses nftables as the backend for iptables, and the nftables backend is not compatible with the Kubernetes network setup (it breaks the rules that kube-proxy and the CNI plugins install). So you have to switch iptables to the legacy backend instead of nftables, on every node, with the following commands:

sudo update-alternatives --set iptables /usr/sbin/iptables-legacy 
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
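
A minimal sketch of applying this to the whole cluster (the reboot follows what worked in the comments above; restarting Docker and the kubelet may be enough, but a reboot guarantees the stale rules are gone):

# run on the master and on both workers
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
sudo reboot

# after the nodes are back, re-check cross-node connectivity from the master
kubectl get pods -o wide        # note the pod's current IP
curl -I <pod IP>                # should now return HTTP/1.1 200 OK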
acid_fuji
  • It seems there are two sets of packet-filtering modules in the kernel: `ip_tables` and `nf_tables`. Until recently iptables used the `ip_tables` module, but `iptables 1.8` deprecated `ip_tables` and uses `nf_tables` under the hood. This problem arises when `iptables 1.6` and `iptables 1.8` somehow get invoked on the same host. – chubock May 27 '20 at 04:39