
I have a Kubernetes cluster with 1 master and 3 worker nodes.

Calico v3.7.3 and Kubernetes v1.16.0, installed via Kubespray (https://github.com/kubernetes-sigs/kubespray).

Before this, I could deploy all pods without any problems.

Now I can't start a few pods (Ceph):

kubectl get all --namespace=ceph
NAME                                 READY   STATUS             RESTARTS   AGE
pod/ceph-cephfs-test                 0/1     Pending            0          162m
pod/ceph-mds-665d849f4f-fzzwb        0/1     Pending            0          162m
pod/ceph-mon-744f6dc9d6-jtbgk        0/1     CrashLoopBackOff   24         162m
pod/ceph-mon-744f6dc9d6-mqwgb        0/1     CrashLoopBackOff   24         162m
pod/ceph-mon-744f6dc9d6-zthpv        0/1     CrashLoopBackOff   24         162m
pod/ceph-mon-check-6f474c97f-gjr9f   1/1     Running            0          162m


NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/ceph-mon   ClusterIP   None         <none>        6789/TCP   162m

NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR       AGE
daemonset.apps/ceph-osd   0         0         0       0            0           node-type=storage   162m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ceph-mds         0/1     1            0           162m
deployment.apps/ceph-mon         0/3     3            0           162m
deployment.apps/ceph-mon-check   1/1     1            1           162m

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/ceph-mds-665d849f4f        1         1         0       162m
replicaset.apps/ceph-mon-744f6dc9d6        3         3         0       162m
replicaset.apps/ceph-mon-check-6f474c97f   1         1         1       162m

But another namespace (kube-system) is OK:

kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-6d57b44787-xlj89   1/1     Running   19         24d
calico-node-dwm47                          1/1     Running   310        19d
calico-node-hhgzk                          1/1     Running   15         24d
calico-node-tk4mp                          1/1     Running   309        19d
calico-node-w7zvs                          1/1     Running   312        19d
coredns-74c9d4d795-jrxjn                   1/1     Running   0          2d23h
coredns-74c9d4d795-psf2v                   1/1     Running   2          18d
dns-autoscaler-7d95989447-7kqsn            1/1     Running   10         24d
kube-apiserver-master                      1/1     Running   4          24d
kube-controller-manager-master             1/1     Running   3          24d
kube-proxy-9bt8m                           1/1     Running   2          19d
kube-proxy-cbrcl                           1/1     Running   4          19d
kube-proxy-stj5g                           1/1     Running   0          19d
kube-proxy-zql86                           1/1     Running   0          19d
kube-scheduler-master                      1/1     Running   3          24d
kubernetes-dashboard-7c547b4c64-6skc7      1/1     Running   591        24d
nginx-proxy-worker1                        1/1     Running   2          19d
nginx-proxy-worker2                        1/1     Running   0          19d
nginx-proxy-worker3                        1/1     Running   0          19d
nodelocaldns-6t92x                         1/1     Running   2          19d
nodelocaldns-kgm4t                         1/1     Running   0          19d
nodelocaldns-xl8zg                         1/1     Running   0          19d
nodelocaldns-xwlwk                         1/1     Running   12         24d
tiller-deploy-8557598fbc-7f2w6             1/1     Running   0          131m

I use CentOS 7:

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

The error when trying to get the pod logs:

Get https://10.2.67.203:10250/containerLogs/ceph/ceph-mon-744f6dc9d6-mqwgb/ceph-mon?tailLines=5000&timestamps=true: dial tcp 10.2.67.203:10250: connect: no route to host

Has anyone come across this and can help me? I will provide any additional information on request.

Events from the pending pods:

Warning FailedScheduling 98s (x125 over 3h1m) default-scheduler 0/4 nodes are available: 4 node(s) didn't match node selector.
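
For reference, a minimal sketch for comparing the node selector the Ceph manifests ask for with the labels the nodes actually carry (`ceph-osd` and `ceph-mon` are taken from the output above; `worker1` is a placeholder node name):

# show the labels currently set on each node
kubectl get nodes --show-labels
# show the nodeSelector the Ceph workloads request
kubectl get daemonset ceph-osd -n ceph -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
kubectl get deployment ceph-mon -n ceph -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
# label only the node(s) that should actually run storage pods
kubectl label node worker1 node-type=storage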

– cryptoparty
  • I would pay more attention to the pending pods. Probably that's why the other ones are crashing. – suren Oct 07 '19 at 11:05
  • Warning FailedScheduling 5m10s (x119 over 3h) default-scheduler 0/4 nodes are available: 4 node(s) didn't match node selector. – cryptoparty Oct 07 '19 at 11:09
  • That means you have a node selector in your yaml file, but none of your nodes is labeled with that selector, so the scheduler can't schedule the pod on any node. Get the yaml file, get the node selector and do `kubectl label node NODE key=value`. key=value is your node selector. – suren Oct 07 '19 at 11:46
  • I did it: kubectl label nodes node-type=storage --all , and now all pods failed: Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "e34272b14a996518cec3895830981fc775a930a95719c4f7b1dc4e6a6ce42f2d" network for pod "ceph-mon-744f6dc9d6-5jjr2": NetworkPlugin cni failed to set up pod "ceph-mon-744f6dc9d6-5jjr2_ceph" network: dial tcp 10.2.67.201:2379: connect: no route to host, failed to clean up sandbox container " – cryptoparty Oct 07 '19 at 12:25
  • Sounds like a pod network error. I don't know why. 2379 is etcd, but I can't relate it to the error. If you check the logs of Calico, does everything seem fine? Are the firewalls correct? (a quick reachability check is sketched below, after these comments) – suren Oct 07 '19 at 13:23
  • 2019-10-07 13:41:10 /opt/ceph-container/bin/entrypoint.sh: k8s: config is stored as k8s secrets. 2019-10-07 13:41:10 /opt/ceph-container/bin/entrypoint.sh: k8s: does not generate the admin key. Use Kubernetes secrets instead. 2019-10-07 13:41:10 /opt/ceph-container/bin/entrypoint.sh: Creating osd unable to get monitor info from DNS SRV with service name: ceph-mon [errno 2] error connecting to the cluster – cryptoparty Oct 07 '19 at 13:45
  • I would say this is a ceph specific issue. I added ceph tag. – suren Oct 07 '19 at 14:13
  • Can you add information about your Kubernetes and Calico versions, please? – Jakub Oct 08 '19 at 08:48
  • calico v3.7.3, kubernetes 1.16 installed from kubespray https://github.com/kubernetes-sigs/kubespray – cryptoparty Oct 08 '19 at 10:59
  • I updated my kernel, but it still doesn't work: Linux master 5.3.6-1.el7.elrepo.x86_64 #1 SMP Fri Oct 11 17:24:39 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux – cryptoparty Oct 21 '19 at 08:59
  • Blindly labelling nodes as `node-type: storage` doesn't fix the underlying problem. The error you see is unrelated to Ceph or to the node labels, but specifically states that your CNI provider (Calico) was unable to set up the Pod's network. Since your `calico-node` pods, which are responsible for configuring the pod's network, are crash-looping it is likely that they are the underlying cause of the issues you are seeing. How did you deploy Ceph? – chaosaffe Nov 08 '19 at 17:44
  • Did you solve the issue ? If you did, what exactly did you do ? – Dime Jun 23 '22 at 19:17
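
Regarding the `dial tcp 10.2.67.201:2379: connect: no route to host` error quoted in the comments above, a minimal reachability check, assuming 10.2.67.201 is the master running etcd and that firewalld is the firewall in use:

# from a worker node: is the etcd client port reachable at all?
timeout 3 bash -c '</dev/tcp/10.2.67.201/2379' && echo "2379 reachable" || echo "2379 blocked or closed"

# on the master: see what is currently open and, if needed, open the etcd ports
sudo firewall-cmd --list-ports
sudo firewall-cmd --add-port=2379-2380/tcp --permanent
sudo firewall-cmd --reload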

2 Answers


It seems that a firewall is blocking ingress traffic to port 10250 on the node 10.2.67.203.

You can open it by running the commands below (assuming firewalld is in use; otherwise run the equivalent commands for your firewall):

sudo firewall-cmd --add-port=10250/tcp --permanent
sudo firewall-cmd --reload
sudo firewall-cmd --list-all  # you should see that port `10250` is updated
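
A quick follow-up check from the master that the port is now reachable (10.2.67.203 is the node from the error above; any HTTP response, typically 401/403 without credentials, is fine here, it is the connection failure that must disappear):

curl -k https://10.2.67.203:10250/
# or a plain TCP check
timeout 3 bash -c '</dev/tcp/10.2.67.203/10250' && echo "10250 reachable" || echo "10250 still blocked"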
– alper

tl;dr: it looks like your cluster itself is fairly broken and should be repaired before looking at Ceph specifically.

Get https://10.2.67.203:10250/containerLogs/ceph/ceph-mon-744f6dc9d6-mqwgb/ceph-mon?tailLines=5000&timestamps=true: dial tcp 10.2.67.203:10250: connect: no route to host

10250 is the port that the Kubernetes API server uses to connect to a node's Kubelet to retrieve the logs.

This error indicates that the Kubernetes API server is unable to reach the node. This has nothing to do with your containers, pods or even your CNI network. no route to host indicates one of the following:

  1. The host is unavailable
  2. A network segmentation has occurred
  3. The Kubelet is unable to answer the API server

Before addressing issues with the Ceph pods I would investigate why the Kubelet isn't reachable from the API server.
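
A few checks that narrow down which of the three cases above applies (run from the master unless noted; 10.2.67.203 is the unreachable node from the error):

# is the node registered and Ready?
kubectl get nodes -o wide

# basic reachability from the master to the node
ping -c 3 10.2.67.203
ip route get 10.2.67.203

# "no route to host" from firewalld is usually a REJECT (icmp-host-prohibited),
# not a genuinely missing route; check whether the kubelet port answers
timeout 3 bash -c '</dev/tcp/10.2.67.203/10250' && echo "10250 reachable" || echo "10250 blocked"

# on the node itself: is the kubelet running and listening, and what does the firewall allow?
systemctl status kubelet
ss -tlnp | grep 10250
sudo firewall-cmd --list-all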

After you have solved the underlying network connectivity issues I would address the crash-looping Calico pods (You can see the logs of the previously executed containers by running kubectl logs -n kube-system calico-node-dwm47 -p).
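
Beyond the previous-container logs, the recorded termination reason and the pod events usually show why calico-node keeps restarting (pod name taken from the question; the same applies to the other calico-node pods):

kubectl describe pod -n kube-system calico-node-dwm47
kubectl get pod -n kube-system calico-node-dwm47 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'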

Once you have both the underlying network and the pod network sorted I would address the issues with the Kubernetes Dashboard crash-looping, and finally, start to investigate why you are having issues deploying Ceph.

– chaosaffe
  • `kubectl get pod -ns kube-system calico-node-dwm47 -p` gives `calico-node-dwm47 1/1 Running 311 60d`, and for a long time there were no errors in the API server logs. But there was `http: TLS handshake error from 10.2.67.26:49312: remote error: tls: bad certificate\n` – cryptoparty Nov 18 '19 at 07:44
  • and I get `ok` from `curl -X GET https://10.2.67.201:6443/healthz -k` – cryptoparty Nov 18 '19 at 08:02
  • and this port: `curl -v -i 10.2.67.201:10250` gives `* About to connect() to 10.2.67.201 port 10250 (#0) * Trying 10.2.67.201... * Connected to 10.2.67.201 (10.2.67.201) port 10250 (#0) > GET / HTTP/1.1` – cryptoparty Nov 18 '19 at 08:08
  • My apologies, I gave you the wrong command, you should use `kubectl logs -p`, not `kubectl get pods` – chaosaffe Nov 18 '19 at 08:21
  • In addition to checking the `kube-apiserver` logs you should also check the `kubelet` logs on an affected host – chaosaffe Nov 18 '19 at 08:28
  • only one WARN entry: `2019-11-18 08:42:03.354 [WARNING][53] active_rules_calculator.go 326: Profile not known or invalid, generating dummy profile that drops all traffic. profileID="ksa.ceph.default"` – cryptoparty Nov 18 '19 at 10:54
  • Well, dropping all traffic means that it won't accept packets, so your pod network won't transit any network traffic if I understand the log entry. Not sure where the log came from, but I would assume it is from the calico-node pods? – chaosaffe Nov 18 '19 at 22:53