I'm setting up an on-premises Kubernetes cluster with kubeadm.

Here is the Kubernetes version:

clientVersion:
  buildDate: "2022-10-12T10:57:26Z"
  compiler: gc
  gitCommit: 434bfd82814af038ad94d62ebe59b133fcb50506
  gitTreeState: clean
  gitVersion: v1.25.3
  goVersion: go1.19.2
  major: "1"
  minor: "25"
  platform: linux/amd64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2022-10-12T10:49:09Z"
  compiler: gc
  gitCommit: 434bfd82814af038ad94d62ebe59b133fcb50506
  gitTreeState: clean
  gitVersion: v1.25.3
  goVersion: go1.19.2
  major: "1"
  minor: "25"
  platform: linux/amd64

I have installed MetalLB version 0.13.7:

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.7/config/manifests/metallb-native.yaml

Everything is running:

$ kubectl get all -n metallb-system
 
NAME                              READY   STATUS    RESTARTS   AGE
pod/controller-84d6d4db45-l2r55   1/1     Running   0          35s
pod/speaker-48qn4                 1/1     Running   0          35s
pod/speaker-ds8hh                 1/1     Running   0          35s
pod/speaker-pfbcp                 1/1     Running   0          35s
pod/speaker-st7n2                 1/1     Running   0          35s

NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/webhook-service   ClusterIP   10.104.14.119   <none>        443/TCP   35s

NAME                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/speaker   4         4         4       4            4           kubernetes.io/os=linux   35s

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/controller   1/1     1            1           35s

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/controller-84d6d4db45   1         1         1       35s

But when I try to apply an IPAddressPool custom resource, I get an error.

kubectl apply -f ipaddresspool.yaml

Contents of ipaddresspool.yaml:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.2.100-192.168.2.199

The error is a failure to call the validation webhook: "no route to host".

Error from server (InternalError): error when creating "ipaddresspool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": dial tcp 10.104.14.119:443: connect: no route to host

Here is the same error with line breaks:

Error from server (InternalError): 
error when creating "ipaddresspool.yaml": 
Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": 
failed to call webhook: 
Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": 
dial tcp 10.104.14.119:443: connect: no route to host

The IP address is correct:

NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
webhook-service   ClusterIP   10.104.14.119   <none>        443/TCP   18m

I have also tried installing MetalLB v0.13.7 using Helm, but with the same result.
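
For reference, the Helm install was along these lines (the repo URL and chart name are from the standard MetalLB Helm instructions):

helm repo add metallb https://metallb.github.io/metallb
helm repo update
helm install metallb metallb/metallb --version 0.13.7 --namespace metallb-system --create-namespace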

Does anyone know why the webhook cannot be called?

EDIT

In answer to Thomas's question, here is the description of webhook-service. NOTE that this is from another cluster with the same problem, because I deleted the previous cluster, so the IP is not the same as last time.

$ kubectl describe svc webhook-service -n metallb-system

Name:              webhook-service
Namespace:         metallb-system
Labels:            <none>
Annotations:       <none>
Selector:          component=controller
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.105.157.72
IPs:               10.105.157.72
Port:              <unset>  443/TCP
TargetPort:        9443/TCP
Endpoints:         172.17.0.3:9443
Session Affinity:  None
Events:            <none>
  • Could you add the output from `kubectl describe svc webhook-service -n metallb-system` – Thomas Jan 17 '23 at 18:34
  • Which overlay network are you using? Do you have network policies in place? – Thomas Jan 21 '23 at 19:40
  • I also had the same problem with a microk8s cluster with three nodes. Strangely enough, the problem was gone when I tried the next day. Not sure how in the world it was resolved. – Indika K Apr 07 '23 at 07:32

3 Answers


Once understood, the issue is fairly simple.

The MetalLB setup described above works as it is supposed to. However, the Kubernetes setup does not, most likely due to a bad network configuration.


Understanding the error

The key to understanding what is going on is the following error:

Error from server (InternalError): error when creating "ipaddresspool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": dial tcp 10.104.14.119:443: connect: no route to host

Part of the applied MetalLB manifest deploys a so-called ValidatingWebhookConfiguration.

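For reference, the relevant webhook registration in the MetalLB manifest looks roughly like this (abbreviated sketch; the exact v0.13.7 manifest may differ in details):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: metallb-webhook-configuration
webhooks:
- name: ipaddresspoolvalidationwebhook.metallb.io
  failurePolicy: Fail
  clientConfig:
    service:
      name: webhook-service
      namespace: metallb-system
      path: /validate-metallb-io-v1beta1-ipaddresspool
  rules:
  - apiGroups: ["metallb.io"]
    apiVersions: ["v1beta1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["ipaddresspools"]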

In the case of MetalLB, this validating webhook forces the kube-apiserver to:

  1. send MetalLB-related objects like IPAddressPool to the webhook whenever someone creates or updates such an object
  2. wait for the webhook to perform some checks on the object (e.g. validate that CIDRs and IPs are valid and not something like 481.9.141.12.27)
  3. and finally receive an answer from the webhook on whether or not that object satisfies MetalLB's requirements and is allowed to be created (persisted to etcd)

The error above pretty clearly suggests that the first of the three outlined steps is failing.


Debugging

To fix this error one has to debug the current setup, particularly the connection from the kube-apiserver to webhook-service.metallb-system.svc:443.
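
Before digging into routing, a quick sanity check that the Service actually resolves to a healthy endpoint can be done with standard kubectl commands (resource names taken from the question):

kubectl get endpoints webhook-service -n metallb-system
kubectl get pods -n metallb-system -o wide

The IP listed under ENDPOINTS should match the controller pod's IP; if the endpoint list is empty, the webhook pod itself is the problem rather than the network.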

There is a wide range of possible network misconfigurations that could lead to the error. However, with the information available to us, it is most likely an error with the configured CNI.

With that in mind, here is some guidance for the further debugging process:

Since the kube-apiserver is hardened by default, it won't be possible to open a shell in it. For that reason, one should deploy a debug pod with the same network configuration as the kube-apiserver onto one of the control-plane nodes. This can be achieved by executing the following command:

kubectl debug -n kube-system node/<control-plane-node> -it --image=nicolaka/netshoot

Using common tools, one can now reproduce the error inside the interactive shell. The following command is expected to fail (in a similar fashion to the kube-apiserver):

curl -m 10 -k https://<webhook-service-ip>:443/
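
With the misconfiguration at hand, the output should look something like this (the exact wording varies between curl versions):

curl: (7) Failed to connect to 10.104.14.119 port 443: No route to host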

Given the above error message, it should fail due to bad routing on the node. To check the routing table, execute the following command:

routel


The output should show multiple configured CIDR ranges, one of which is supposed to include the IP queried earlier. Most likely, the CIDR range in question will either be missing or have a bad gateway configured, which leads to the no route to host error. It is the CNI's job to update the routing tables on all nodes and ensure that nodes can reach these addresses, so manually adding or editing Kubernetes-related entries in the routing table is not recommended.

Further debugging depends on the exact setup. Depending on the setup and the CNI of choice, kube-proxy may or may not be involved in the issue as well. In any case, inspecting the CNI configuration and logs is a good next step.
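
For illustration, a healthy routing table on a node might look something like this in ip route output (a sketch assuming Flannel, which is only mentioned in the comments, and the default 10.244.0.0/16 pod network):

default via 192.168.2.1 dev eth0
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
192.168.2.0/24 dev eth0 proto kernel scope link src 192.168.2.10

Note that a ClusterIP such as 10.104.14.119 will usually not appear here directly: kube-proxy first translates it to the pod endpoint IP, and it is the route towards that pod CIDR that must exist and be correct.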


Some bonus information

Some CNIs require the user to pay more attention to certain features and configuration, as there can be issues otherwise; consult the documentation of your CNI of choice for known caveats of this kind.
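
In addition, the CNI configuration and logs on the nodes can be inspected roughly like this (a sketch: the config path is the kubeadm default, and the log command assumes Flannel's standard manifest, so adjust the namespace and label for your CNI):

# On each node: inspect the installed CNI configuration
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conf*

# Check the CNI pod logs
kubectl logs -n kube-flannel -l app=flannel --tail=100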

  • I have precisely the same issue, and I am using Flannel. I do not manage to understand how to resolve this. I successfully managed to create the debug pod, but my webhook is not accessible. Can you please tell me how to resolve this issue? – tauqeerahmad24 Mar 21 '23 at 15:04

I was having exactly the same error, but I left it for a while, came back, restarted the node, and tried adding the address pool again; it worked fine. I am not sure what really changed. Maybe some components were still being created when I tried the first time, or it was just a temporary network issue that caused the timeout.


SOLVED FOR ME: I had this same issue after upgrading Rancher and etcd. After reading this, I realized the upgrade may have introduced a network problem. I rebooted all worker nodes, and that resolved the issue.