0

Hi I have a little RKE2 cluster on my Windows Data Center 2022 using Hyper-V on Rocky Linux 8.8. I am attempting to replace kube-proxy with cilium, following this guide

I have ensured that KubeProxyReplacement is set to Strict on the Cilium chart and everything works, only when there is 1 node. Shortly after I join another node, I can no longer resolve my nginx service using curl or nslookup in a pod on the cluster. Sometimes it resolves (in like 10 seconds), other times it exits with 6: Couldn't resolve host

There are no kube-proxy pods (or a DaemonSet), and I do not even see an attempt to query the coredns pod. I checked that the resolv.conf pod (where I issue the curl command) has the correct nameserver (of the coredns service). The expected behavior is this resolution should always be near instantaneous in the cluster.

Furthermore, if I use kubectl rollout to restart the coredns deployment queries return quickly for a short period of time as well, but then start acting sporadic. Doing an nslookup within a pod typically results in a timeout error to the coredns service IP, even though it will respond with the first answer right away but then continue searching?

What am I doing wrong slash where can I look for more information? Thanks!

UPDATE

I enabled the log module in coredns, not seeing any requests/queries. So there appears to be some issue where the coredns service is not even being queried when initiating an HTTP request.

Update

This was a very strange issue - RKE2 deploys coredns as a Deployment, does not mark the controlplane as unschedulable, and when I used the Pod IP of the non controlplane coredns Pod it returned immediately, but when I used the controlplane one it timed out. I cordoned the controlplane and restarted the coredns deployment and things are now good.

user1314147
  • 174
  • 1
  • 5
  • 25
  • Did you try to trace the DNS request with Hubble or tcpdump to see where it is dropped? Did you check if things work when using kube-proxy? – pchaigno Aug 23 '23 at 22:23

0 Answers0