Hi I have a little RKE2 cluster on my Windows Data Center 2022 using Hyper-V on Rocky Linux 8.8. I am attempting to replace kube-proxy with cilium, following this guide
I have ensured that KubeProxyReplacement is set to Strict on the Cilium chart and everything works, only when there is 1 node. Shortly after I join another node, I can no longer resolve my nginx service using curl or nslookup in a pod on the cluster. Sometimes it resolves (in like 10 seconds), other times it exits with 6: Couldn't resolve host
There are no kube-proxy pods (or a DaemonSet), and I do not even see an attempt to query the coredns pod. I checked that the resolv.conf pod (where I issue the curl command) has the correct nameserver (of the coredns service). The expected behavior is this resolution should always be near instantaneous in the cluster.
Furthermore, if I use kubectl rollout to restart the coredns deployment queries return quickly for a short period of time as well, but then start acting sporadic. Doing an nslookup within a pod typically results in a timeout error to the coredns service IP, even though it will respond with the first answer right away but then continue searching?
What am I doing wrong slash where can I look for more information? Thanks!
UPDATE
I enabled the log module in coredns, not seeing any requests/queries. So there appears to be some issue where the coredns service is not even being queried when initiating an HTTP request.
Update
This was a very strange issue - RKE2 deploys coredns as a Deployment, does not mark the controlplane as unschedulable, and when I used the Pod IP of the non controlplane coredns Pod it returned immediately, but when I used the controlplane one it timed out. I cordoned the controlplane and restarted the coredns deployment and things are now good.