Update
We commented out the Django migration and collectstatic steps in our Dockerfile and managed to make a new deployment (the liveness/readiness probes passed). We thought the problem was related to one of those steps, but then we restored both `python manage.py migrate` and `python manage.py collectstatic` and everything kept working. So the deploys are working again, but we don't know why they stopped working in the first place.
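For context, the relevant part of our Dockerfile looks roughly like this (the base image, project layout, and gunicorn module here are illustrative, not our exact file):

```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt

# The two steps we commented out while testing, then restored;
# deploys passed the probes both with and without them.
RUN python manage.py collectstatic --noinput
RUN python manage.py migrate --noinput

EXPOSE 5000
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "project.wsgi"]
```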
But we still cannot connect to the healthy/running pods using `kubectl`. We are still receiving timeout errors, even from the GitLab interface.
We have an application running on a Kubernetes cluster managed by GitLab Auto DevOps. A few days ago, for reasons unknown, we stopped being able to connect to our pods using `kubectl`. We receive `Error from server: error dialing backend: dial timeout, backstop`. To connect to a pod we use `kubectl -n <namespace> exec -it <pod> -- bash`.
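From what we've read, `error dialing backend` is returned when the API server cannot open a connection to the kubelet on the node, so the exec request never reaches the pod. A few checks we can run from outside the cluster (the `gke-` name filter below is an assumption about GKE's default firewall rule naming):

```sh
# Which node hosts the unreachable pod?
kubectl -n <namespace> get pod <pod> -o wide

# Are the nodes themselves considered healthy?
kubectl get nodes -o wide

# On GKE the control plane reaches the kubelet on TCP 10250;
# verify a firewall rule still allows it (default rules start with "gke-")
gcloud compute firewall-rules list --filter="name~^gke-"
```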
Additionally, at the same time, our deploys started failing due to liveness and readiness probe failures. Checking on GKE, we see these messages:

```
Readiness probe failed: Get "http://10.59.1.234:5000/": dial tcp 10.59.1.232:5000: connect: connection refused
Liveness probe failed: Get "http://10.59.1.234:5000/": dial tcp 10.59.1.232:5000: connect: connection refused
```
I've tried increasing the value of `initialDelaySeconds` (the Helm variable that controls the probe), but without success. There is a timeout error after 5 minutes: `Error: release review-fix-run-pi-xrsd0t failed, and has been uninstalled due to atomic being set: timed out waiting for the condition`.
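For reference, this is how we've been overriding the probe delay, via the values file that Auto DevOps passes to its auto-deploy-app Helm chart (the key names assume the default chart; 60 is just one of the values we tried):

```yaml
# .gitlab/auto-deploy-values.yaml
livenessProbe:
  initialDelaySeconds: 60   # raised from the chart default
readinessProbe:
  initialDelaySeconds: 60
```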
The application is still up and running, but we can't make new deploys or access the pods.
Below is the output of `kubectl -n <namespace> describe pod <pod>`, executed while the pipeline was running. After a few minutes, the pipeline failed.
```
IP:           10.59.1.234
IPs:
  IP:         10.59.1.234
Port:         5000/TCP
Host Port:    0/TCP
State:        Running
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  75s               default-scheduler  Successfully assigned daeb5798-review-fix-run-pi-xrsd0t/review-fix-run-pi-xrsd0t-6c975b9f4d-wgc4r to gke-os-us-central1-default-pool-da55e92e-8ssdxx
  Normal   Pulling    55s               kubelet            Pulling image "registry.gitlab.com/fix-run-pipeline:6a43a82369e87eee4ad86023694167aef6886451"
  Normal   Pulled     53s               kubelet            Successfully pulled image "registry.gitlab.com/fix-run-pipeline:6a43a82369e87eee4ad86023694167aef6886451" in 2.819782241s
  Normal   Created    51s               kubelet            Created container auto-deploy-app
  Normal   Started    49s               kubelet            Started container auto-deploy-app
  Warning  Unhealthy  5s (x4 over 35s)  kubelet            Readiness probe failed: Get "http://10.59.1.234:5000/readiness/": dial tcp 10.59.1.234:5000: connect: connection refused
  Warning  Unhealthy  5s (x3 over 25s)  kubelet            Liveness probe failed: Get "http://10.59.1.234:5000/healthz/": dial tcp 10.59.1.234:5000: connect: connection refused
  Normal   Killing    5s                kubelet            Container auto-deploy-app failed liveness probe, will be restarted
```
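Since `connection refused` means nothing was listening on `10.59.1.234:5000` at probe time, we considered checking the container from inside the cluster without going through `kubectl exec` (the pod name and curl image below are just examples):

```sh
# Tail the app logs to see whether the server ever binds to :5000
# (note: logs are also served via the kubelet, so this may hit the same timeout)
kubectl -n <namespace> logs <pod> --tail=50

# Probe the pod IP from a throwaway pod; the container's exit status
# tells us whether :5000 answered
kubectl -n <namespace> run net-debug --restart=Never \
  --image=curlimages/curl -- curl -sf -m 5 http://10.59.1.234:5000/
kubectl -n <namespace> get pod net-debug   # Completed = reachable, Error = refused
kubectl -n <namespace> delete pod net-debug
```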
Something tells me this issue is related to our cluster's internal network, but I don't know where to go from here.
We also noticed some warnings about pods with unready status.
Any tips on how to solve or investigate this issue?
Thanks in advance.