
Update

We commented out the Django migration and collectstatic steps in our Dockerfile and managed to make a new deployment (the liveness/readiness probes passed). We thought the problem was related to one of those steps, but then we put both `python manage.py migrate` and `python manage.py collectstatic` back and everything kept working. So the deploys are working again, but we don't know why they stopped working in the first place.
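For reference, the two build steps we toggled look roughly like this in the Dockerfile. This is only a simplified sketch: the base image, flags, and start command are placeholders, not our exact file; the two manage.py commands are the only parts we actually commented out and restored.

    # Simplified sketch of the Dockerfile (base image, flags and start
    # command are placeholders; only the two manage.py steps are the
    # ones we toggled).
    FROM python:3.9-slim

    WORKDIR /app
    COPY . .
    RUN pip install -r requirements.txt

    # The two steps we temporarily commented out and later restored:
    RUN python manage.py collectstatic --noinput
    RUN python manage.py migrate --noinput

    EXPOSE 5000
    # Hypothetical start command; the app listens on port 5000, which is
    # the port the liveness/readiness probes hit.
    CMD ["gunicorn", "--bind", "0.0.0.0:5000", "config.wsgi"]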

But we still cannot connect to the healthy/running pods using kubectl. We are still receiving timeout errors, even from the GitLab interface.


We have an application running on a Kubernetes cluster managed by GitLab Auto DevOps. A few days ago, for some unknown reason, we stopped being able to connect to our pods using kubectl. We receive `Error from server: error dialing backend: dial timeout, backstop`. To connect to a pod we use `kubectl -n <namespace> exec -it <pod> -- bash`.
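For clarity, a failed attempt looks like this (namespace and pod name redacted; the error message is copied verbatim):

    $ kubectl -n <namespace> exec -it <pod> -- bash
    Error from server: error dialing backend: dial timeout, backstop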

Additionally, at the same time, our deploys started to fail due to liveness and readiness probe failures. Checking on GKE, we see these messages:

  • Readiness probe failed: Get "http://10.59.1.234:5000/": dial tcp 10.59.1.232:5000: connect: connection refused
  • Liveness probe failed: Get "http://10.59.1.234:5000/": dial tcp 10.59.1.232:5000: connect: connection refused

I've tried increasing the value of initialDelaySeconds (the Helm variable that controls the probes), but without success. There is a timeout error after 5 minutes (`Error: release review-fix-run-pi-xrsd0t failed, and has been uninstalled due to atomic being set: timed out waiting for the condition`).
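For reference, this is roughly the override we put in `.gitlab/auto-deploy-values.yaml` (the two top-level keys follow the auto-deploy-app chart's values file; the delay shown here is only an example of the numbers we experimented with):

    # .gitlab/auto-deploy-values.yaml - sketch of the probe override we tried.
    # Key names follow the auto-deploy-app chart values; 120 is just an
    # example of the delays we experimented with.
    livenessProbe:
      initialDelaySeconds: 120
    readinessProbe:
      initialDelaySeconds: 120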

The application is still up and running, but we can't make new deploys or access the pods.

Below is the output of the command `kubectl -n <namespace> describe pod <pod>`, which was executed during the pipeline. After a few minutes, the pipeline failed.

IP:           10.59.1.234
IPs:
  IP:           10.59.1.234

Port:           5000/TCP
Host Port:      0/TCP
State:          Running

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  75s               default-scheduler  Successfully assigned daeb5798-review-fix-run-pi-xrsd0t/review-fix-run-pi-xrsd0t-6c975b9f4d-wgc4r to gke-os-us-central1-default-pool-da55e92e-8ssdxx
  Normal   Pulling    55s               kubelet            Pulling image "registry.gitlab.com/fix-run-pipeline:6a43a82369e87eee4ad86023694167aef6886451"
  Normal   Pulled     53s               kubelet            Successfully pulled image "registry.gitlab.com/fix-run-pipeline:6a43a82369e87eee4ad86023694167aef6886451" in 2.819782241s
  Normal   Created    51s               kubelet            Created container auto-deploy-app
  Normal   Started    49s               kubelet            Started container auto-deploy-app
  Warning  Unhealthy  5s (x4 over 35s)  kubelet            Readiness probe failed: Get "http://10.59.1.234:5000/readiness/": dial tcp 10.59.1.234:5000: connect: connection refused
  Warning  Unhealthy  5s (x3 over 25s)  kubelet            Liveness probe failed: Get "http://10.59.1.234:5000/healthz/": dial tcp 10.59.1.234:5000: connect: connection refused
  Normal   Killing    5s                kubelet            Container auto-deploy-app failed liveness probe, will be restarted

Something tells me that this issue is related to our cluster's internal network, but I don't know where to go from here.

We also noticed some warnings about pods with an unready status.

Any tips on how to solve or investigate this issue?

Thanks in advance.

Rhenan Bartels
  • `...dial tcp 10.59.1.234:5000: connect: connection refused` - have you checked the pod logs to ensure the server that listens on port 5000 is up and running? – gohm'c Dec 16 '21 at 04:48
  • I am taking a wild guess here: your application probably has some error that prevents it from listening on port 5000. Remove `Liveness` and `Readiness` from the YAML file so the cluster doesn't kill the pod; then you can connect to the pod with `kubectl -n <namespace> exec -it <pod> -- bash` – yip102011 Dec 16 '21 at 06:19
  • Did you already determine which pod is unhealthy/having issues? Can you check or share the logs? You can use this filter on Logs Explorer to check: (resource.type="k8s_container" resource.labels.pod_name=POD_NAME) – JaysonM Dec 16 '21 at 07:44
  • @gohm'c I've just created a new git branch from the main branch, which is currently running. The pipeline also failed due to liveness/readiness probe failures – Rhenan Bartels Dec 16 '21 at 13:35
  • @yip102011 I've tried to remove the `Liveness` and `Readiness` probes from the pipeline, but as we use GitLab Auto DevOps, Helm still uses its default values for the health checks. What I did was update the `.gitlab/auto-deploy-values.yaml` file and set both `livenessProbe` and `readinessProbe` to `{}`. Any tips on how to disable the liveness and readiness step? Helm values reference: https://gitlab.com/gitlab-org/cluster-integration/auto-deploy-image/-/tree/master/assets/auto-deploy-app – Rhenan Bartels Dec 16 '21 at 14:00
  • Can you provide your values file? – yip102011 Dec 17 '21 at 03:35

0 Answers