I am using a Spring Boot deployment (a typical micro-service web server setup: a Gateway, a separate authentication server, etc., fronted by an nginx reverse proxy/load balancer). We orchestrate Docker containers with Kubernetes. We are preparing for production deployment and have recently started load testing, which has revealed some issues in how these loads are handled.
My issue is that when I subject the servers to high load (here, performance testing with Gatling), the liveness probes start returning 503 errors because of that load, and Kubernetes responds by restarting the pod.
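For context, our probe configuration currently looks roughly like this (port and timings are illustrative, not our exact values); the point is that the liveness probe hits the aggregate Actuator health endpoint, which also runs the dependency checks:

```yaml
# Illustrative sketch of our current container spec: the liveness probe
# targets the aggregate /actuator/health endpoint, so under heavy load a
# slow health response (or a failing dependency check) returns 503 and
# counts as a liveness failure.
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
```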
Naturally, the liveness probe is important, but when the system starts dropping requests, the last thing we should do is kill pods: that shifts the load onto the remaining pods and causes cascading failures.
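As far as I understand, the behaviour we actually want under load is what a readiness probe gives us: take the overloaded pod out of the Service endpoints until it recovers, without restarting it. A sketch of the split I have in mind, assuming the dedicated Spring Boot probe endpoints are enabled (see the configuration sketch further down):

```yaml
# Hypothetical split: only a failing liveness probe restarts the pod;
# a failing readiness probe merely stops routing traffic to it.
livenessProbe:
  httpGet:
    path: /actuator/health/liveness   # internal state only
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness  # may include dependency checks
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```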
This specific problem with the Spring Actuator health check is described in this SO question, which offers some hints, but the answers are not thorough. In particular, the idea of using a liveness command (e.g. checking whether the Java process is running) seems inadequate to me: it would miss real downtime where the Java process is still running but the application is broken by an unhandled exception or an unavailable resource (database, Kafka...).
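The closest thing to a proper solution I have found so far is Spring Boot's dedicated probe support (available since 2.3): the liveness group then only reflects the application's internal LivenessState, and slow or external dependency checks can be moved into the readiness group instead. A sketch of the application.yml I am considering; note that `db` is the auto-configured DataSource indicator, while `kafka` is a hypothetical name for a custom HealthIndicator we would have to write ourselves:

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true  # exposes /actuator/health/liveness and /actuator/health/readiness
      group:
        readiness:
          # Dependency checks belong in readiness: if the database or Kafka
          # is unavailable, stop routing traffic to the pod, but do not
          # restart it. "kafka" assumes a custom HealthIndicator bean
          # registered under that name.
          include: readinessState,db,kafka
        # The liveness group keeps its default content (livenessState only),
        # so a slow database under load can no longer fail the liveness probe.
```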
- Is there a good guide to configuring Spring Boot for production on Kubernetes/cloud deployments?
- How do I deal with the specific issue of the liveness probe failing under high load? Does anyone have experience with this? (My current tuning attempt is sketched below.)
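For reference, the mitigation I am currently experimenting with (no firm conclusions yet) is simply making the liveness probe much more tolerant, so that only a sustained failure triggers a restart; the values below are illustrative:

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5     # tolerate slow responses under load instead of counting them as failures
  failureThreshold: 6   # the probe must fail for about a minute before Kubernetes restarts the pod
```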