We've recently encountered an issue with our Kubernetes deployment running on GKE.
At seemingly random times, the NGINX containers that serve our front-end (FE) application die. This has caused quite a commotion, because nginx-ingress would only tell us that there was an HTTP/2 protocol error. After about a week of turmoil, we finally noticed that the following output appeared in the FE container logs every time we saw the HTTP/2 protocol error (in Chrome):
Once we switched the nginx-ingress down to HTTP/1, we get ERR_CONTENT_LENGTH_MISMATCH 200 instead, but this is still a misleading error.
Here is a gist of all of our configs, for those who are interested: gist
As for the NGINX version, I tried the following:
- stable-alpine
- stable
- mainline-alpine
- 1.17.10-alpine
All result in the same set of logs.
Things I have tried:
- changed the NGINX version for the FE
- told the nginx-ingress to use HTTP/1 (the ingress-level tweaks are sketched just after this list)
- told the nginx-ingress not to use gzip
- tried everything from this Tencent blog post on high availability in NGINX
- turned proxy buffering on and off, both for the nginx-ingress as a whole and for each individual child Ingress
- set max-temp-file-size to 0 in the nginx-ingress
- set max-temp-file-size to 10M in the nginx-ingress
- removed the Accept-Encoding, Content-Length, and Content-Type headers from the request to the upstream
- turned on gzip for the FE container
- set worker_processes to auto, then to 1, in the FE container
- set keepalive_timeout to 65, then to 15, in the FE container
- updated the lifecycle preStop hook on the FE deployment (see the deployment sketch further down)
- set terminationGracePeriodSeconds to 60 on the FE deployment (then removed it)
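In case it helps, here is roughly what the ingress-side changes above look like. This is a sketch, not our actual manifests (the real ones are in the gist): the ConfigMap name/namespace and Ingress name are placeholders, and the exact ingress-nginx keys can vary by controller version.

```yaml
# Sketch of the ingress-side knobs from the list above; names are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller    # placeholder; match your controller's ConfigMap
  namespace: ingress-nginx
data:
  use-http2: "false"      # fall back to HTTP/1.1 for client connections
  use-gzip: "false"       # disable gzip at the ingress layer
  proxy-buffering: "off"  # controller-wide proxy buffering
---
# Per-Ingress overrides on one of the "child" ingresses
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fe-ingress                  # placeholder
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "on"
    nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
spec: {}                            # routing rules omitted; the real spec is in the gist
```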
Before anyone asks: all of the configuration changes made to the nginx-ingress so far have been attempts to solve the HTTP/2 protocol error. Obviously none of them can work, because if the upstream server is down, none of this matters.
What I can deduce is that while NGINX is shutting down (why, I still don't know), the container itself is not restarting, so the pod is effectively a zombie. How do I either (A) force a restart or (B) force the pod to die and be respawned?
Of course, if someone has an answer as to why the NGINX container is told to shut down in the first place, that would also be helpful.
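For completeness, the preStop / terminationGracePeriodSeconds changes mentioned in the list are roughly along these lines (again a sketch with placeholder names, not our exact manifest). As I understand it, NGINX treats SIGTERM as a fast shutdown, so the usual pattern is a preStop hook that waits a moment for the ingress to drop the pod from its endpoints and then asks NGINX to finish in-flight requests; this hasn't fixed the underlying problem, though.

```yaml
# Sketch of the shutdown-related settings on the FE deployment; names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fe-deployment
spec:
  selector:
    matchLabels:
      app: fe
  template:
    metadata:
      labels:
        app: fe
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: fe-nginx
          image: nginx:1.17.10-alpine
          lifecycle:
            preStop:
              exec:
                # Wait briefly so the ingress can remove this pod from its endpoints,
                # then request a graceful shutdown of in-flight requests.
                command: ["/bin/sh", "-c", "sleep 5 && nginx -s quit"]
```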
Another potentially related issue is that sometimes the replicas of this deployment do not start properly: the container reports ready, but produces no logs and accepts no connections.
Killing the pods manually seems to fix this, but that is not a real solution.
The cluster is running n1-standard-2 nodes, and we've got autoscale enabled, so CPU/Memory/Storage are not (should not be, never say never) an issue.
Thanks in advance! Leave a comment if I can improve this question in any way.
Edit #1: Included that we are on GKE.
Edit #2: I've added readiness and liveness probes, and I've updated the NGINX FE server with a health-check route. This seems to be working as a failsafe: if the internal NGINX process stops, or never starts at all, the container is restarted. However, if anyone has better alternatives or a root cause, I'd love to know! Perhaps I should also set specific CPU and memory requests for each pod?
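For reference, the probe setup on the FE container is roughly the following. Sketch only: /healthz stands in for whatever health-check route is actually served by the NGINX config, and the thresholds and resource numbers are values I'm experimenting with, not recommendations.

```yaml
# Fragment of the FE pod spec (spec.template.spec in the Deployment).
containers:
  - name: fe-nginx                  # placeholder
    image: nginx:1.17.10-alpine
    ports:
      - containerPort: 80
    readinessProbe:                 # don't route traffic until NGINX actually answers
      httpGet:
        path: /healthz              # placeholder health-check route
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:                  # restart the container if NGINX stops answering
      httpGet:
        path: /healthz
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
    resources:                      # explicit requests/limits under consideration
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
```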