72

I am running a Flask application on Kubernetes, in a Docker container, with Gunicorn managing the workers that respond to API requests.

The following warning message appears regularly, and it seems like requests are being cancelled as a result. In Kubernetes, the pod shows no odd behavior or restarts and stays within 80% of its memory and CPU limits.

[2021-03-31 16:30:31 +0200] [1] [WARNING] Worker with pid 26 was terminated due to signal 9

How can we find out why these workers are killed?

Jodiug
  • 5,425
  • 6
  • 32
  • 48
  • Did you manage to find out why? Having the same issue, and tried specifying `--shm-size` - but to no avail. – lionbigcat Jun 03 '21 at 09:56
  • Our problems seem to have gone away since we started using `--worker-class gevent`. I suspect Simon is right and this was either an out of memory error, or a background process running for too long and the main process (1) decided to kill it. – Jodiug Jun 04 '21 at 07:49
  • Meta: I'm not sure why this question is being downvoted. Please drop a comment if you feel it needs further clarification. – Jodiug Jun 04 '21 at 07:53
  • 2
    I have the same problem, and gevent did not solve it. Does anyone know why this started all of a sudden? Was there a change in gunicorn or in kube? – Blop Jun 13 '21 at 06:57
  • Also related to an unanswered question: https://stackoverflow.com/questions/57745100/gunicorn-issues-on-gcloud-memory-faults-and-restarts-thread – Blop Jun 13 '21 at 11:53
  • @Blop - my issue was OOM-related. I had to use a larger instance with more RAM, and gave the docker container access to that RAM. – lionbigcat Jun 15 '21 at 16:57
  • @lionbigcat ye, eventually that's exactly what I did as well. just adding another 1GB fixed the problem. no need to change to gevent. – Blop Jun 16 '21 at 08:21
  • I faced the same issue and solved it by switching from python 3.8 to python 3.7 – Vincent Agnes Aug 29 '21 at 21:42

10 Answers

67

I encountered the same warning message.

[WARNING] Worker with pid 71 was terminated due to signal 9

I came across this FAQ, which says that "A common cause of SIGKILL is when OOM killer terminates a process due to low memory condition."

I used dmesg and realized that it was indeed killed because it was running out of memory.

Out of memory: Killed process 776660 (gunicorn)
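
If you are in a similar situation, a minimal sketch of how to confirm this from the kernel log (the grep pattern and flags are just illustrative; run it on the host or node, since the kernel ring buffer is not namespaced):

# Show kernel messages with human-readable timestamps and filter for OOM kills
dmesg -T | grep -i -E "out of memory|killed process"

# On systemd hosts, the same information is usually available from the journal
journalctl -k | grep -i "killed process"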
Simon
  • 686
  • 5
  • 2
  • 2
    Our problems seem to have gone away since we started using `--worker-class gevent`. I can't verify this answer, but it seems that `dmesg` is a good way to get more information and diagnose the problem. Thanks for your answer! – Jodiug Jun 04 '21 at 07:49
  • I noticed this happen when I didn't provide enough memory to Docker Desktop, which was running Gunicorn workers within a container. Increasing the memory to Docker Desktop solved the problem. – phoenix Jan 03 '23 at 12:57
29

In our case the application was taking around 5-7 minutes to load ML models and dictionaries into memory, so adding a timeout of 600 seconds solved the problem for us.

gunicorn main:app \
   --workers 1 \
   --worker-class uvicorn.workers.UvicornWorker \
   --bind 0.0.0.0:8443 \
   --timeout 600
Yoooda
  • 31
  • 2
  • 7
ACL
  • 409
  • 3
  • 4
3

I encountered the same warning message when I limited the Docker container's memory, e.g. with `-m 3000m`.

See docker-memory and gunicorn - Why are Workers Silently Killed?

The simple way to avoid this is to give Docker a higher memory limit, or not to set one at all.
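
For example, a sketch of how the limit might be raised or removed (the image name and sizes are placeholders):

# Start the container with a larger memory limit (or drop -m entirely)...
docker run -m 6g my-gunicorn-image

# ...or raise the limit on an already-running container
docker update --memory 6g --memory-swap 6g <container-id>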

hstk
  • 163
  • 2
  • 10
2

I was using AWS Elastic Beanstalk to deploy my Flask application and I had a similar error.

In the log I saw:

  • web: MemoryError
  • [CRITICAL] WORKER TIMEOUT
  • [WARNING] Worker with pid XXXXX was terminated due to signal 9

I was using a t2.micro instance, and when I changed it to t2.medium my app worked fine. In addition to this, I changed the timeout in my nginx config file.
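
For reference, a sketch of what that nginx change can look like; the directive values are taken from the author's follow-up comment below, while the file path is an assumption:

# /etc/nginx/conf.d/timeout.conf
keepalive_timeout     600s;
proxy_connect_timeout 600s;
proxy_send_timeout    600s;
proxy_read_timeout    600s;
fastcgi_send_timeout  600s;
fastcgi_read_timeout  600s;
client_max_body_size  20M;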

Vkey
  • 41
  • 5
  • Mind sharing the timeout variable name? – Snehangsu Jun 16 '22 at 15:28
  • Below are the contents of my timeout.conf file under the nginx conf.d folder: keepalive_timeout 600s; proxy_connect_timeout 600s; proxy_send_timeout 600s; proxy_read_timeout 600s; fastcgi_send_timeout 600s; fastcgi_read_timeout 600s; client_max_body_size 20M; – Vkey Jun 20 '22 at 10:47
1

It may be that your liveness check in Kubernetes is killing your workers.

If your liveness check is configured as an HTTP request to an endpoint in your service, a long-running main request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.

That was my case. I have a gunicorn app with a single uvicorn worker, which only handles one request at a time. It worked fine locally, but the worker would sporadically get killed when deployed to Kubernetes. It would only happen during a call that takes about 25 seconds, and not every time.

It turned out that my liveness check was configured to hit the /health route every 10 seconds, time out in 1 second, and retry 3 times. So this call would time out sometimes, but not always.

If this is your case, a possible solution is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so that it waits long enough for your typical request to finish, or to allow for more threads - anything that ensures the health check is never blocked long enough to trigger a worker kill.

You can see that adding more workers may help with (or hide) the problem.

Also, see this reply to a similar question: https://stackoverflow.com/a/73993486/2363627
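
As a rough illustration (not the author's exact configuration; the path, port, and numbers are placeholders to be tuned so they exceed your slowest expected request), a more forgiving probe in the container spec could look like this:

livenessProbe:
  httpGet:
    path: /health          # assumes the app exposes a health endpoint here
    port: 8000
  initialDelaySeconds: 30  # give gunicorn time to boot
  periodSeconds: 30        # probe less often
  timeoutSeconds: 10       # tolerate responses delayed by a long-running request
  failureThreshold: 3      # require ~90s of consecutive failures before a restart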

Gena Kukartsev
  • 1,515
  • 2
  • 17
  • 19
1

I encountered the same problem too, and it was because Docker's memory usage was limited to 2GB. If you are using Docker Desktop, you just need to go to Resources and increase the memory dedicated to Docker (otherwise you need to find the equivalent docker command-line option).

If that doesn't solve the problem, then it might be the timeout that kills the worker; you will need to add a timeout argument to the gunicorn command:

CMD ["gunicorn","--workers", "3", "--timeout", "1000", "--bind", "0.0.0.0:8000", "wsgi:app"]
Yoooda
  • 31
  • 2
  • 7
1

In my case, I needed to connect to a remote database on a private network that requires me to connect to a VPN first, and I forgot that.

So, check your database connection, or anything else that causes your app to wait for a long time.

afifabroory
  • 11
  • 1
  • 4
  • Please phrase this as an explained conditional answer, in order to avoid the impression of asking a clarification question instead of answering (for which a comment should be used instead of an answer, compare https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead ). For example like "If your problem is ... then the solution is to .... because .... ." – Yunnosch Mar 26 '23 at 04:05
  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/34105903) –  Mar 30 '23 at 02:55
1

In my case, I first noticed that decreasing the number of workers from 4 to 2 worked. However, I believe the problem was related to the connection to the db: I tried `-w 4` again after restarting the server that hosts the db, and it worked perfectly.

Mithsew
  • 1,129
  • 8
  • 20
0

In my case the problem was a long application startup caused by ML model warm-up (over 3s).

EgurnovD
  • 165
  • 1
  • 4
0

Check memory usage

In my case, I could not use the dmesg command, so I checked memory usage with a docker command:

sudo docker stats <container-id>

CONTAINER ID   NAME               CPU %     MEM USAGE / LIMIT   MEM %     NET I/O        BLOCK I/O         PIDS
289e1ad7bd1d   funny_sutherland   0.01%     169MiB / 1.908GiB   8.65%     151kB / 96kB   8.23MB / 21.5kB   5

In my case, the workers were not being terminated because of memory.

Yoooda
  • 31
  • 2
  • 7
  • Hey. Did you find anything other than memory that could kill your workers? – Sami Boudoukha May 23 '23 at 17:30
  • @SamiBoudoukha Actually my case was not because of a memory issue. I use Django and it failed to connect to the database internally, with no failure log. Nothing else. – Yoooda May 24 '23 at 00:20