Frequent worker timeout

Question

I have set up Gunicorn with 3 workers, 30 worker connections, and the eventlet worker class. It runs behind Nginx. Every few requests, I see this in the logs:

[ERROR] gunicorn.error: WORKER TIMEOUT (pid:23475)
None
[INFO] gunicorn.error: Booting worker with pid: 23514

Why is this happening? How can I figure out what's going wrong?

Were you able to solve the problem? Please share your thoughts, as I am also stuck on it. `Gunicorn==19.3.1` and `gevent==1.0.1` — Black_Rider, May 20 '15 at 05:41
Found the solution for it. I increased the timeout to a very large value, and then I was able to see the stack trace — Black_Rider, May 20 '15 at 08:38

score 309 · Answer 1 · answered Jun 19 '14 at 11:52

We had the same problem using Django + nginx + gunicorn. Following the Gunicorn documentation, we first configured graceful-timeout, which made almost no difference.

After some testing, we found the solution: the parameter to configure is timeout (not graceful-timeout). It works like clockwork.

So, do the following:

1) Open the gunicorn configuration file.

2) Set TIMEOUT to whatever you need. The value is in seconds.

NUM_WORKERS=3
TIMEOUT=120

exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--name $NAME \
--workers $NUM_WORKERS \
--timeout $TIMEOUT \
--log-level=debug \
--bind=127.0.0.1:9000 \
--pid=$PIDFILE
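The same settings can also live in a Python config file passed with `-c`. A minimal sketch, assuming a file named `gunicorn.conf.py`; the values simply mirror the script above, and the `graceful_timeout` value is only there to show that it is a separate setting:

```python
# gunicorn.conf.py: illustrative values, adjust to your deployment.

bind = "127.0.0.1:9000"
workers = 3

# Seconds a worker may stay silent before the master kills and restarts it.
# This is the setting behind the WORKER TIMEOUT log line.
timeout = 120

# Seconds a worker gets to finish in-flight requests after a restart signal.
# A separate knob; changing it does not affect WORKER TIMEOUT.
graceful_timeout = 30

loglevel = "debug"
```

Start it with something like `gunicorn -c gunicorn.conf.py yourmodule:application` (the module name is a placeholder).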

answered Jun 19 '14 at 11:52 · Amit Talmor

Thanks, this is the right answer. To save resources with many concurrent connections, also run `pip install gevent`, then set `worker_class = "gevent"` in your config file or pass `-k gevent` on the command line. – little_birdie Jan 05 '16 at 04:11
I'm running under supervisor, so I added it to **conf.d/app.conf**: `command=/opt/env_vars/run_with_env.sh /path/to/environment_variables /path/to/gunicorn --timeout 200 --workers 3 --bind unix:/path/to/socket server.wsgi:application` – lukik Dec 01 '18 at 06:55
Addendum: the timeout unit is seconds. Command line: `-t INT` or `--timeout INT` (default: 30 seconds). Workers silent for more than this many seconds are killed and restarted. Details here: https://docs.gunicorn.org/en/stable/settings.html#settings – SACHIN DUHAN Jul 10 '23 at 11:54

score 76 · Answer 2 · edited Jan 05 '18 at 06:45

On Google Cloud, just add --timeout 90 to the entrypoint in app.yaml:

entrypoint: gunicorn -b :$PORT main:app --timeout 90

edited Jan 05 '18 at 06:45 · clemens
answered Jan 05 '18 at 06:26 · Apurv Agarwal

Why 90 sec timeout? – Devy Jan 28 '23 at 01:01
Just pick a large number, e.g. 900. Not too large: if there's a real problem, you don't want to wait indefinitely – ahron Jul 14 '23 at 15:35

score 34 · Answer 3 · edited Jul 28 '20 at 08:12

Run Gunicorn with --log-level debug.

It should give you an app stack trace.

edited Jul 28 '20 at 08:12 · Ahmed Mohamedeen
answered Aug 18 '12 at 16:21 · gwik

I'd love to get a stack trace, but none of these work here, using gunicorn 19.4.5. Debug output is displayed, so I guess the flag was recognized, but there is no stack trace on timeout. – orzel Jul 12 '17 at 14:56
Same here, no stack trace with the flag enabled – Thomas Gak-Deluen May 03 '21 at 10:02
You could override the [worker_abort](https://docs.gunicorn.org/en/stable/settings.html#worker-abort) function in a config file to log a traceback. – Eric Smith Feb 25 '22 at 00:39
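A rough sketch of that idea, assuming a config file passed with `-c`. `worker_abort` is Gunicorn's documented server hook; the log message format is my own choice, and with gevent or eventlet workers this would only show OS threads, not individual greenlets:

```python
# gunicorn.conf.py: sketch of a worker_abort hook that logs stack traces.
import sys
import traceback


def worker_abort(worker):
    """Called by Gunicorn when a worker receives SIGABRT, typically on timeout."""
    # sys._current_frames() maps each thread id to the frame it is executing.
    for thread_id, frame in sys._current_frames().items():
        stack = "".join(traceback.format_stack(frame))
        worker.log.error(
            "worker %s thread %s stack at abort:\n%s",
            worker.pid, thread_id, stack,
        )
```

The logged stack should point at the line the worker was stuck on when the master gave up on it.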

score 27 · Answer 4 · answered Jan 21 '21 at 11:22

The official Microsoft Azure documentation for running Flask apps on Azure App Service (Linux) uses a timeout of 600:

gunicorn --bind=0.0.0.0 --timeout 600 application:app

https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app

answered Jan 21 '21 at 11:22 · Chayan Bansal

Seems a little excessive, but I do appreciate that is official documentation, so I will go with it. – Moir Apr 21 '22 at 13:57

score 26 · Answer 5 · edited Jan 05 '21 at 10:21

Is this endpoint taking too much time?

Maybe you are using Flask without asynchronous support, so every request blocks the worker until it finishes. An easy way to add async support is to use the gevent worker.

With gevent, each new request is handled in its own greenlet (a lightweight cooperative thread), so your app can accept more requests while others wait on I/O.

pip install gevent
gunicorn .... --worker-class gevent
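To illustrate why a blocking handler hurts, here is a rough sketch using only the standard library. It uses OS threads as a stand-in for gevent's greenlets (an assumption for illustration, not how gevent is implemented), and `slow_request` is a hypothetical handler that just sleeps:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_request(delay):
    """Stand-in for a handler that waits on I/O (here, just a sleep)."""
    time.sleep(delay)
    return delay


def run_serial(n, delay):
    """Handle n requests one after another, like a single sync worker."""
    start = time.perf_counter()
    for _ in range(n):
        slow_request(delay)
    return time.perf_counter() - start


def run_concurrent(n, delay):
    """Handle n requests at once, so their waits overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(slow_request, [delay] * n))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"serial:     {run_serial(4, 0.2):.2f}s")
    print(f"concurrent: {run_concurrent(4, 0.2):.2f}s")
```

In the serial case the last request waits behind all the others, which is how a slow endpoint can push later requests past the timeout.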

edited Jan 05 '21 at 10:21 · illagrenan
answered Apr 23 '20 at 13:20 · Ramon Medeiros

score 16 · Answer 6 · answered Aug 09 '18 at 09:31

WORKER TIMEOUT means your application cannot respond to the request within a defined amount of time. You can set this using gunicorn's timeout setting. Some applications need more time to respond than others.

Another thing that may affect this is the choice of worker type.

The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker.

When I had the same problem (I was trying to deploy my application using Docker Swarm), I tried increasing the timeout and using another worker class, but both failed.

Then I suddenly realised I had set the resource limits for the service too low in my compose file. That is what slowed down the application in my case:

deploy:
  replicas: 5
  resources:
    limits:
      cpus: "0.1"
      memory: 50M
  restart_policy:
    condition: on-failure

So I suggest you first check what is slowing down your application.

score 15 · Answer 7 · edited Oct 02 '17 at 18:30

Could it be this? http://docs.gunicorn.org/en/latest/settings.html#timeout

Other possibilities: your response is taking too long, or it is stuck waiting.

edited Oct 02 '17 at 18:30
answered Aug 06 '13 at 03:34 · Ranc

score 13 · Answer 8 · edited Jun 08 '20 at 01:23

This worked for me:

gunicorn app:app -b :8080 --timeout 120 --workers=3 --threads=3 --worker-connections=1000

If you have eventlet add:

--worker-class=eventlet

If you have gevent add:

--worker-class=gevent

edited Jun 08 '20 at 01:23 · Matt Ke
answered Jun 08 '20 at 01:01 · Skerrepy

Fun fact: `--worker-class` and `-k` are equivalent, as are `--timeout` and `-t` – ThisGuyCantEven Aug 13 '20 at 13:46

score 10 · Answer 9 · answered May 09 '19 at 19:24

I had the same problem in Docker.

In Docker I run a trained LightGBM model with Flask serving requests, using gunicorn 19.9.0 as the HTTP server. When I ran my code locally on my Mac laptop everything worked perfectly, but when I ran the app in Docker my POST JSON requests froze for some time, and then the gunicorn worker failed with a [CRITICAL] WORKER TIMEOUT error.

I tried many different approaches, but the only one that solved my issue was adding worker_class=gthread.

Here is my complete config:

import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1
accesslog = "-" # STDOUT
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(q)s" "%(D)s"'
bind = "0.0.0.0:5000"
keepalive = 120
timeout = 120
worker_class = "gthread"
threads = 3

score 7 · Answer 10 · answered May 07 '14 at 10:28

You need to use another worker class, an async one like gevent or tornado. See these two explanations. First one:

You may also want to install Eventlet or Gevent if you expect that your application code may need to pause for extended periods of time during request processing

Second one :

The default synchronous workers assume that your application is resource bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. For instance, a request to the internet meets this criteria. At some point the external network will fail in such a way that clients will pile up on your servers.

How would I actually make use of such a different worker class? — Frederick Nord, Aug 12 '18 at 00:52
@FrederickNord it can be set via the `-k`/`--worker-class` option, see https://docs.gunicorn.org/en/stable/settings.html#worker-class — skwidbreth, Jan 28 '22 at 02:11

score 7 · Answer 11 · answered Sep 18 '15 at 11:06

I had a very similar problem. I also tried using "runserver" to see if I could find anything, but all I got was the message Killed.

So I thought it could be a resource problem, gave the instance more RAM, and it worked.

answered Sep 18 '15 at 11:06 · James Lin

I was seeing this problem even with gevent and the timeout set correctly; running out of memory was the problem – bcattle Sep 28 '16 at 07:08
Yes. The timeout was because it took too long to talk to the worker with the server out of memory. I watched `docker stats`, fixed the code that was using up the memory, and was fine. – Noumenon Nov 16 '21 at 05:55

score 2 · Answer 12 · answered Jun 11 '19 at 01:05

If you are using GCP, then you have to set the number of workers according to the instance type.

Link to GCP best practices: https://cloud.google.com/appengine/docs/standard/python3/runtime

answered Jun 11 '19 at 01:05 · Haider Lasne

score 1 · Answer 13 · answered Nov 28 '19 at 08:53

timeout is a key parameter for this problem.

However, it did not work for me.

I found that there is no gunicorn timeout error when I set workers=1.

When I looked through my code, I found some socket calls (socket.send and socket.recv) in the server init.

socket.recv blocks my code, and that is why it always times out when workers>1.

I hope this gives some ideas to people who have the same problem as me.
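One way to keep a startup-time `recv` from hanging forever is to put a timeout on the socket. This is only a sketch of that general technique, not the code from my server; `recv_with_timeout` is a hypothetical helper name:

```python
import socket


def recv_with_timeout(sock, nbytes, timeout):
    """Return received bytes, or None if nothing arrives within `timeout` seconds."""
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The peer stayed silent; give control back instead of blocking the worker.
        return None


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(recv_with_timeout(a, 1024, 0.1))  # peer has sent nothing yet
    b.sendall(b"ready")
    print(recv_with_timeout(a, 1024, 0.1))  # data is now available
    a.close()
    b.close()
```

With a bounded wait, a silent peer produces an error you can log rather than a worker that never finishes booting.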

score 0 · Answer 14 · answered Nov 20 '19 at 04:17

For me, the solution was to add --timeout 90 to my entrypoint, but it wasn't working because I had TWO entrypoints defined, one in app.yaml, and another in my Dockerfile. I deleted the unused entrypoint and added --timeout 90 in the other.

answered Nov 20 '19 at 04:17 · Preethi Vaidyanathan

score 0 · Answer 15 · answered Sep 14 '20 at 07:04

For me, it was because I forgot to set up a firewall rule on the database server for my Django app.
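A quick way to rule this out is to test plain TCP reachability from the app host before blaming gunicorn. A minimal sketch; the host name below is a made-up placeholder and 5432 is just the default PostgreSQL port:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Hypothetical database host; replace with your own.
    print(can_connect("db.example.internal", 5432))
```

If this returns False from the server that runs gunicorn, the worker is probably hanging on the database connection rather than on your code.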

answered Sep 14 '20 at 07:04 · frank

score 0 · Answer 16 · edited Oct 16 '20 at 23:59

Frank's answer pointed me in the right direction. I have a Digital Ocean droplet accessing a managed Digital Ocean PostgreSQL database. All I needed to do was add my droplet to the database's "Trusted Sources".

(Click on the database in the DO console, then click Settings. Edit Trusted Sources and select the droplet name; click in the editable area and it will be suggested to you.)

edited Oct 16 '20 at 23:59
answered Oct 15 '20 at 12:19 · Susan Enneking

score 0 · Answer 17 · answered Oct 10 '22 at 16:01

Check that your workers are not killed by a health check. A long request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.

For example, if you have a 25-second-long request, and a liveness check is configured to hit a different endpoint in the same service every 10 seconds, time out after 1 second, and retry 3 times, this gives 10 + 1*3 ≈ 13 seconds, so the check would trigger sometimes but not always.

The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - something that makes sure that the health check is not blocked for long enough to trigger worker kill.

You can see that adding more workers may help with (or hide) the problem.
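The arithmetic above can be written down as a small check. This is only a sketch of the simplified model used in this answer (one wait for the next probe, then each retry timing out); real platforms count failures differently, and the function names are made up:

```python
def failure_window(period_s, timeout_s, retries):
    """Seconds until the health check gives up, under this answer's simplified model."""
    return period_s + timeout_s * retries


def request_outlasts_check(request_s, period_s, timeout_s, retries):
    """True if a request can block the worker longer than the check tolerates."""
    return request_s > failure_window(period_s, timeout_s, retries)


if __name__ == "__main__":
    # Numbers from the example: 25 s request, probe every 10 s, 1 s timeout, 3 retries.
    print(failure_window(10, 1, 3))
    print(request_outlasts_check(25, 10, 1, 3))
```

If the second value is true for your slowest typical request, the health check settings are worth a look before touching gunicorn's timeout.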

score 0 · Answer 18 · answered Oct 15 '22 at 23:25

The easiest way that worked for me is to create a new config.py file in the same folder where your app.py exists and to put inside it the timeout and all your desired special configuration:

timeout = 999

Then just run the server while pointing to this configuration file

gunicorn -c config.py --bind 0.0.0.0:5000 wsgi:app

Note that for this command to work, you also need a wsgi.py in the same directory containing the following:

from myproject import app

if __name__ == "__main__":
    app.run()

Cheers!

score 0 · Answer 19 · answered Jan 14 '23 at 17:32

Apart from the gunicorn timeout settings already suggested, since you are using nginx in front, you can check whether these two parameters help: proxy_connect_timeout and proxy_read_timeout, which both default to 60 seconds. You can set them in your nginx configuration file like this:

proxy_connect_timeout 120s;
proxy_read_timeout 120s;

Daniel Olson · Answer 20 · 2023-03-24T02:51:46.767

In my case, I came across this issue when sending larger (10 MB) files to my server. My development server (app.run()) received them without a problem, but gunicorn could not handle them.

For people who run into the same problem: my solution was to send the file in chunks, like this:


    def upload_to_server():
        upload_file_path = location
    
        def read_in_chunks(file_object, chunk_size=524288):
            """Lazy function (generator) to read a file piece by piece.
            Default chunk size: 512 KiB (524288 bytes)."""
            while True:
                data = file_object.read(chunk_size)
                if not data:
                    break
                yield data
    
        with open(upload_file_path, 'rb') as f:
            for piece in read_in_chunks(f):
                r = requests.post(
                    url + '/api/set-doc/stream' + '/' + server_file_name,
                    files={name: piece},
                    headers={'key': key, 'allow_all': 'true'})

my flask server:


    @app.route('/api/set-doc/stream/<name>', methods=['GET', 'POST'])
    def api_set_file_streamed(name):
        folder = escape(name)  # secure_filename(escape(name))
        if 'key' in request.headers:
            if request.headers['key'] != key:                
                return ''
        else:
            return ''
        for fn in request.files:
            file = request.files[fn]
            if fn == '':
                print('no file name')
                flash('No selected file')
                return 'fail'
            if file and allowed_file(file.filename):
                file_dir_path = os.path.join(app.config['UPLOAD_FOLDER'], folder)
                if not os.path.exists(file_dir_path):
                    os.makedirs(file_dir_path)
                file_path = os.path.join(file_dir_path, secure_filename(file.filename)) 
                with open(file_path, 'ab') as f:
                    f.write(file.read())
                return 'sucess'
        return ''

score -6 · Answer 21 · answered Sep 15 '22 at 09:13

If you have changed the name of the Django project, you should also go to

cd /etc/systemd/system/

then

sudo nano gunicorn.service

and verify that the application name at the end of the bind line has been changed to the new name.

answered Sep 15 '22 at 09:13 · Drayen Dörff

This answer is extremely bad; it has no value. You are just saying "open notebook and verify that your config is fine". Also, you should rename "gunicorn.service" to "yourprojectname.service" – oruchkin Sep 16 '22 at 17:36
