"[CRITICAL] WORKER TIMEOUT" in logs when running "Hello Cloud Run with Python" from GCP Setup Docs

Question

Following the tutorial here I have the following 2 files:

app.py

from flask import Flask, request

app = Flask(__name__)


@app.route('/', methods=['GET'])
def hello():
    """Return a friendly HTTP greeting."""
    who = request.args.get('who', 'World')
    return f'Hello {who}!\n'


if __name__ == '__main__':
    # Used when running locally only. When deploying to Cloud Run,
    # a webserver process such as Gunicorn will serve the app.
    app.run(host='localhost', port=8080, debug=True)

Dockerfile

# Use an official lightweight Python image.
# https://hub.docker.com/_/python
FROM python:3.7-slim

# Install production dependencies.
RUN pip install Flask gunicorn

# Copy local code to the container image.
WORKDIR /app
COPY . .

# Service must listen to $PORT environment variable.
# This default value facilitates local development.
ENV PORT 8080

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
CMD exec gunicorn --bind 0.0.0.0:$PORT --workers 1 --threads 8 app:app

I then build and run them using Cloud Build and Cloud Run:

PROJECT_ID=$(gcloud config get-value project)
DOCKER_IMG="gcr.io/$PROJECT_ID/helloworld-python"
gcloud builds submit --tag $DOCKER_IMG
gcloud run deploy --image $DOCKER_IMG --platform managed

The code appears to run fine, and I am able to access the app on the given URL. However the logs seem to indicate a critical error, and the workers keep restarting. Here is the log file from Cloud Run after starting up the app and making a few requests in my web browser:

2020-03-05T03:37:39.392Z Cloud Run CreateService helloworld-python ...
2020-03-05T03:38:03.285477Z[2020-03-05 03:38:03 +0000] [1] [INFO] Starting gunicorn 20.0.4
2020-03-05T03:38:03.287294Z[2020-03-05 03:38:03 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
2020-03-05T03:38:03.287362Z[2020-03-05 03:38:03 +0000] [1] [INFO] Using worker: threads
2020-03-05T03:38:03.318392Z[2020-03-05 03:38:03 +0000] [4] [INFO] Booting worker with pid: 4
2020-03-05T03:38:15.057898Z[2020-03-05 03:38:15 +0000] [1] [INFO] Starting gunicorn 20.0.4
2020-03-05T03:38:15.059571Z[2020-03-05 03:38:15 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
2020-03-05T03:38:15.059609Z[2020-03-05 03:38:15 +0000] [1] [INFO] Using worker: threads
2020-03-05T03:38:15.099443Z[2020-03-05 03:38:15 +0000] [4] [INFO] Booting worker with pid: 4
2020-03-05T03:38:16.320286ZGET200 297 B 2.9 s Safari 13  https://helloworld-python-xhd7w5igiq-ue.a.run.app/
2020-03-05T03:38:16.489044ZGET404 508 B 6 ms Safari 13  https://helloworld-python-xhd7w5igiq-ue.a.run.app/favicon.ico
2020-03-05T03:38:21.575528ZGET200 288 B 6 ms Safari 13  https://helloworld-python-xhd7w5igiq-ue.a.run.app/
2020-03-05T03:38:27.000761ZGET200 285 B 5 ms Safari 13  https://helloworld-python-xhd7w5igiq-ue.a.run.app/?who=me
2020-03-05T03:38:27.347258ZGET404 508 B 13 ms Safari 13  https://helloworld-python-xhd7w5igiq-ue.a.run.app/favicon.ico
2020-03-05T03:38:34.802266Z[2020-03-05 03:38:34 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:4)
2020-03-05T03:38:35.302340Z[2020-03-05 03:38:35 +0000] [4] [INFO] Worker exiting (pid: 4)
2020-03-05T03:38:48.803505Z[2020-03-05 03:38:48 +0000] [5] [INFO] Booting worker with pid: 5
2020-03-05T03:39:10.202062Z[2020-03-05 03:39:09 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:5)
2020-03-05T03:39:10.702339Z[2020-03-05 03:39:10 +0000] [5] [INFO] Worker exiting (pid: 5)
2020-03-05T03:39:18.801194Z[2020-03-05 03:39:18 +0000] [6] [INFO] Booting worker with pid: 6

Note the worker timeouts and reboots at the end of the logs. The fact that it's a CRITICAL error makes me think it shouldn't be happening. Is this expected behavior? Is this a side effect of the Cloud Run machinery starting and stopping my service as requests come and go?

I am not sure exactly why this is happening. 1) Cloud Run does not support background threads. Any threads will be CPU idled to 0 in-between HTTP requests which will cause TCP connections, etc to fail. 2) You do not need gunicorn. You can simply use `CMD [ "python", "app.py" ]` in your Dockerfile. — John Hanley, Mar 05 '20 at 04:49
For app.py, import `os` and read the port number from the environment like this: `app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))` – John Hanley, Mar 05 '20 at 04:52
@JohnHanley I was under the impression that you should only use the built in flask server for development, and never in production. — jminardi, Mar 05 '20 at 04:58
This is true for Cloud Run as well. Werkzeug (the built-in Flask HTTP server) is not suitable for production use. — Dustin Ingram, Mar 05 '20 at 21:01
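
As a minimal sketch of the port handling suggested in the comments above: the helper name `resolve_port` is hypothetical, and the 8080 fallback is taken from the question's Dockerfile. It only shows how the value of `$PORT` could be read; it does not settle whether the built-in Flask server or gunicorn should serve the app.

```python
import os


def resolve_port(default: int = 8080) -> int:
    """Return the port from $PORT, or a default for local runs.

    `resolve_port` is an illustrative name, not part of Flask or gunicorn.
    """
    return int(os.environ.get("PORT", default))


# In app.py this could be used for local runs as, for example:
#     app.run(host="0.0.0.0", port=resolve_port(), debug=True)
if __name__ == "__main__":
    print(f"Would listen on port {resolve_port()}")
```

The fallback keeps local development working when `PORT` is not set, while the container uses whatever value the platform provides.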

Dustin Ingram · Accepted Answer · 2020-04-16T01:52:13.383

32

Cloud Run has scaled down one of your instances, and the gunicorn arbiter is considering it stalled.

You should add --timeout 0 to your gunicorn invocation to disable the worker timeout entirely. The timeout is unnecessary on Cloud Run.
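
The same settings can also be kept in a gunicorn configuration file, which is plain Python. The following is a sketch only: the file name `gunicorn.conf.py` and the choice to move the options out of the Dockerfile are assumptions, while the values mirror the question's `CMD` line plus the `timeout = 0` suggested here.

```python
# gunicorn.conf.py (assumed file name; pass it explicitly with --config)
import os

# Listen on the port provided via $PORT; 8080 is the local fallback
# used in the question's Dockerfile.
bind = f"0.0.0.0:{os.environ.get('PORT', '8080')}"

# One worker process with 8 threads, as in the original CMD line.
workers = 1
threads = 8

# 0 disables gunicorn's worker timeout, equivalent to `--timeout 0`.
timeout = 0
```

With this file copied into the image, the Dockerfile command would become `CMD exec gunicorn --config gunicorn.conf.py app:app`.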

edited Apr 16 '20 at 01:52

answered Mar 05 '20 at 21:03



The `--preload` option seems to have fixed the issue. How is it possible that the simplest flask app possible is taking too long to start? Are you aware of any tradeoffs when using the `--preload` option, or any other considerations I should be aware of? – jminardi Mar 05 '20 at 21:17
This might be an issue with how Gunicorn interprets a stalled worker under Cloud Run's runtime; it doesn't seem like the application is actually taking that long to start. The tradeoffs are listed here: https://docs.gunicorn.org/en/stable/settings.html#preload-app – Dustin Ingram Mar 05 '20 at 22:44
It seems like this might be the root cause: https://docs.gunicorn.org/en/stable/faq.html#blocking-os-fchmod – Dustin Ingram Mar 05 '20 at 22:59
Thank you. I accepted your answer because the `--preload` option seemed to work for me. – jminardi Mar 06 '20 at 08:22

score 4 · Answer 2 · answered Dec 31 '21 at 05:05


I was facing the error [11229] [CRITICAL] WORKER TIMEOUT (pid:11232) on Heroku. I changed my Procfile to this:

web: gunicorn --workers=3 app:app --timeout 200 --log-file -

Increasing the --timeout fixed my problem.


Muhammad Zakaria



Increasing `timeout` was a good solution for me. I am dealing with Plotly plots that sometimes take a long time to render. – eemilk Feb 03 '22 at 12:20

Waelmas · Answer 3 · 2020-03-06T08:29:16.897

Here's a working example of a Flask app on Cloud Run. My guess is that the last line of your Dockerfile and the last part of your Python file are the ones causing this behavior.

main.py

# main.py
#gcloud beta run services replace service.yaml


from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello_world():
    # Return a plain-text greeting for the root URL.
    msg = "Hello World"
    return msg

Dockerfile (the apt-get part is not needed)

# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.7

# Install manually all the missing libraries
RUN apt-get update
RUN apt-get install -y gconf-service libasound2 libatk1.0-0 libcairo2 libcups2 libfontconfig1 libgdk-pixbuf2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libxss1 fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils

# Install Python dependencies.
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . .

CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 main:app

then build using:

gcloud builds submit --tag gcr.io/[PROJECT]/[MY_SERVICE]

and deploy:

gcloud beta run deploy [MY_SERVICE] --image gcr.io/[PROJECT]/[MY_SERVICE] --region europe-west1 --platform managed

UPDATE: I've checked the logs you provided again. This kind of warning/error is normal right after a new deployment: your old instances are no longer handling requests and sit idle until they are completely shut down.

Gunicorn also has a default timeout of 30s, which roughly matches the time between "Booting worker" and the error.
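
A rough way to check this against the question's log is to subtract the "Booting worker" timestamp from the matching "WORKER TIMEOUT" timestamp. The sketch below uses only timestamps copied from that log; the helper name `gap_seconds` is hypothetical. Because the log shows two gunicorn starts that both report pid 4, both candidate boot times for pid 4 are included.

```python
from datetime import datetime

# Timestamp format of the Cloud Run log prefix in the question.
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"


def gap_seconds(boot: str, timed_out: str) -> float:
    """Seconds between a 'Booting worker' line and its 'WORKER TIMEOUT' line."""
    delta = datetime.strptime(timed_out, FMT) - datetime.strptime(boot, FMT)
    return delta.total_seconds()


# (label, boot timestamp, timeout timestamp), all copied from the log above.
PAIRS = [
    ("pid 4, first start", "2020-03-05T03:38:03.318392Z", "2020-03-05T03:38:34.802266Z"),
    ("pid 4, second start", "2020-03-05T03:38:15.099443Z", "2020-03-05T03:38:34.802266Z"),
    ("pid 5", "2020-03-05T03:38:48.803505Z", "2020-03-05T03:39:10.202062Z"),
]

for label, boot, timed_out in PAIRS:
    print(f"{label}: {gap_seconds(boot, timed_out):.1f} s")
```

Treat the printed gaps as an approximate comparison with the 30s default: the pairing for pid 4 is ambiguous, and these are Cloud Run log timestamps rather than gunicorn's internal clock.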

I tried launching a totally new service so it wouldn't have any old instances to shut down but I still got the timeout. — jminardi, Mar 08 '20 at 16:49

score 0 · Answer 4 · answered Feb 19 '23 at 20:52

For anyone who lands here with the same problem under Django (it will probably work the same way) with gunicorn, supervisor, and nginx: check the configuration in your gunicorn_start file, or wherever you set the gunicorn parameters. In my case it looks like this; the timeout is added on the last line.

NAME="myapp"                                  # Name of the application
DJANGODIR=/webapps/myapp             # Django project directory
SOCKFILE=/webapps/myapp/run/gunicorn.sock  # we will communicate using this unix socket
USER=root                                        # the user to run as
GROUP=root                                     # the group to run as
NUM_WORKERS=3                                     # how many worker processes should Gunicorn spawn
DJANGO_SETTINGS_MODULE=myapp.settings             # which settings file should Django use
DJANGO_WSGI_MODULE=myapp.wsgi                     # WSGI module name

echo "Starting $NAME as `whoami`"

# Activate the virtual environment
cd $DJANGODIR
source ../bin/activate
export DJANGO_SETTINGS_MODULE=$DJANGO_SETTINGS_MODULE
export PYTHONPATH=$DJANGODIR:$PYTHONPATH

# Create the run directory if it doesn't exist
RUNDIR=$(dirname $SOCKFILE)
test -d $RUNDIR || mkdir -p $RUNDIR

# Start your Django Unicorn
# Programs meant to be run under supervisor should not daemonize themselves (do not use --daemon)
exec ../bin/gunicorn ${DJANGO_WSGI_MODULE}:application \
  --name $NAME \
  --workers $NUM_WORKERS \
  --user=$USER --group=$GROUP \
  --bind=unix:$SOCKFILE \
  --log-level=debug \
  --log-file=- \
  --timeout 120  # this is the added line: raises the worker timeout above gunicorn's 30s default
