
I built a trivial Dockerized web app with Flask and Gunicorn. Everything works, but occasionally a request hangs, and I see nothing in the logs until the worker times out. So it seems the worker is busy with something, times out, and then a new worker picks up the request and responds immediately.

I set it up with only one worker. I know I could just add another worker, but there are no requests other than my manual poking and nothing else is happening, so I'm very curious what this one worker (or the Gunicorn master) could be doing in the container (a heartbeat? error handling that's expensive?).

My Dockerfile:

FROM python:3.6-slim

WORKDIR /app

COPY . .

RUN pip install -r requirements.txt --no-cache-dir

EXPOSE 5000

CMD ["gunicorn", "-w", "1", "-t", "30", "--worker-tmp-dir", "/dev/shm", "-b", "0.0.0.0:5000", "app:app"]

My trivial app:

import logging
import model
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.errorhandler(Exception)
def handle_error(e):
    code = 500
    app.log_exception(e)
    return jsonify(message=str(e)), code

@app.route("/predict", methods=["POST", "GET"])
def predict():
    result = model.predict(None)
    return jsonify(result)

model = model.mcascorer()  # note: this rebinds the module name 'model' to the scorer instance

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0')
else:
    gunicorn_logger = logging.getLogger('gunicorn.error')
    app.logger.handlers = gunicorn_logger.handlers
    app.logger.setLevel(gunicorn_logger.level)

My super tiny little "model" I call predict on:

class mcascorer:

    def predict_proba(self, features):
        return 'hello'

    def predict(self, features):
        return 'hello'

It usually responds immediately, but during a timeout the log looks like this:

[2020-05-21 18:09:28 +0000] [9] [DEBUG] Closing connection.
[2020-05-21 18:09:58 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
[2020-05-21 18:09:58 +0000] [9] [INFO] Worker exiting (pid: 9)
[2020-05-21 18:09:58 +0000] [11] [INFO] Booting worker with pid: 11
[2020-05-21 18:09:58 +0000] [11] [DEBUG] GET /predict

It appears the worker is blocked on something else, but I have no idea what that would be. A heartbeat check coming in at the same time shouldn't take that long, yet it hangs for many seconds, in fact for the entire duration of the worker timeout I set. The only other thing happening is error logging, but I'm not sure why that would block or take so long. Even if it were writing to disk, this seems odd.
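For context on what the master and worker are doing between requests: Gunicorn's timeout is a file-based heartbeat. Below is a rough sketch of that mechanism (simplified names and constants, not Gunicorn's actual API). The key point is that a sync worker can only touch its heartbeat file between requests, so anything that blocks the single worker also starves the heartbeat, and the master kills and reboots it.

```python
import os
import tempfile
import time

TIMEOUT = 30  # matches the -t 30 passed to gunicorn above

# Each worker owns a temp file (placed under --worker-tmp-dir).
fd, path = tempfile.mkstemp()

def worker_heartbeat():
    # The worker "touches" its file between requests by toggling permissions,
    # which bumps the file's ctime without writing any data.
    os.fchmod(fd, 0o600)

def master_check():
    # The master periodically compares the file's ctime against the timeout;
    # if the worker hasn't touched it recently, the master kills the worker.
    age = time.time() - os.fstat(fd).st_ctime
    return age <= TIMEOUT  # False -> worker is declared dead and rebooted
```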

The closest issue I could find is here: "Docker: Running a Flask app via Gunicorn - Worker timeouts? Poor performance?", which links to this article: https://pythonspeed.com/articles/gunicorn-in-docker/

I followed their guide and updated the Dockerfile to keep the worker heartbeat files in shared memory: "--worker-tmp-dir", "/dev/shm"

I did not add more workers. I know I can, but I would really like to understand what's going on rather than blindly throwing resources at it. Any ideas are much appreciated.

jtlangton
    We have the same issue. Did you manage to solve the problem you described? – hleb.albau Nov 03 '20 at 08:30
  • I did not, sorry. The easiest first thing to try is probably increasing workers. Basically, the one worker gets blocked on something, has to be rebooted, and then things go back to normal. You may check the links I included in case the other articles solve your problem. My best guess is that it's my OS setup (I develop in a VM): the worker probably blocks on a system resource, like trying to access some virtual memory, and then has to be restarted. This should show up in OS logs somewhere, or maybe in the VM's. Since this seems to happen at regular intervals, it may be a heartbeat or log writing. – jojo Nov 04 '20 at 13:51

1 Answer


It's discussed here: https://github.com/benoitc/gunicorn/issues/1923 They propose two solutions:

  1. Pass --threads 2 (or higher) to gunicorn.
  2. Pass the --preload flag.

For me, the first solution works; I haven't tried the second one.
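Applied to the Dockerfile in the question, the first suggestion is a one-flag change (a sketch; the flag value is illustrative). Note that passing --threads greater than 1 implicitly switches Gunicorn from the sync worker to the gthread worker class, so a single slow request no longer starves the heartbeat:

```
CMD ["gunicorn", "-w", "1", "--threads", "2", "-t", "30", "--worker-tmp-dir", "/dev/shm", "-b", "0.0.0.0:5000", "app:app"]
```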
Serge Rogatch