I built a trivial Docker web app with Flask and gunicorn. Everything works, but occasionally when making a request, the response hangs. I see nothing in the logs until the worker times out. So it seems like the worker is busy with something, times out, and then a new worker picks up the request and responds immediately.
I set it up with only 1 worker. I know I can just add another worker, but there are no requests other than my manual poking and nothing else happening. So I'm super curious what else this one worker or the gunicorn master process could be doing in the container (heartbeat? is error handling really that expensive?).
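To try to catch the worker in the act, the first thing I plan to add is a faulthandler signal handler at the top of app.py (a minimal sketch below; the choice of SIGUSR1 is arbitrary). The idea is that while a request is hanging I can docker exec into the container, send the signal to the worker pid, and get the worker's Python stack dumped to stderr, which ends up in the container logs.
import faulthandler
import signal

# Dump every thread's Python stack to stderr when this process receives SIGUSR1,
# e.g. by running `kill -USR1 <worker pid>` from a shell inside the container.
faulthandler.register(signal.SIGUSR1, all_threads=True)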
My Dockerfile:
FROM python:3.6-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt --no-cache-dir
EXPOSE 5000
CMD ["gunicorn", "-w", "1", "-t", "30", "--worker-tmp-dir", "/dev/shm", "-b", "0.0.0.0:5000", "app:app"]
My trivial app:
import logging
import model
from flask import Flask, request, jsonify

app = Flask(__name__)


@app.errorhandler(Exception)
def handle_error(e):
    code = 500
    app.log_exception(e)
    return jsonify(message=str(e)), code


@app.route("/predict", methods=["POST", "GET"])
def predict():
    result = model.predict(None)
    return jsonify(result)


model = model.mcascorer()

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0')
else:
    gunicorn_logger = logging.getLogger('gunicorn.error')
    app.logger.handlers = gunicorn_logger.handlers
    app.logger.setLevel(gunicorn_logger.level)
My super tiny little "model" I call predict on:
class mcascorer:
    def predict_proba(self, features):
        return 'hello'

    def predict(self, features):
        return 'hello'
It usually responds immediately, but during a timeout the log looks like this:
[2020-05-21 18:09:28 +0000] [9] [DEBUG] Closing connection.
[2020-05-21 18:09:58 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
[2020-05-21 18:09:58 +0000] [9] [INFO] Worker exiting (pid: 9)
[2020-05-21 18:09:58 +0000] [11] [INFO] Booting worker with pid: 11
[2020-05-21 18:09:58 +0000] [11] [DEBUG] GET /predict
It looks like the worker is blocked on something else, but I have no idea what that would be. A heartbeat coming in at the same time shouldn't take that long, yet the worker hangs for many seconds (the entire worker timeout I set, in fact). The only other thing happening is error logging, and I don't see why that would block or take so long, even if it were writing to disk.
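To narrow down whether the request even reaches Flask before the hang, I'm also considering adding request-timing hooks to app.py, as in the sketch below (the handler names are just placeholders; request is already imported in app.py, g and time are new imports). If the "request received" line only shows up after the timeout, the worker was stuck before my code ever ran.
import time
from flask import g

@app.before_request
def log_request_start():
    # Runs as soon as Flask starts handling the request.
    g.request_start = time.monotonic()
    app.logger.debug("request received: %s %s", request.method, request.path)

@app.after_request
def log_request_end(response):
    # Runs after the view returns; logs how long the app itself took.
    elapsed = time.monotonic() - g.request_start
    app.logger.debug("handled %s %s in %.3fs", request.method, request.path, elapsed)
    return response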
The closest issue I could find is "Docker: Running a Flask app via Gunicorn - Worker timeouts? Poor performance?", which links to this article: https://pythonspeed.com/articles/gunicorn-in-docker/
I followed their guide and updated the Dockerfile to put gunicorn's worker temp directory (used for its heartbeat file) on shared memory: "--worker-tmp-dir", "/dev/shm", as shown in the CMD above.
I did not add more workers. I know I can, but I would really like to understand what's going on rather than blindly throwing resources at it. Any ideas are much appreciated.
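One more diagnostic I plan to try: gunicorn's worker_abort server hook, which runs when a worker receives SIGABRT, which is what happens on a worker timeout. Below is a minimal sketch of a gunicorn.conf.py (the file name is just my choice) that dumps every thread's stack at that moment; it would be loaded by adding "-c", "gunicorn.conf.py" to the CMD.
import sys
import traceback

def worker_abort(worker):
    # Called by gunicorn when the worker gets SIGABRT, i.e. on a timeout.
    # Dump the Python stack of every thread so the logs show where it was stuck.
    worker.log.warning("worker received SIGABRT, dumping thread stacks")
    for thread_id, frame in sys._current_frames().items():
        stack = "".join(traceback.format_stack(frame))
        worker.log.warning("thread %s:\n%s", thread_id, stack)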