
I'm trying to debug rare hangs of a Django application. So far I have failed to isolate the problem; it happens about once a day in production, and Gunicorn restarts the process with the message:

[CRITICAL] WORKER TIMEOUT

Is there a way to configure Django or Gunicorn to dump a stack trace of a process that is restarted?

asked by Jan Wrobel (question edited by Nick ODell)

3 Answers


Try making your Gunicorn logging more verbose; setting the log level to INFO or DEBUG might shed more light on what is happening.
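For example, the log level can be raised on the command line (myproject.wsgi stands in for your own WSGI entry point):

gunicorn --log-level debug myproject.wsgi

The same can be set with loglevel = "debug" in a Gunicorn config file.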

You could also take a look at dogslow, a Django watchdog middleware that logs tracebacks of slow requests: https://pypi.python.org/pypi/dogslow.
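A minimal wiring sketch in Django settings, assuming the middleware path and setting names from the dogslow README (double-check them against the version you install):

# settings.py (Django of that era; newer versions use MIDDLEWARE)
MIDDLEWARE_CLASSES = (
    'dogslow.WatchdogMiddleware',
    # ... the rest of your middleware ...
)

DOGSLOW = True           # enable the watchdog
DOGSLOW_TIMER = 25       # log a traceback for any request still running after 25 s
DOGSLOW_OUTPUT = '/tmp'  # directory where the tracebacks are written

Setting DOGSLOW_TIMER slightly below Gunicorn's timeout should capture a trace of a stuck request before the worker is killed.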

And for a general logging win, try using Sentry: https://www.getsentry.com/welcome/.

Random question: are there any crons, backups, or that sort of thing running on the server at that time?

– krak3n
  • dogslow looks great, I can set its timeout to be a little lower than the Gunicorn timeout and get a trace of a locked process. Thanks! This runs on Heroku, so there shouldn't be any heavy background processes. – Jan Wrobel Mar 17 '13 at 07:06
  • Nick ODell's answer below helped me, but I'm wondering if the same should have been achievable by setting the Gunicorn log levels as suggested by this accepted answer. Or was it the dogslow suggestion that solved the OP's problem? – Tuomo Mattila Nov 22 '19 at 14:15

This will print a stack trace for the worker at the time it gets killed. You'll need to create a Gunicorn config file and paste the following function into it.

import io
import traceback

# Gunicorn server hook: runs inside the worker when it receives SIGABRT,
# which the master sends when the worker exceeds its timeout.
def worker_abort(worker):
    debug_info = io.StringIO()
    debug_info.write("Traceback at time of timeout:\n")
    traceback.print_stack(file=debug_info)
    worker.log.critical(debug_info.getvalue())

(The above is written for Python 3. If you use Python 2, you'll need to change io to StringIO. See How to use StringIO in Python3?)
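To make Gunicorn pick up the hook, point it at the config file with -c when starting (gunicorn_conf.py and myproject.wsgi are placeholder names):

gunicorn -c gunicorn_conf.py myproject.wsgi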

– Nick ODell

timeout is not meant as a request timeout; it's meant as a liveness check for workers. For sync workers it functions as a request timeout, because the worker cannot do anything other than process the request. Asynchronous workers heartbeat even while they are handling long-running requests, so unless the worker blocks/freezes it won't be killed.
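For example, both settings live in the Gunicorn config file (a sketch; 30 seconds is the default):

# gunicorn config file
worker_class = "sync"  # with sync workers, timeout effectively caps request time
timeout = 30           # seconds without a heartbeat before the master aborts the worker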

Gunicorn has a server hook called worker_abort (see the Gunicorn docs quoted below).

import sys
import threading
import traceback

def worker_abort(worker):
    worker.log.info("worker received abort signal")
    # Map thread ids to names so the dump is readable.
    id2name = {th.ident: th.name for th in threading.enumerate()}
    code = []
    # sys._current_frames() yields the current stack frame of every thread.
    for thread_id, frame in sys._current_frames().items():
        code.append("\n# Thread: %s(%d)" % (id2name.get(thread_id, ""), thread_id))
        for filename, lineno, name, line in traceback.extract_stack(frame):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % line.strip())
    worker.log.debug("\n".join(code))
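Note that the per-thread dump is logged at DEBUG level, so it will only appear if Gunicorn runs with --log-level debug; with the default INFO level, log it via worker.log.critical instead, as in the answer above.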

Called when a worker received the SIGABRT signal. This call generally happens on timeout. The callable needs to accept one instance variable for the initialised Worker.

Sources:

http://docs.gunicorn.org/en/stable/settings.html
https://github.com/benoitc/gunicorn/issues/1493
https://github.com/benoitc/gunicorn/blob/master/examples/example_config.py

– Sami Start