14

Specifically, I'm running a Flask app with default workers in gunicorn. I'm trying to figure out how to debug / trace what is happening when a worker is killed due to timeout while serving a request. Is there a way to get a stack trace or profile the request to debug this?

krishonadish
  • 959
  • 2
  • 9
  • 18

3 Answers3

15

Honestly a very good question, I was not sure if it is even possible to achieve the same or not. But as I started digging I found lot of interesting threads

Showing the stack trace from a running Python application

Get stacktrace from stuck python process

https://github.com/khamidou/lptrace

https://gist.github.com/reywood/e221c4061bbf2eccea885c9b2e4ef496

So first I created a simple flask app with below code

app.py

from flask import Flask
app = Flask(__name__)
import time

def a():
  b()

def b():
  c ()

def c():
   i = 0
   while i < 900:
       time.sleep(1)
       i += 1

@app.route('/', defaults={'path': ''})
@app.route('/<path:path>')
def catch_all(path):
    a()
    return 'You want path: %s' % path

if __name__ == '__main__':
    app.run()

wsgi.py

from app import app

if __name__ == "__main__":
    app.run()

Now running the app like below and doing curl localhost:8000/abc

$ gunicorn wsgi:app
[2019-08-01 08:19:06 +0000] [26825] [INFO] Starting gunicorn 19.9.0
[2019-08-01 08:19:06 +0000] [26825] [INFO] Listening at: http://127.0.0.1:8000 (26825)
[2019-08-01 08:19:06 +0000] [26825] [INFO] Using worker: sync
[2019-08-01 08:19:06 +0000] [26828] [INFO] Booting worker with pid: 26828
[2019-08-01 08:19:40 +0000] [26825] [CRITICAL] WORKER TIMEOUT (pid:26828)
[2019-08-01 08:19:40 +0000] [26828] [INFO] Worker exiting (pid: 26828)
[2019-08-01 08:19:40 +0000] [26832] [INFO] Booting worker with pid: 26832

Now what we need is a hook which can be called before the worked is killed. gunicorn supports server events in the configuration file

So now we create a config file

gunicorn_config.py

timeout = 3

def worker_abort(worker):
    pid = worker.pid
    print("worker is being killed - {}".format(pid))

And our output is now

$ gunicorn -c gunicorn_config.py wsgi:app
[2019-08-01 08:22:17 +0000] [26837] [INFO] Starting gunicorn 19.9.0
[2019-08-01 08:22:17 +0000] [26837] [INFO] Listening at: http://127.0.0.1:8000 (26837)
[2019-08-01 08:22:17 +0000] [26837] [INFO] Using worker: sync
[2019-08-01 08:22:17 +0000] [26840] [INFO] Booting worker with pid: 26840
[2019-08-01 08:22:22 +0000] [26837] [CRITICAL] WORKER TIMEOUT (pid:26840)
worker is being killed - 26840
[2019-08-01 08:22:22 +0000] [26840] [INFO] Worker exiting (pid: 26840)
[2019-08-01 08:22:22 +0000] [26844] [INFO] Booting worker with pid: 26844

This is good, now we need mix our previous knowledge on pyrasite and this to get the stack. So we updated the config file like below

gunicorn_config.py

timeout = 3

__code_dump_stack__ = """
import sys, traceback

for thread, frame in sys._current_frames().items():
    print('Thread 0x%x' % thread)
    traceback.print_stack(frame)
    print()
"""

def dump_stack_for_process(pid):
    import pyrasite

    ipc = pyrasite.PyrasiteIPC(pid)
    ipc.connect()
    print(ipc.cmd(__code_dump_stack__))
    ipc.close()

def worker_abort(worker):
    pid = worker.pid
    print("worker is being killed - {}".format(pid))
    dump_stack_for_process(pid)

And now our output is

$ [2019-08-01 08:25:29 +0000] [26848] [INFO] Starting gunicorn 19.9.0
[2019-08-01 08:25:29 +0000] [26848] [INFO] Listening at: http://127.0.0.1:8000 (26848)
[2019-08-01 08:25:29 +0000] [26848] [INFO] Using worker: sync
[2019-08-01 08:25:29 +0000] [26851] [INFO] Booting worker with pid: 26851
[2019-08-01 08:25:38 +0000] [26848] [CRITICAL] WORKER TIMEOUT (pid:26851)
worker is being killed - 26851
Thread 0x7ff0a7a4b700
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "<string>", line 72, in run
  File "<string>", line 92, in on_command
  File "<string>", line 6, in <module>

Thread 0x7ff0ac512700
  File "/home/vagrant/.local/bin/gunicorn", line 11, in <module>
    sys.exit(run())
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 61, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/base.py", line 223, in run
    super(Application, self).run()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 203, in run
    self.manage_workers()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 545, in manage_workers
    self.spawn_workers()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 616, in spawn_workers
    self.spawn_worker()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 134, in init_process
    self.run()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 124, in run
    self.run_for_one(timeout)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 68, in run_for_one
    self.accept(listener)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 30, in accept
    self.handle(listener, client, addr)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 176, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/vagrant/remotedebug/app.py", line 20, in catch_all
    a()
  File "/home/vagrant/remotedebug/app.py", line 6, in a
    b()
  File "/home/vagrant/remotedebug/app.py", line 9, in b
    c ()
  File "/home/vagrant/remotedebug/app.py", line 14, in c
    time.sleep(1)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 195, in handle_abort
    self.cfg.worker_abort(self)
  File "gunicorn_config.py", line 23, in worker_abort
    dump_stack_for_process(pid)
  File "gunicorn_config.py", line 17, in dump_stack_for_process
    print(ipc.cmd(__code_dump_stack__))
  File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 161, in cmd
    return self.recv()
  File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 174, in recv
    header_data = self.recv_bytes(4)
  File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 187, in recv_bytes
    chunk = self.sock.recv(n - len(data))


[2019-08-01 08:25:38 +0000] [26851] [INFO] Worker exiting (pid: 26851)
[2019-08-01 08:25:38 +0000] [26862] [INFO] Booting worker with pid: 26862

The stack trace is large but it gives us what we need

Tarun Lalwani
  • 142,312
  • 9
  • 204
  • 265
  • is pyrasite needed here? from some experimentation, worker_abort seems to run on the worker process – slushi Dec 07 '19 at 02:44
  • were you able to get the stack trace from that without `pyrasite`? – Tarun Lalwani Dec 07 '19 at 10:38
  • 1
    For what it's worth I just tried with `exec(__code_dump_stack__)` instead of the few pyrasite lines and it prints a correct stacktrace. The docs also mention that `worker_abort` runs in the worker process so it makes sense. – ldirer Sep 23 '21 at 10:28
12

Just wanted to add to the accepted answer that there's, also, another way to get similar results by using the built-in Python 3 module faulthandler introduced in v3.3 https://docs.python.org/dev/library/faulthandler.html

You can add the following two lines to wsgi.py mentioned in the accepted answer:

import faulthandler
faulthandler.enable()

When timeout occurs the following output will be printed to stderr:

[2020-12-24 13:38:32 +0000] [31304] [INFO] Starting gunicorn 20.0.4
[2020-12-24 13:38:32 +0000] [31304] [INFO] Listening at: http://0.0.0.0:8888 (31304)
[2020-12-24 13:38:32 +0000] [31304] [INFO] Using worker: sync
[2020-12-24 13:38:32 +0000] [31307] [INFO] Booting worker with pid: 31307
[2020-12-24 13:38:55 +0000] [31304] [CRITICAL] WORKER TIMEOUT (pid:31307)
Fatal Python error: Aborted

Current thread 0x00007f411d781700 (most recent call first):
  File "/tmp/app.py", line 14 in c
  File "/tmp/app.py", line 9 in b
  File "/tmp/app.py", line 6 in a
  File "/tmp/app.py", line 20 in catch_all
  File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 1936 in dispatch_request
  File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 1950 in full_dispatch_request
  File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 2447 in wsgi_app
  File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 2464 in __call__
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 175 in handle_request
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 134 in handle
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 29 in accept
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 67 in run_for_one
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 123 in run
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/base.py", line 140 in init_process
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 583 in spawn_worker
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 616 in spawn_workers
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 545 in manage_workers
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 202 in run
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/app/base.py", line 72 in run
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/app/base.py", line 228 in run
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 58 in run
  File "/tmp/venv/bin/gunicorn", line 8 in <module>
[2020-12-24 13:38:55 +0000] [31307] [INFO] Worker exiting (pid: 31307)
[2020-12-24 13:38:55 +0000] [31503] [INFO] Booting worker with pid: 31503
dchrome
  • 255
  • 2
  • 8
0

Also add to the accepted answer, I find import traceback; traceback.print_stack() works for me since worker_abort runs int he context of worker process. So, just need to change the gunicorn_config.py to something like this

import traceback

timeout = 3

def worker_abort(worker):
    pid = worker.pid
    print("worker is being killed - {}".format(pid))
    traceback.print_stack()
Gang Peng
  • 21
  • 3