
I have a very large (~400k lines) Python function that I am attempting to define through an exec() call. If I run the following Python script:

exec("""def blah():
    # 400k lines of IF/THEN/ELSE
""", globals())
blah()

By calling Python from the command line, it works fine.

However, if I do the same within a Django instance, it crashes the server without any error message or stack trace, which I can only assume is due to a segmentation fault.

Both Django runserver and the above script are run from the same Conda environment, and both have unlimited stack available (confirmed by printing out resource.getrlimit in Django).
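A minimal sketch of the kind of check described above, assuming a Unix-like system where the standard `resource` module is available. The `describe` helper is a hypothetical name used only for this illustration:

```python
import resource

def describe(limit):
    """Render an rlimit value the way `ulimit` does."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return "{} bytes".format(limit)

# getrlimit returns a (soft, hard) pair for the requested resource.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack soft limit:", describe(soft))
print("stack hard limit:", describe(hard))
```

Note that this reports the process-wide limit; it does not say how large the stack of an individual worker thread is.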

Here's my full ulimit -a output:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515017
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The command sequence to launch the Django server is as follows:

source activate <conda env name>
python manage.py runserver

This is the shell input/output leading to the crash:

(faf) [pymaster@t9dpyths3 faf]$ python manage.py runserver 9000
Watching for file changes with StatReloader
Performing system checks...

System check identified no issues (0 silenced).
August 04, 2020 - 08:25:19
Django version 3.0.3, using settings 'faf.settings'
Starting development server at http://127.0.0.1:9000/
Quit the server with CONTROL-C.
[04/Aug/2020 08:25:25] "GET /projects/ HTTP/1.1" 200 13847
[04/Aug/2020 08:26:49] "PUT /projects/projectname/ HTTP/1.1" 200 76  # This event triggers the exec
(faf) [pymaster@t9dpyths3 faf]$
Darrrrrren
  • I'm hoping you've just been given this code and are in the process of making it more manageable. – hrokr Aug 04 '20 at 03:45
  • Please reproduce the crash and then run `sudo dmesg -T` directly afterwards. Post the output (the most recent lines are sufficient). – toydarian Aug 04 '20 at 08:01
  • @hrokr, I'm productionizing a large tree ensemble model on an AS400 system using a proprietary banking language that only supports very basic programming functions - massive number of IF-THEN-ELSE is basically all I have :-) – Darrrrrren Aug 04 '20 at 12:09
  • @Darrrrrren Someone once said to me that "if you have enough `if` statements you can simulate the universe". I guess it's becoming true :D – Asocia Aug 04 '20 at 12:11
  • @toydarian - nothing in dmesg from this event timestamp - I am running the Django development server - increasing verbosity on the `runserver` command is not yielding any more information. I'm wondering what else I can do – Darrrrrren Aug 04 '20 at 12:33
  • Is `systemd` involved in any place? What is the command you use to run the django development server? Is there no output at all? Nothing on `stdout` or `stderr`, no log file, no `core` dump? What is the return code? How long does the command run until it fails? Sorry, those are a lot of questions... – toydarian Aug 04 '20 at 12:39
  • @toydarian, no problem, I know it's tough to diagnose these problems via text. I've added some more information in the OP. The command does begin to run like it's processing the function - it runs for about 30 seconds before it crashes, which to me feels like a resource issue - but it's weird that it runs fine via the direct Python CLI. If there is a core dump, where would it be? – Darrrrrren Aug 04 '20 at 12:46
  • The coredump would be in the same directory. It would be called `core`. Can you run `echo $?` after the crash? That should just show a number. Post that number. – toydarian Aug 04 '20 at 12:47
  • The number is 245 – Darrrrrren Aug 04 '20 at 13:06
  • That is an odd exit code. My first guess is that somewhere in your code you will find something like `sys.exit(245)`. Can you grep for that number in your code? That would also explain the missing error message. If you find it more than once, add `print` statements to find out which line is the culprit. – toydarian Aug 04 '20 at 13:11
  • Hi @toydarian, I don't have that anywhere in my code. I searched the Django repo on github and didn't find that number in any code either. Some google searches for that exit code have led me to some interesting leads I'm going to pursue – Darrrrrren Aug 04 '20 at 13:23
  • All right, I have done the same thing with the django repo. In `django/utils/autoreload.py` line `240`, you can find a function called `restart_with_reloader` that can lead to arbitrary exit codes. Otherwise django will not return anything outside the range [0-3]. – toydarian Aug 04 '20 at 13:26
  • Do you have any `sys.exit` in your code? Maybe the return code is not hardcoded, but set via a variable or something like that. When you find out what caused it, let me know. I'm curious... :D – toydarian Aug 05 '20 at 05:59
  • You should enable core dumps to debug this. Run `ulimit -c unlimited` before you start the server (from the same shell; it is a per-shell property). Then you will see `core file size` reported as `unlimited` rather than zero as it is now. After the crash, run `coredumpctl` to see the core files generated. If you don't have `coredumpctl` configured on your system, the core file will be generated in your working directory. Open it with gdb and see where it comes from. If there is no core dump after that, it means some code called `exit` explicitly and this is not a crash. Check the exit code of the server (`echo $?`). – Yuri Nudelman Aug 05 '20 at 08:17
  • Thanks @YuriNudelman, that makes a lot of sense, and explains why I'm not seeing core dumps. Unfortunately, I am working on a corporate server with restricted permissions, and for some reason while I'm able to modify stack size, I'm not permitted to change the core size. – Darrrrrren Aug 05 '20 at 11:09
  • @YuriNudelman I'm going to pursue getting the core file size increased - the hard limit is set to 0 in our /etc/security/limits.conf file. Thanks! – Darrrrrren Aug 05 '20 at 11:22
  • Rather sounds like it should be data instead of code... – superb rain Aug 06 '20 at 14:13
  • I was experimenting with python and segfaults, and I always get the "normal" exit code `139` but never `245`. – toydarian Aug 07 '20 at 08:26
  • Okay so I ran the server with `python manage.py runserver --nothreading --noreload` and now I'm getting a "Segmentation Fault" message when the server crashes. so at least I know that's the issue. Looks like with threading and automatic server reloading on code changes it was spawning the process in a new thread that wasn't returning the Segmentation Fault message to the main thread, so it was getting hidden. – Darrrrrren Aug 07 '20 at 12:34
  • What if saving the code (text) in a separate *.py* file and then importing it? – CristiFati Aug 07 '20 at 15:32
  • Hi @CristiFati, that's actually what I've done as a workaround in the meantime, and it does work as I hoped it would. I would still like to know why Django seems to segfault though. – Darrrrrren Aug 07 '20 at 15:46
  • What happens with the memory when you try to *exec* that code? Aren't you running out? Also does that code any processing, database connection, file access, library loading? BTW: Could you share that code? – CristiFati Aug 07 '20 at 16:07
  • Could you please try with the fault handler and then dump the traceback? Which version of Python are you on? – alexisdevarennes Aug 11 '20 at 06:40
  • I see you've updated your bounty description. I'd love to look into why this is happening but would need the traceback. Is there any way to get hold of it? – alexisdevarennes Aug 11 '20 at 13:16

4 Answers


The problem might be related to a reported issue in which int(s), float(s) and similar calls may cause a segmentation fault.

As mentioned here:

Please try setting the environment variable PYTHONMALLOC=debug.

This might allow your code to run without hitting a segmentation fault. If you still get an error, you should be able to catch it using:

PYTHONMALLOC=debug python3 -X tracemalloc=10

You might also want to check out: faulthandler

This module contains functions to dump Python tracebacks explicitly, on a fault, after a timeout, or on a user signal. Call faulthandler.enable() to install fault handlers for the SIGSEGV, SIGFPE, SIGABRT, SIGBUS, and SIGILL signals. You can also enable them at startup by setting the PYTHONFAULTHANDLER environment variable or by using the -X faulthandler command line option.
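As a minimal sketch of enabling faulthandler from code rather than from the command line: the log path below is a hypothetical location chosen for illustration, and writing to a file (instead of the default `sys.stderr`) is an assumption that may help when the server's console output gets lost.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; pick any path the server process can write to.
FAULT_LOG_PATH = os.path.join(tempfile.gettempdir(), "faulthandler.log")

# The file object must stay open for as long as the handler is installed,
# because faulthandler writes to its file descriptor at crash time.
fault_log = open(FAULT_LOG_PATH, "w")

# all_threads=True (the default) dumps the traceback of every thread.
faulthandler.enable(file=fault_log, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
```

In a Django project this would need to run early in the process, for example near the top of manage.py, so that the handlers are installed before the request that triggers the crash.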

Adding this for more clarity since it's related; the following is taken from the answer provided by Darrrrrren and is a tweak to make faulthandler run on threaded django applications:

So I was able to get a stack trace by initializing Python with faulthandler, but additionally I had to run manage.py runserver --nothreading --noreload - for some reason if you do not disable threading with Django, even faulthandler will not print a stack trace.

alexisdevarennes
  • [`tracemalloc`](https://docs.python.org/3/library/tracemalloc.html) is a solid suggestion too! Also follows along the lines of this issue, which may be related: https://stackoverflow.com/questions/60137930/why-docker-django-admin-crash-with-code-245 – ti7 Aug 10 '20 at 20:20
  • Thanks - faulthandler eventually was the way I was able to get a stack trace. It only produced one if I used it in conjunction with the django nothreading option. – Darrrrrren Aug 11 '20 at 14:19

This sounds like a job to divide and conquer!

Split your exec block into parts to find where it fails, catching BaseException rather than Exception and printing progress as you go.

If you believe you're hitting a segfault, you can try to handle it with signal.signal(signalnum, handler), as in the example below.

As they're guaranteed to be a contained block of logic, you could begin new blocks to execute by splitting at def and if statements. If most if statements are at the highest scope, you should be able to split directly on them, otherwise some additional scope detection will be needed.

import signal
import sys

CONTENT_AND_POS = {
    "text_lines": [],    # first attempt is exec("") without if
    "block_line_no": 1,  # first block should be at line 1+
}

def report(text_lines, line_no, msg=""):
    """ display progress to the console """
    print("running code block at {}:{}\n{}".format(
        line_no, msg, text_lines))  # NOTE reordered from args

def signal_handler_segfault(signum, frame):
    """ try to show where the segfault occurred """
    report(
        "\n".join(CONTENT_AND_POS["text_lines"]),
        CONTENT_AND_POS["block_line_no"],
        "SIGNAL {}".format(signum)
    )
    sys.exit("caught segfault")

# initial setup
signal.signal(signal.SIGSEGV, signal_handler_segfault)
path_code_to_exec = sys.argv[1]  # consider argparse
print("reading from {}".format(path_code_to_exec))

# main entrypoint
with open(path_code_to_exec) as fh:
    for line_no, line in enumerate(fh, 1):  # files are iterable by-line
        if line.startswith(("def", "if")):  # new block to try
            # lines keep their trailing newline, so join without a separator
            text_exec_block = "".join(CONTENT_AND_POS["text_lines"])
            try:
                exec(text_exec_block, globals())
            except BaseException as ex:
                report(
                    text_exec_block,
                    CONTENT_AND_POS["block_line_no"],
                    repr(ex))
                # catching BaseException will squash exit, ctrl+C, et al.
                sys.exit("caught BaseException")
            # reset for the next block
            CONTENT_AND_POS["block_line_no"] = line_no  # new block begins
            CONTENT_AND_POS["text_lines"].clear()
        # continue with new or existing block
        CONTENT_AND_POS["text_lines"].append(line)

    # execute the last block (which is otherwise missed)
    exec("".join(CONTENT_AND_POS["text_lines"]), globals())

print("successfully executed {} lines".format(line_no))

If this still ends silently, output the line number of each block before executing it. You may need to write to a file or to sys.stdout/sys.stderr and flush explicitly to ensure the output isn't lost when the process dies.
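One way to sketch that progress logging, assuming buffered output is what gets lost in the crash. The helper name `log_block_start` is hypothetical and not part of the script above:

```python
import os
import sys

def log_block_start(line_no, stream=sys.stderr):
    """Record which block is about to run, pushing the text out immediately."""
    stream.write("about to exec block starting at line {}\n".format(line_no))
    stream.flush()  # empty Python's own buffer
    try:
        os.fsync(stream.fileno())  # ask the OS to commit it, where supported
    except (OSError, ValueError, AttributeError):
        pass  # terminals, pipes and in-memory streams may not support fsync

log_block_start(1)
```

Calling it just before each exec means the last line written identifies the block that was running when the process died.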

primital
ti7
  • Thanks for your help. I hadn't heard of `signal.signal(...)` before - I did set up the handler, but it's not getting called - the script just crashes with the typical `Segmentation Fault` message if I launch the server without threading. Hoping to get core dumps enabled shortly... – Darrrrrren Aug 07 '20 at 12:50
  • Glad to help! I'm surprised adding a `signal` handler doesn't have any effect, which leads me to suspect the fault comes from the Python interpreter itself (some form of bug), wrong signal/causes later, or some binary WSGI voodoo. [faulthandler](https://docs.python.org/3/library/faulthandler.html) may be able to collect some sort of traceback as described [here](https://blog.richard.do/2018/03/18/how-to-debug-segmentation-fault-in-python/) if you're using Python 3 .. but if you have anything like the latest interpreter, expect bugs to be [quite elusive](https://bugs.python.org/issue14010) – ti7 Aug 10 '20 at 07:03

If you're using Python 2 (perhaps accidentally), you're simply passing too much to exec

You can reproduce this as follows (also see a relevant codegolf!)

% python2
>>> exec(
... """if True:
...     pass
... """ * (200 * 1000)  # 400k lines
... )
segmentation fault python2

You should be able to fix this by breaking it up (described in my other answer), or by writing the code to a file and importing it instead (as suggested/already implemented in comments)
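The write-to-a-file-and-import workaround could look roughly like the following Python 3 sketch. The helper `load_function_from_source` and the module name `generated_model` are hypothetical, and the tiny `source` string stands in for the real generated function; whether this avoids the crash in a given setup is not guaranteed.

```python
import importlib.util
import os
import tempfile

def load_function_from_source(source, func_name, module_name="generated_model"):
    """Write source to a .py file, import it, and return the named function."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, module_name + ".py")
    with open(path, "w") as fh:
        fh.write(source)
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # compiles and runs the file as a module
    return getattr(module, func_name)

# Tiny stand-in for the generated 400k-line function.
source = (
    "def blah(x):\n"
    "    if x > 0:\n"
    "        return 'positive'\n"
    "    else:\n"
    "        return 'non-positive'\n"
)
blah = load_function_from_source(source, "blah")
print(blah(3))  # prints "positive"
```

In a real project the file would more likely be written once into the package and imported with a normal import statement.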

This limit on exec should be fixed in Python 3 (RecursionError), but may affect a few unlucky versions (see ticket).

ti7

So I was able to get a stack trace by initializing Python with faulthandler, but additionally I had to run manage.py runserver --nothreading --noreload - for some reason if you do not disable threading with Django, even faulthandler will not print a stack trace.

Fatal Python error: Segmentation fault

Current thread 0x00007fe61836b740 (most recent call first):
  File "/apps/AADD/projects/FAF/Web App/faf/modelling/views.py", line 42 in index
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/base.py", line 113 in _get_response
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34 in inner
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/utils/deprecation.py", line 94 in __call__
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34 in inner
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/utils/deprecation.py", line 94 in __call__
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34 in inner
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/utils/deprecation.py", line 94 in __call__
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34 in inner
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/utils/deprecation.py", line 94 in __call__
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34 in inner
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/utils/deprecation.py", line 94 in __call__
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34 in inner
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/utils/deprecation.py", line 94 in __call__
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34 in inner
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/utils/deprecation.py", line 94 in __call__
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34 in inner
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/base.py", line 75 in get_response
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/handlers/wsgi.py", line 133 in __call__
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/contrib/staticfiles/handlers.py", line 68 in __call__
  File "/apps/AADD/envs/faf/lib/python3.6/wsgiref/handlers.py", line 137 in run
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/servers/basehttp.py", line 197 in handle_one_request
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/servers/basehttp.py", line 172 in handle
  File "/apps/AADD/envs/faf/lib/python3.6/socketserver.py", line 724 in __init__
  File "/apps/AADD/envs/faf/lib/python3.6/socketserver.py", line 364 in finish_request
  File "/apps/AADD/envs/faf/lib/python3.6/socketserver.py", line 351 in process_request
  File "/apps/AADD/envs/faf/lib/python3.6/socketserver.py", line 320 in _handle_request_noblock
  File "/apps/AADD/envs/faf/lib/python3.6/socketserver.py", line 241 in serve_forever
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/servers/basehttp.py", line 216 in run
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/management/commands/runserver.py", line 139 in inner_run
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/management/commands/runserver.py", line 104 in run
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/management/commands/runserver.py", line 95 in handle
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/management/base.py", line 369 in execute
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/management/commands/runserver.py", line 60 in execute
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/management/base.py", line 328 in run_from_argv
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/management/__init__.py", line 395 in execute
  File "/apps/AADD/envs/faf/lib/python3.6/site-packages/django/core/management/__init__.py", line 401 in execute_from_command_line
  File "manage.py", line 17 in main
  File "manage.py", line 21 in <module>
Segmentation fault

If I provide unlimited stack space, the exec() actually works in Django, but only with --nothreading. So I have a hunch that the threads Django spawns are somehow getting a restricted stack size.
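One way to probe that hunch, as a sketch rather than a confirmed fix: threads other than the main one typically get a fixed-size stack chosen when they are created, and the standard library's `threading.stack_size()` sets that size for threads created afterwards. The helper `run_with_stack_size` and the 64 MiB figure are assumptions made for illustration.

```python
import threading

def run_with_stack_size(func, stack_bytes=64 * 1024 * 1024):
    """Run func() in a new thread whose stack is stack_bytes large."""
    result = {}

    def target():
        result["value"] = func()

    previous = threading.stack_size()   # remember the current setting
    threading.stack_size(stack_bytes)   # applies to threads created from now on
    try:
        worker = threading.Thread(target=target)
        worker.start()
    finally:
        threading.stack_size(previous)  # restore the earlier setting
    worker.join()
    return result.get("value")

print(run_with_stack_size(lambda: sum(range(10))))
```

If the exec() call succeeds when wrapped this way inside a threaded server but fails otherwise, that would support the idea that the per-thread stack size is the limiting factor.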

Darrrrrren