
My application depends on Ghostscript to turn PDF files into a series of images, one per page of each document. This is a simplified version:

import locale
import os

from ghostscript import Ghostscript as gs
from ghostscript import cleanup
from cv2 import imread, IMREAD_GRAYSCALE as GRAY
from multiprocessing import cpu_count

args = [
    "",
    "-q", "-r300", "-dNOPAUSE",
    "-sDEVICE=pgmraw",
    "-sOutputFile=%d.pgm",
    "-dNumRenderingThreads=" + str(cpu_count()),
    "-f", "_.pdf"  # filename will always be "_.pdf"
]
# The ghostscript bindings expect byte strings in the locale's encoding
encoding = locale.getpreferredencoding()
args = [a.encode(encoding) for a in args]

def pdftoimarray():
    cleanup()  # release any previous Ghostscript instance
    gs(*args)
    # %d.pgm produces 1.pgm, 2.pgm, ...; sort numerically so pages stay in order
    pages = sorted((f for f in os.listdir() if f.endswith(".pgm")),
                   key=lambda f: int(f.split(".")[0]))
    return [imread(filename, GRAY) for filename in pages]

(I removed the filesystem cleanup at the end on purpose: it's not really important for this question.)

The problem is that I can't really trust the source of these documents, and some of them may be faulty. Running some tests, I discovered that some of these bad documents make Ghostscript actually segfault, which in turn crashes my entire application.

Normally, a segfault is a very serious event that can't be recovered from, so I'm skeptical that it's even possible to trap it. But in my case it shouldn't be that serious: assuming my program is still in a valid state, I could just flag the document as bad and move on.

Question: Can I somehow trap this segmentation fault in my dependency and recover from it?

This has been asked before in Segmentation Fault Catch, but the only answer there is wrong: it suggests trapping the signal with signal.signal, while the documentation clearly says that catching synchronous signals such as SIGSEGV that way makes little sense. The same documentation points to faulthandler, which can't actually trap the signal either: it just provides a better error message if one arrives.
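
That said, faulthandler is still worth enabling for diagnosis. A minimal sketch (it cannot prevent the crash; it only dumps the Python traceback when the signal arrives, so at least you can tell which document was being processed):

import faulthandler

# Keep the file object alive for the life of the program: faulthandler
# writes straight to its file descriptor when SIGSEGV (among others) arrives.
crash_log = open("crash.log", "w")
faulthandler.enable(file=crash_log, all_threads=True)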

Which leaves the question of why this isn't a duplicate: I'm somewhat less restricted, because I'm not trying to handle the problem at all. I just want to ignore it and move on. Any pointers on avoiding the segfault in Ghostscript in the first place will also be very well received.

This question is a bit old, but I thought I should share this: I was watching a video about a cool new memory allocator, and in one of the questions from the audience the author explains that he "installs a segfault handler", which is very much what I'm interested in. I still don't know exactly how he does it, so this doesn't answer my question completely, but it gives me a good place to start researching. I'll post an answer here if I manage to figure it out myself.

Here is the video (the link is timestamped to the moment he answers the question I'm talking about): https://youtu.be/c1UBJbfR-H0?t=2058

Not a real meerkat
  • I don't think you can ignore it. Your code attempted to access memory that doesn't belong to it. It should immediately be killed by the OS, in this case. – ForceBru Oct 29 '18 at 18:32
  • Yes, you can trap that, but you can't recover from it. Actually, that's the least of your problems: assume someone creates a file that you then pass to Ghostscript, and Ghostscript has some kind of buffer overflow or similar bug, which then lets the attacker take over your system. Passing untrusted data to broken software is your problem. – Ulrich Eckhardt Oct 29 '18 at 18:38
  • @ForceBru I was counting on the OS simply killing the actual offender (the Ghostscript dependency) and leaving the Python interpreter alone (which is the reason a library like `faulthandler` would be able to actually do some stuff after receiving the signal, before exiting). Am I wrong in my assumption? – Not a real meerkat Oct 29 '18 at 18:39
  • @CássioRenan, how (and why) would you detect and isolate the library that was responsible? Everything is executed by the Python interpreter, and it’s the interpreter that interacts with the OS, so it’s the interpreter that gets killed. – ForceBru Oct 29 '18 at 18:41
  • You'll probably want to invoke Ghostscript as an external process so that it's isolated from Python. Then if it crashes, Python won't be corrupted. Running it in-process means your entire Python process is compromised by a segfault. – John Kugelman Oct 29 '18 at 18:42
  • @ForceBru I guess it's probably a misconception on my part about how dynamically loaded shared dependencies work. My thinking was that the OS actually killed the loaded library (since it's shared) and the program responded by exiting. If that's not the case, I have no idea how `faulthandler` would work. (sigh) I'd better do some studying, then... – Not a real meerkat Oct 29 '18 at 18:50
  • @JohnKugelman yes, that is a worthy alternative. It's probably what I'm going to do; a sketch of that approach follows this comment thread. – Not a real meerkat Oct 29 '18 at 18:52
  • @UlrichEckhardt the user the application runs as doesn't really have any permissions on the system, aside from write permission on a temporary directory (where the PDF document is). It is also inside a Docker container. Finally, the documents all come from the same company the application runs in (a big one, sure, but still the same company). An attacker could still jailbreak out of those, but if someone is that determined to break into the server from inside the company, I have far bigger problems. This is also somewhat off-topic for this question, but it's still a useful concern, so thanks. – Not a real meerkat Oct 29 '18 at 18:57
  • @CássioRenan, what do you mean by “the OS killed the library”? It can't kill a library, because only the programmer thinks of code in terms of libraries and such; for the computer, everything is a stream of instructions. Python code is run by the interpreter. This means all the code, including all the libraries, is executed within one program, the Python interpreter, and the OS has no way of telling libraries apart, because it doesn't even know they exist. The only thing it can “see” is a stream of instructions that must be executed to make the interpreter work. – ForceBru Oct 29 '18 at 19:07
  • As regards avoiding the segfaults in Ghostscript: this is clearly a bug, so you should open a bug report. If you don't do that, the problem will never be fixed... (you should also, of course, make sure you are using up-to-date software). – KenS Oct 29 '18 at 19:20
  • @KenS Yes, I know I should, but in order for it to be reproducible I would have to share the documents. Unfortunately I can't do that. – Not a real meerkat Oct 29 '18 at 19:48
  • Well, then the problems likely won't get fixed. The other thing you can do is set -dPDFSTOPONERROR. The default behaviour for Ghostscript is to try to work around any faults found in PDF files (as does Acrobat); the problem with this, obviously, is that it can lead to rogue data being processed, which is the most common cause of segfaults. If you set -dPDFSTOPONERROR, then instead of trying to fix faulty files, Ghostscript will throw an error. Error return values can be detected easily enough. – KenS Oct 30 '18 at 08:10
  • Note that the Ghostscript developers do, obviously, treat files in confidence if requested, and we can set the Bugzilla attachments to private to prevent unauthorised users seeing them. – KenS Oct 30 '18 at 08:11
  • @KenS I'm sure they do, but it is still a breach of contract if I share them, privately or not. `-dPDFSTOPONERROR` seems like a good option. I'll try it, thanks! – Not a real meerkat Oct 30 '18 at 15:36
  • To anyone wondering: at the time, we decided to switch to MuPDF. So while I no longer need this answer for a project, I'm still curious to see if this is possible. – Not a real meerkat Sep 27 '19 at 22:38
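
As John Kugelman suggests above, the simplest way to isolate the crash is to run Ghostscript as an external process and inspect the return code. A minimal sketch (the function name is mine), assuming the gs binary is on the PATH, with KenS's -dPDFSTOPONERROR suggestion folded in:

import subprocess
from multiprocessing import cpu_count

def render_isolated(pdf="_.pdf"):
    # A segfault now only kills this child process, not the interpreter
    result = subprocess.run([
        "gs", "-q", "-r300", "-dNOPAUSE", "-dBATCH",
        "-dPDFSTOPONERROR",  # error out on faulty PDFs instead of guessing
        "-sDEVICE=pgmraw",
        "-sOutputFile=%d.pgm",
        "-dNumRenderingThreads=" + str(cpu_count()),
        "-f", pdf,
    ])
    # On POSIX a negative return code means the child died from a signal
    # (-11 is SIGSEGV); a positive code is an ordinary Ghostscript error.
    if result.returncode != 0:
        raise RuntimeError("bad document: gs exited with %d" % result.returncode)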

1 Answer


I had a similar problem rendering CAD files via pythonocc: sometimes, when opening a file, the script just segfaulted. Really annoying: you had to remove the file manually and restart the batch.

So basically the idea is to start an extra process for the task and check its exit code:

import multiprocessing as mp


def do_stuff_that_segfaults(param):
    call_shitty_library(param)

def main(param):
    # args must be a tuple, even with a single argument
    p = mp.Process(target=do_stuff_that_segfaults, args=(param,))
    p.start()
    p.join()
    if p.exitcode == -11:  # negative exit code: child killed by signal 11 (SIGSEGV)
        do_stuff_in_case_of_segfault()

I've also tried other suggestions, like the Segmentation Fault Catch you linked to, but to no avail. I really would have liked to use mp.Pool() to use all cores, but you don't get the exit status back from mp.Pool().
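
If you do want all cores, one workaround (a sketch only; the helper name is made up) is to keep the one-process-per-file isolation but run up to cpu_count() workers at a time and collect each exit code yourself:

import multiprocessing as mp

def find_bad_files(paths):
    # Run do_stuff_that_segfaults on each path in its own process,
    # cpu_count() at a time; return the paths whose worker segfaulted.
    bad, n = [], mp.cpu_count()
    for i in range(0, len(paths), n):
        batch = paths[i:i + n]
        procs = [mp.Process(target=do_stuff_that_segfaults, args=(p,))
                 for p in batch]
        for proc in procs:
            proc.start()
        for proc, path in zip(procs, batch):
            proc.join()
            if proc.exitcode == -11:  # SIGSEGV
                bad.append(path)
    return bad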

So far the code runs well, and via do_stuff_in_case_of_segfault() I move the files that caused a segfault into another folder, without getting my main script killed.
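
For completeness, a minimal sketch of what such a handler could look like (signature adapted to take the file path; the quarantine folder name is made up):

import shutil
from pathlib import Path

def do_stuff_in_case_of_segfault(path, quarantine="segfaulted"):
    # Move the offending file into a quarantine folder so the batch continues
    Path(quarantine).mkdir(exist_ok=True)
    shutil.move(str(path), str(Path(quarantine) / Path(path).name))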

Arigion