I am running into an intermittent deadlock in my code, which runs on Python 3.8. The implementation starts a separate Process to perform some operations on a PDF/XPS file and then return the results. Occasionally the parent never receives a result, and I cannot work out why. I cannot show the entire implementation, but it is structured like this:

def parent_function():

    ... (other code)

    results_queue = multiprocessing.Queue()
    child_process = multiprocessing.Process(target=_process_pdf_pages, args=(original_file, arg1, arg2, arg3,..., results_queue))
    child_process.start()
    logger.info('BEGIN results = results_queue.get()')
    results = results_queue.get()
    logger.info('END results = results_queue.get()')

    ... (other code)


def _process_pdf_pages(original_file, arg1, arg2, arg3,..., results_queue):
    try:
        logger.info('{} Started reading PDF/XPS file {}'.format(dt.datetime.now(), original_file))
        ... (other code)
        logger.info('{} Finished reading PDF/XPS file {}'.format(dt.datetime.now(), original_file))

        ... (other code)

        logger.info('Child process returning result')
        results_queue.put((arg1, arg2, arg3...))
    
    except Exception as e:
        logger.error('Child process encountered error: {}'.format(e))
        logger.error(traceback.format_exc())
        logger.info('Child process returning result')
        results_queue.put((arg1, arg2, arg3...))

Whenever this code deadlocks, I see the following lines written in the logs and nothing after this:

2023-06-20 14:12:55 BEGIN results = results_queue.get()
2023-06-20 14:12:55 2023-06-20 14:12:55.075017 Started reading PDF/XPS file my_file.pdf
2023-06-20 14:12:55 2023-06-20 14:12:55.745496 Finished reading PDF/XPS file my_file.pdf

Strangely, I do not see the message that the child process logs just before it calls .put on the queue. It still looks like a deadlock, though: I observe no CPU usage that would indicate the child process is still busy, and I know from experience that these files take only a few seconds to process.

Is there anything I'm doing wrong with the order of operations that is causing this problem?

Travis Lu
  • You don't need to show your entire implementation, but it's helpful to provide a minimal reproducible example -- **runnable** code that reproduces the problem. We'll have a much easier time helping you if we can run and reproduce the problem locally. – larsks Jun 20 '23 at 15:26
  • Hi Larsks. Unfortunately even I am having trouble reproducing this. It is happening in our production environment and I was not able to reproduce it locally or in our UAT environment using the same files and similar setup. – Travis Lu Jun 20 '23 at 15:44
  • I should add that it appears to happen only in the production environment, when a lot of copies of this program (in docker containers) are processing files at once. However, I was unable to reproduce it by doing the same outside of the production environment, although the hardware setup is not identical and that may be why. – Travis Lu Jun 20 '23 at 15:48
  • It may be in the `... (other code)` between log messages. Which operating system? They'll have tools to inspect processes. After starting the process you could print its pid and then - as a Linux example - `ps -p <pid> -o state,wchan` to see its state and where it is waiting in the kernel. You could put a timeout on the `.get()` so that you notice the problem early. And as a dirty hack, maybe the problem pdf could be reprocessed? – tdelaney Jun 20 '23 at 15:57
  • Hi, yes I'm using Linux, and yeah I'm also starting to suspect the process is getting killed for some reason. I will get it to print out the pid and try. – Travis Lu Jun 20 '23 at 16:06
  • Given that the child process doesn't reach its exception handler or its last regular logger.info message, it is quite natural that the main process waits forever. Can you try to find the precise line at which the child process stops? It must be in your last `... (other code)` block. Is it always the same line? – julaine Jun 21 '23 at 06:13
  • (I realize you said it does not always happen but I wonder if it is always the same line of code reached when it does happen) – julaine Jun 21 '23 at 06:15
  • Can you install a signal handler in your child-process so we can see if it receives anything from the outside? https://stackoverflow.com/a/1112350/7465516 – julaine Jun 21 '23 at 06:25
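
The suggestions in the comments above (print the pid, put a timeout on `.get()`, install a signal handler in the child) could be combined roughly as follows. This is only a minimal sketch, not the asker's implementation: `_worker`, `get_result_or_exitcode`, `run_once`, the `crash` flag, and the poll interval are all hypothetical names and values chosen for illustration, and the `fork` start method is assumed because the question is about Linux. The self-inflicted SIGKILL only simulates a child being killed from outside; nothing in the question confirms that this is the actual cause.

```python
import multiprocessing
import os
import queue
import signal


def _install_signal_logging():
    """Log catchable termination signals in the child, then die as usual.

    SIGKILL cannot be caught, so it will never show up here.
    """
    def _handler(signum, frame):
        print('child {} received signal {}'.format(os.getpid(), signum), flush=True)
        signal.signal(signum, signal.SIG_DFL)  # restore the default action
        os.kill(os.getpid(), signum)           # re-deliver so the process still exits

    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        signal.signal(sig, _handler)


def _worker(results_queue, crash):
    """Hypothetical stand-in for the real child function."""
    _install_signal_logging()
    if crash:
        # Simulate an external kill before put() is ever reached.
        os.kill(os.getpid(), signal.SIGKILL)
    results_queue.put('ok')


def get_result_or_exitcode(child_process, results_queue, poll_seconds=0.2):
    """Return ('result', value), or ('died', exitcode) if the child exited first."""
    while True:
        try:
            return 'result', results_queue.get(timeout=poll_seconds)
        except queue.Empty:
            if not child_process.is_alive():
                # The child may have put its result just before exiting,
                # so try once more before giving up.
                try:
                    return 'result', results_queue.get(timeout=poll_seconds)
                except queue.Empty:
                    return 'died', child_process.exitcode


def run_once(crash):
    ctx = multiprocessing.get_context('fork')  # assumption: Linux/Unix only
    results_queue = ctx.Queue()
    child_process = ctx.Process(target=_worker, args=(results_queue, crash))
    child_process.start()
    print('started child pid {}'.format(child_process.pid), flush=True)
    outcome = get_result_or_exitcode(child_process, results_queue)
    child_process.join()
    return outcome


if __name__ == '__main__':
    print(run_once(crash=False))
    print(run_once(crash=True))
```

A negative `exitcode` of `-N` means the child was terminated by signal `N`, so logging it in the parent would distinguish a killed child from one that is genuinely stuck, and the parent no longer waits forever in either case.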
