
I have the following piece of code running inside a thread; 'executable' produces a unique string output for each input 'url':

from subprocess import Popen, PIPE

p = Popen(["executable", url], stdout=PIPE, stderr=PIPE, close_fds=True)
output, error = p.communicate()
print output

When the above code is executed for multiple input 'urls', the subprocess's 'output' is not consistent. For some of the urls, the subprocess terminates without producing any 'output'. I tried printing p.returncode for each failed 'p' instance (the failed urls are not consistent across runs either) and got '-11' as the return code, with 'error' being an empty string. Can someone please suggest a way to get consistent behavior/output for each run in a multithreaded environment?

  • We need more information about your executable and your python program in general. It sounds like it is doing something wrong. Are you sure it is returning properly for those urls you are sending it? Can you do what the executable is doing directly inside python? You could then use the multiprocessing module. – William Denman Nov 18 '13 at 13:49
  • @WilliamDenman: hi..executable is a 'C' program binary which produces a JSON string output on stdout for a given input 'url'.. and the rest of the python code just parses that output string.. let me know if you need further information.. –  Nov 18 '13 at 14:08
  • @WilliamDenman I am sure that the 'C' executable binary is working fine, as in the case of serial execution the output produced for each of the urls is consistent. I cannot replace the C executable with python code.. –  Nov 18 '13 at 14:21
  • "print" is not thread-safe... – dbra Nov 18 '13 at 14:35
  • How are you 'executing for multiple input urls'? Please put the entire code of how you are doing this. The way you have it in your question, p.communicate() will block until the process is finished and thus will not perform any work in parallel. – William Denman Nov 18 '13 at 14:42
  • @dbra: hi..any solution to solve that issue in python? –  Nov 18 '13 at 14:43
  • @WilliamDenman: I am doing it this way: `urls = ['url1','url2','url3'] for url in urls: Worker(url)` Worker is a thread which executes the code mentioned in the question.. and you are right about p.communicate(). You might notice here that I still get the work done in parallel, because each thread executes a single subprocess instance which processes the individual 'url' assigned to that particular thread.. –  Nov 18 '13 at 14:58
  • @SagarG: You could push your output on a queue (list) and print it with an extra thread, or you could synchronize the print with a semaphore, or you could use sys.stdout.write, which is atomic AFAIK. – dbra Nov 18 '13 at 15:06
  • @dbra: The problem does not seem to be with 'print' being not thread-safe, because when I logged each thread's output in a file, each output was written correctly (I am not really worried about the order of the output). Even for the failed 'urls' I see the exception messages getting written.. –  Nov 18 '13 at 15:18
  • You really need to include your full code in the question. I have a feeling that @dbra is on the right track. But without the definition of your Worker function, we really will just be guessing at a solution. – William Denman Nov 18 '13 at 16:07
  • @WilliamDenman: Please have a look at the code below mentioned by "Sebastian".. It produces '-11' as the return_code for a few of the subprocesses.. I am now trying to figure out what causes the sub-processes to return '-11'.. Is it the 'C executable' or 'python-multithreading'? –  Nov 19 '13 at 08:13
  • I am now almost positive that it is your C program that is doing something wonky. What version of Python are you using? Try with the most recent 3.2.x and see if you get the same result. Did you write the C program yourself? If not, then there is no way that you can know for sure that it isn't doing something that isn't thread-safe. – William Denman Nov 19 '13 at 10:23

2 Answers


-11 as a return code might mean that the C program is not fine, e.g., you are starting too many subprocesses and that causes SIGSEGV in the C executable. You can limit the number of concurrent subprocesses using multiprocessing.ThreadPool, concurrent.futures.ThreadPoolExecutor, or a threading + Queue-based solution:

#!/usr/bin/env python
from multiprocessing.dummy import Pool # uses threads
from subprocess import Popen, PIPE

def get_url(url):
    p = Popen(["executable", url], stdout=PIPE, stderr=PIPE, close_fds=True)
    output, error = p.communicate()
    return url, output, error, p.returncode

pool = Pool(20) # limit number of concurrent subprocesses
for url, output, error, returncode in pool.imap_unordered(get_url, urls):
    print("%s %r %r %d" % (url, output, error, returncode))

Make sure the executable can be run in parallel, e.g., that it doesn't use some shared resource. To test, you could run in a shell:

$ executable url1 & executable url2

Could you please explain more about "you are starting too many subprocesses and it causes SIGSEGV in the C executable" and possibly a solution to avoid that?

Possible problem:

  • "too many processes"
  • -> "not enough memory in the system or some other resource"
  • -> "trigger the bug in the C code that otherwise is hidden or rare"
  • -> "illegal memory access"
  • -> SIGSERV

The solution suggested above is:

  • "limit number of concurrent processes"
  • -> "enough memory or other resources in the system"
  • -> "bug is hidden or rare"
  • -> no SIGSERV

Understand what a SIGSEGV runtime error is in C++. In short, your program is killed with that signal if it tries to access memory that it is not supposed to. Here's an example of such a program:

/* try to fail with SIGSEGV sometimes */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
  char *null_pointer = NULL;

  srand((unsigned)time(NULL));

  if (rand() < RAND_MAX/2) /* simulate some concurrent condition 
                              e.g., memory pressure */
    fprintf(stderr, "%c\n", *null_pointer); /* dereference null pointer */

  return 0;
}

If you run it with the above Python script, it will return -11 occasionally.
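On the Python side you can verify that -11 really corresponds to SIGSEGV: Popen.returncode is the negated signal number when the child is killed by a signal. A minimal check (POSIX-only, since the constant comes from the signal module):

import signal

returncode = -11  # e.g., p.returncode after p.communicate()
# a negative returncode means the child was killed by that signal
if returncode < 0 and -returncode == signal.SIGSEGV:
    print("child was killed by SIGSEGV")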

Also, p.returncode is not sufficient for debugging purposes. Is there any other option to get more DEBUG info to get to the root cause?

I won't exclude the Python side completely, but it is most likely that the problem is in the C program. You could use gdb to get a backtrace, to see where in the call stack the error comes from.
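For example, assuming core dumps are enabled and your system writes the core file to the current directory (the exact core file name and location vary by system):

$ ulimit -c unlimited        # allow core dumps in this shell
$ executable crashing_url    # reproduce the crash; 'crashing_url' is a placeholder
$ gdb executable core        # load the executable and its core dump
(gdb) bt                     # print the backtrace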

jfs
  • I am using the "subprocess + threading (manual pool) solution using Queue" approach with pool_size = 100.. Could you please explain more about "you are starting too many subprocesses and it causes SIGSEGV in the C executable" and possibly a solution to avoid that.. –  Nov 19 '13 at 07:59
  • I tried the above code and confirmed that it returns -11 for a few subprocesses.. For the failed cases, 'output' and 'error' print empty strings in `print("%s %r %r %d" % (url, output, error, returncode))`. Also, p.returncode is not sufficient for debugging purposes.. Is there any other option to get more DEBUG info to get to the root cause? –  Nov 19 '13 at 08:19
  • @SagarG: I've added answers to your questions from the comments. – jfs Nov 20 '13 at 01:31

The return code of -11 seems to indicate that something is not right with your C program.

Generally, if you are trying to use multiple threads, you should know exactly how the program you are calling is implemented. If not, you will encounter strange and obscure bugs like this.

If you do not have access to the source of the C executable, you will either have to write your own thread-safe version in C or, as I would suggest, implement the external program as a function in Python. Then you can parallelize it with the multiprocessing module (see the sketch below).

Python is very good at creating and analysing JSON and it might be a good exercise to re-implement the C program.
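A minimal sketch of that approach; get_json here is a hypothetical stand-in for whatever the C executable actually computes:

#!/usr/bin/env python
from multiprocessing import Pool  # worker processes sidestep the GIL

def get_json(url):
    # hypothetical pure-Python replacement for the C executable:
    # produce a unique JSON string for each input url
    return url, '{"url": "%s"}' % url

if __name__ == '__main__':
    urls = ['url1', 'url2', 'url3']
    pool = Pool(4)  # number of worker processes
    for url, output in pool.imap_unordered(get_json, urls):
        print("%s %s" % (url, output))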

William Denman