
An issue I have with Python 3.4's subprocess.Popen:

Very rarely (about once in several thousand calls), Popen seems to create a second forked process in addition to the intended one. That extra process hangs (possibly waiting?), and the intended process becomes a zombie.

Here's the call sequence:

with subprocess.Popen(['prog', 'arg1', 'arg2'], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as p:
    std_out, std_err = p.communicate()
    p.wait()  # redundant: communicate() already waits for the process to exit

Note: the above call sequence is run itself from a forked process (a form of process pooling, see process list below)

The issue happens with multiple programs (7z for example) so I assume the problem is with the caller and not the callee.

prog is zombified, so I assume the p.wait() statement is never reached or not executed properly.

The resulting process list (ps -ef output):

my_user  18219 18212  9 16:16 pts/1    00:18:11 python3 script.py        # original process
my_user  1045  18219  0 16:18 pts/1    00:00:14 python3 script.py        # Intentionally forked from original (poor man's process pool) - Seems to be stuck or waiting
my_user  2834  1045   0 16:18 pts/1    00:00:00 [prog] <defunct>         # Program run by subprocess.popen - Zombie
my_user  2841  1045   0 16:18 pts/1    00:00:00 python3 script.py        # !!!! Should not be here, also stuck or waiting, never finishes

Edited (added code sample as requested): The code in question:

import logging  # needed: logging.critical is used below
import os
import subprocess

pid = os.fork()
if pid == 0:
    # child
    file_name='test.zip'
    out_dir='/tmp'

    while True:
        with subprocess.Popen(['7z', 'x', '-y', '-p', '-o' + out_dir, file_name], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as p:
            try:
                std_out, std_err = p.communicate(timeout=600)
            except subprocess.TimeoutExpired:
                p.kill()
                std_out, std_err = p.communicate()
                logging.critical('7z failed, a timeout has occurred during waiting')
            except:
                p.kill()
                p.wait()
                raise
            return_code = p.poll()

        # do something
else:
    # parent
    wpid, status = os.waitpid(pid, 0)
    exit_code = status >> 8
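As an aside, the status word returned by os.waitpid encodes more than the exit code, and the os module provides portable decoders for it. A minimal sketch (illustrative only, not the production code above):

```python
import os

pid = os.fork()
if pid == 0:
    os._exit(7)  # child: exit immediately with code 7
else:
    wpid, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        # portable alternative to `status >> 8`, which misreads
        # signal deaths and stopped children
        exit_code = os.WEXITSTATUS(status)
        print(exit_code)  # prints 7
```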
Jonathan
  • I don't believe that `Popen()` may create more than one child process. Could you provide [a minimal but complete code example that reproduces the behavior](http://stackoverflow.com/help/mcve)? Does your parent Python script use threads? – jfs Dec 01 '15 at 08:20
  • @J.F.Sebastian - Added the code as requested. – Jonathan Dec 01 '15 at 11:35
  • is there a reason, not to use `multiprocessing/futures` or `subprocess` instead of `os.fork()`? – jfs Dec 01 '15 at 17:30
  • @J.F.Sebastian - the reason is irrelevant, I encountered a weird behavior that I want to understand. That said, I used multiprocessing.Pool instead of fork, but some child processes used multiprocessing.Process to spawn children (a library, not my code), which fails ('daemonic processes are not allowed to have children'), switched to fork to create a pool of my own. Subprocess (Popen, check_output, etc.) will not do in my case because they execute a file and I need a clone of my process (I do use it to run 7z though) I did not try futures. I might give it a go – Jonathan Dec 02 '15 at 19:33
  • after `fork()`: [_"to avoid errors, the child process may only execute **async-signal-safe** operations until such time as one of the **`exec`** functions is called"_](http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html) – jfs Dec 02 '15 at 19:35
  • @J.F.Sebastian As far as I can tell, what I do (subprocess.Popen - which is a fork and exec) is async-signal-safe. And p.communicate() basically reads from pipes, which I believe is also async-signal-safe. – Jonathan Dec 06 '15 at 10:15
  • replace `os.fork()` with `subprocess.Popen()` (move child's code into a separate script) and see whether you can reproduce the error. – jfs Dec 06 '15 at 10:21
  • This would be a very significant change, since I rely on fork's copy-on-write behavior, the child process has all the parent's context such as configuration, and loaded buffers. Using multiprocessing.Process might be easier and interesting to test since context can be passed, but still it is quite a change. – Jonathan Dec 06 '15 at 12:02
  • I'm talking about [mcve](http://stackoverflow.com/help/mcve), not your production code. If you were to reproduce it without `os.fork()`; it might point to the actual issue (the purpose is to get information (insight)). – jfs Dec 06 '15 at 12:31
  • Unfortunately, I can't add any insight, but I have this exact problem in Python 2.7.5. I have multiple workers running in a multiprocessing pool. These workers use `hg identify` in a subprocess `check_output` to get the current hg hash. Perhaps one in a thousand times, the `communicate` part of `check_output` hangs waiting for a result because the hg command has turned into a zombie. The workaround is easy for me - get the hash once at startup. But I believe your problem is quite real! – Jim Hunziker Jan 24 '17 at 18:54

2 Answers


I believe this is an effect of mixing forking and threading, which is a bad thing to do on Linux.

I believe your process is multithreaded once you import the logging module. (In my case, I was sometimes seeing my program hang while waiting on a logging futex, and sometimes hang inside subprocess with the subprocess having become a zombie.) That module uses OS locks to ensure that it can be called in a thread-safe manner. Once you fork, the lock's state is inherited by the child process, so the child (which is single-threaded but inherited the parent's memory) can't acquire the logging lock, because the lock was sometimes already held at the moment of the fork.

(I'm not super confident in my explanation. My problem went away when I switched from using multiprocessing's default fork behavior to using spawn behavior. In the latter, a child does not inherit its parent's memory, and subprocess and logging no longer caused hangs for me.)
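A minimal sketch of that workaround, assuming a simple top-level worker function (the names here are illustrative):

```python
import multiprocessing as mp

def double(x):
    return x * 2

if __name__ == '__main__':
    # 'spawn' starts each child as a fresh interpreter, so no lock state
    # (logging's or anyone else's) is inherited mid-acquisition from the parent.
    ctx = mp.get_context('spawn')
    with ctx.Pool(processes=2) as pool:
        print(pool.map(double, [1, 2, 3]))  # prints [2, 4, 6]
```

Note that with 'spawn' the worker must be importable (defined at module top level), since the child re-imports the main module rather than inheriting its memory.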

Jim Hunziker

subprocess indeed forks before running the command. This is mentioned in PEP 324 (search the PEP for “fork”).

The reason is that the command is run using exec, which replaces the calling process by the executed one.

As you can see, it shares the same pid as the executed script, so it actually is the same process, but it is not the python interpreter that is being run.

So, as long as the child process does not return, the caller python process can't.
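The fork-then-exec sequence that Popen performs can be illustrated with a simplified sketch (the real implementation also sets up pipes, closes file descriptors, and reports exec errors back to the parent):

```python
import os

pid = os.fork()
if pid == 0:
    # child: same process image until exec replaces it; the PID is unchanged
    os.execvp('true', ['true'])
else:
    # parent: fork created a *new* PID; waiting on it prevents a zombie
    wpid, status = os.waitpid(pid, 0)
```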

Valentin Lorentz
  • Of course popen forks and then execs, but from the process list you can see that processes 2834 and 2841 share the same PPID (they have the same parent), not the same PID (they are not the same process). Process 2834 is the one I expected to be forked, but I also got 2841 - two forks instead of one. – Jonathan Nov 30 '15 at 21:50
  • the fork creates a *new* process. The parent python process may continue to run whatever happens in the child process. – jfs Dec 01 '15 at 08:15
  • @J.F.Sebastian - I understand the parent continues, but what it does (or at least meant to do) is to wait for the child (either through communicate or wait methods - see code added to the questions), so the extra child is unexpected. – Jonathan Dec 01 '15 at 11:39
  • @Jonathan: there won't be an extra child (created by a single Popen()) whether the parent waits or does something else. – jfs Dec 01 '15 at 17:24
  • @J.F.Sebastian - I know there shouldn't be another process, this is the reason I submitted this question here. You can see the code, and the resulting process list. Also, if I add prints everywhere it seems the original process is stuck in the p.communicate line, so technically no other statements execute. Nevertheless the extra process exists. – Jonathan Dec 02 '15 at 18:32
  • @Jonathan: `ps -ef` output does not show two processes: it shows one dead process (zombie -- just an entry in a process table) and one alive. Though it also shouldn't happen (`.poll()`, `.wait()` (it is also called by `.communicate()`) should reap the dead process before a new one is started. – jfs Dec 02 '15 at 18:43
  • @J.F.Sebastian That is exactly what I asked, it seems that the parent process is stuck in the communicate() call for the first child process, even though it is finished (the zombie - pid 2834), and I haven't the faintest idea how and why the second child is created (pid 2841). You can see the code, there is no line in the code that would create it. – Jonathan Dec 02 '15 at 19:18