4

I'm running a backup script that launches child processes to perform backups by rsync. However I have no way to limit the number of rsyncs it launches at a time.

Here's the code I'm working on at the moment:

print "active_children: ", multiprocessing.active_children()
print "active_children len: ", len(multiprocessing.active_children())
while len(multiprocessing.active_children()) > 49:
   sleep(2)
p = multiprocessing.Process(target=do_backup, args=(shash["NAME"],ip,shash["buTYPE"], ))
jobs.append(p)
p.start()

This is showing a maximum of one child when I have hundreds of rsyncs running. Here's the code that actually launches the rsync (from inside the do_backup function), with command being a variable containing the rsync line:

print command
subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
return 1

If I add a sleep(x) to the do_backup function, it will show up as an active child while it's sleeping. Also, the process table is showing the rsync processes as having a PPID of 1. I'm assuming from this that the rsync splits off and is no longer a child of python which allows my child process to die so I can't count it anymore. Does anyone know how to keep the python child alive and counted until the rsync is complete?

nnnmmm
MVanOrder

3 Answers

6

Multiprocessing Pool

Have you thought about using multiprocessing.Pool? It allows you to define a fixed number of worker processes which are used to carry out the jobs you want. The key here is the fixed number, which gives you full control over how many instances of rsync you will be launching.

Looking at the example provided in the documentation I linked, you first declare a Pool of n processes, and then you can decide whether to map() or apply() (with their respective _async() siblings) your job to the pool.

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes

    pool.apply_async(f, (10,))    # evaluate "f(10)" asynchronously
    ...
    pool.map(f, range(10))

The obvious advantage here is that you will never unexpectedly fork-bomb your machine as you will spawn only the requested n processes.

Running Your rsync

Your process spawning code would then become something like:

from multiprocessing import Pool
import time

def do_backup(arg1, arg2, arg3, ...):
    # Do stuff

if __name__ == '__main__':
    # Start a Pool with 4 processes
    pool = Pool(processes=4)
    jobs = []

    for ... :
        # Run the function
        proc = pool.apply_async(func=do_backup, args=(shash["NAME"],ip,shash["buTYPE"], ))
        jobs.append(proc)

    # Wait for jobs to complete before exiting
    while(not all([p.ready() for p in jobs])):
        time.sleep(5)

    # Safely terminate the pool
    pool.close()
    pool.join()
JoErNanO
  • You're missing the bit of magic that'll allow him to actually run a non-python-interpreter process (such as `rsync`): `os.execv` or similar – loopbackbee Oct 28 '14 at 14:05
  • @goncalopp I'm assuming that bit of code already exists in the OP's `do_backup()` function. :) – JoErNanO Oct 28 '14 at 14:15
  • Your answer is more Pythonic than mine +1 – Alex W Oct 28 '14 at 14:18
  • @JoErNanO Yes, of course - but it's the `do_backup` he's having trouble with in the first place! :) As it stands, if you applied your code to his `do_backup`, it'd probably lead to the exact same behaviour he's getting - note that he's not calling `wait`, so the subprocesses in the `Pool` would return immediately while `rsync` is still running. – loopbackbee Oct 28 '14 at 14:29
  • @goncalopp Ah OK i get it now. Basically as his code stands he is forking twice: once for `Popen` and once for `rsync`. So `Popen` will exit while `rsync` is still running. Let me mull this over. – JoErNanO Oct 28 '14 at 14:39
  • It may also be worthwhile to mention that doing another `Popen` (as opposed to `execv`) will duplicate the number of processes unnecessarily. – loopbackbee Oct 28 '14 at 14:40
  • This worked for keeping a count on child processes, but Popen doesn't like to run when I used a pool. Not sure why, I may look into it later. Adding .wait() per goncalopp's reply fixed the issue. – MVanOrder Oct 28 '14 at 14:47
2

Let's clear up some misconceptions first

I'm assuming from this that the rsync splits off and is no longer a child of python which allows my child process to die so I can't count it anymore.

rsync does "split off". On UNIX systems, this is called a fork.

When a process forks, a child process is created - so rsync is a child of python. This child executes independently of the parent - and concurrently ("at the same time").

A process can manage its own children. There are specific syscalls for that, but it's a bit off-topic when talking about python, which has its own high-level interfaces.

If you check subprocess.Popen's documentation, you'll notice that it's not a function call at all: it's a class. By calling it, you'll create an instance of that class - a Popen object. Such objects have multiple methods. In particular, wait will allow you to block your parent process (python) until the child process terminates.
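
To make that concrete, here is a minimal sketch of a do_backup that waits (command being your rsync line, as in your snippet):

import subprocess

def do_backup(command):
    # Launch rsync, then block this child process until rsync exits,
    # so the parent keeps seeing an active child for the whole backup
    proc = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    proc.wait()
    return proc.returncode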


With this in mind, let's take a look at your code and simplify it a bit:

p = multiprocessing.Process(target=do_backup, ...)

Here, you're actually forking and creating a child process. This process is another python interpreter (as with all multiprocessing processes), and will execute the do_backup function.

def do_backup():
    subprocess.Popen("rsync ...", ...)

Here, you are forking again. You'll create yet another process (rsync), and let it run "in the background", because you're not waiting for it.


With all this cleared up, I hope you can see a way forward with your existing code. If you want to reduce its complexity, I recommend you check and adapt JoErNanO's answer, which uses multiprocessing.Pool to automate keeping track of the processes.

Whichever way you decide to pursue, you should avoid forking with Popen to create the rsync process - as that creates yet another process unnecessarily. Instead, check os.execv, which replaces the current process with another.
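
For example, a rough sketch of a do_backup along those lines (os.execvp is the PATH-searching variant of execv; the rsync command string is illustrative):

import os
import shlex

def do_backup(command):
    # Replace this child process with rsync itself: the child's PID is now
    # the rsync PID, so it stays alive (and countable) until rsync finishes
    args = shlex.split(command)   # e.g. "rsync -a /src/ /dst/"
    os.execvp(args[0], args)      # never returns on success

Note that, unlike Popen(..., shell=True), no shell is involved here, so the command has to be a plain rsync invocation rather than a shell pipeline.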

loopbackbee
0

This is not multithreading, but multiprocessing. I'm assuming you're on a Unix system if you're using rsync, although I do believe it can run on Windows systems. In order to control the death of spawned child processes, you must fork them.

There's a good question about doing it in Python here.
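
If you want to see what that looks like at the low level, a minimal sketch on a POSIX system (the rsync arguments are illustrative):

import os

pid = os.fork()
if pid == 0:
    # Child: replace this process with rsync
    os.execvp("rsync", ["rsync", "-a", "/src/", "/dst/"])
else:
    # Parent: wait for that specific child, so it can be tracked and counted
    _, status = os.waitpid(pid, 0)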

Alex W
  • Note `subprocess.Popen` is doing the fork in his code. `fork`ing is very low level, and uncommon in python code - there are higher-level libraries that help with this kind of task (such as `multiprocessing`, as mentioned by JoErNanO) – loopbackbee Oct 28 '14 at 14:01