
I am a novice user of Python multithreading/multiprocessing, so please bear with me. I would like to solve the following problem and I need some help/suggestions. Let me describe it briefly:

  1. I would like to start a Python script which does something sequentially in the beginning.

  2. After the sequential part is over, I would like to start some jobs in parallel.

    • Assume that there are four parallel jobs I want to start.
    • I would also like to start these jobs on some other machines of the computing cluster using "lsf". My initial script is also running on an "lsf" machine.
    • The four jobs started on the four machines will perform two logical steps, A and B, one after the other.
    • When the jobs start, each begins with logical step A and finishes it.
    • After every one of the 4 jobs has finished step A, it should notify the job which started them all. In other words, the main job is waiting for confirmation from these four jobs.
    • Once the main job receives confirmation from all four jobs, it should notify them to do logical step B.
    • Logical step B will automatically terminate the jobs after finishing the task.
    • The main job waits for all jobs to finish and should then continue with the sequential part.
  3. An example scenario would be:

    • A Python script running on an "lsf" machine in the cluster starts four "tcl shells" on four "lsf" machines.
    • In each tcl shell, a script is sourced to do logical step A.
    • Once step A is done, they should somehow inform the Python script, which is waiting for the acknowledgement.
    • Once the acknowledgement is received from all four, the Python script informs them to do logical step B.
    • Logical step B is also a script which is sourced in each tcl shell; this script also closes the tcl shell at the end.
    • Meanwhile, the Python script is waiting for all four jobs to finish.
    • After all four jobs are finished, it should continue with the sequential part again and finish later on.

Here are my questions:

  1. I am confused about whether I should use multithreading or multiprocessing. Which one suits this problem better? What is the actual difference between the two? I have read about them, but I wasn't able to reach a conclusion.

  2. What is the Python GIL? I also read somewhere that at any one point in time only one thread will execute. I need some explanation here; it gives me the impression that I can't use threads.

  3. Any suggestions on how I could solve my problem systematically and in a more Pythonic way? I am looking for a verbal step-by-step explanation and some pointers to read on for each step. Once the concepts are clear, I would like to code it myself.

Thanks in advance.

boppu
  • Take a look at [celery](http://docs.celeryproject.org/en/latest/index.html#). It should solve most/all of your queries here. – Sanket Sudake Nov 25 '16 at 10:22

2 Answers


1) From the options you listed in your question, you should probably use multiprocessing in this case to leverage multiple CPU cores and compute things in parallel.

2) Going further from point 1: the Global Interpreter Lock (GIL) means that only one thread can actually execute code at any one time.
A simple example for multithreading that pops up often here is having a prompt for user input for, say, an answer to a maths problem. In the background, they want a timer to keep incrementing at one second intervals to register how long the person took to respond. Without multithreading, the program would block whilst waiting for user input and the counter would not increment. In this case, you could have the counter and the input prompt run on different threads so that they appear to be running at the same time.
In reality, both threads are sharing the same CPU resource and are constantly passing a lock (the GIL) backwards and forwards to take turns executing, so this is hopeless if you want to properly process things in parallel. (Note: in reality, you'd just record the time before and after the prompt and calculate the difference rather than bothering with threads.)
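For illustration, here is a minimal sketch of that prompt-and-timer idea (written in Python 2 to match the code below; the maths question and one-second resolution are made up). The timer thread keeps counting while the main thread blocks on raw_input, because a thread blocked on I/O releases the GIL:

import threading
import time

def ticker(stop_event):
    # Count seconds until the main thread signals us to stop.
    seconds = 0
    while not stop_event.is_set():
        time.sleep(1)
        seconds += 1
        print "{} second(s) elapsed...".format(seconds)

stop_event = threading.Event()
timer = threading.Thread(target=ticker, args=(stop_event,))
timer.daemon = True  # don't let this thread keep the interpreter alive
timer.start()

answer = raw_input("What is 7 * 8? ")  # blocks only the main thread
stop_event.set()  # tell the ticker to stop
print "You answered {!r}".format(answer)

As noted above, for a real timer you would just subtract two calls to time.time(); the point here is only that the two threads appear to run simultaneously even though the GIL lets just one execute at a time.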

3) I have made a really simple example using multiprocessing. In this case, I spawn 4 processes that each compute the sum of squares over a randomly chosen range. Each process has its own interpreter and its own GIL, so they execute truly independently, unlike threads. In this example, you can see that all processes start and end at slightly different times, but we can aggregate their results into a single queue object. The parent process will wait for all 4 child processes to return their computations before moving on. You could then repeat the code for func_B (not included below).

import multiprocessing as mp
import time
import random
import sys

def func_A(process_number, queue):
    start = time.time()
    print "Process {} has started at {}".format(process_number, start)
    sys.stdout.flush()
    my_calc = sum([x**2 for x in xrange(random.randint(1000000, 3000000))])

    end = time.time()
    print "Process {} has ended at {}".format(process_number, end)
    sys.stdout.flush()
    queue.put((process_number, my_calc))

def multiproc_master():
    queue = mp.Queue()

    processes = [mp.Process(target=func_A, args=(x, queue)) for x in xrange(4)]
    for p in processes:
        p.start()

    # Drain the queue before joining. Joining a process that has put
    # items on a queue can deadlock until the buffered items are
    # consumed, so fetch all the results first, then join.
    results = [queue.get() for p in processes]
    for p in processes:
        p.join()

    return results

if __name__ == '__main__':
    split_jobs = multiproc_master()
    print split_jobs
roganjosh
  • Roganjosh: I got the idea of what you proposed. In func_A, you calculate the sum of squares over a randomly chosen range; that is a simple Python function you tried. In my case, I want to open a "tclsh" shell on a remote machine and source a tcl script. This script will take some time to run, for instance 1 hour. Should I use subprocess to start the tcl shell, and use pipes to read from stdout and stderr? How do I identify that step A has finished? Should I look for some specific word in stdout to identify that step A has finished? – boppu Nov 25 '16 at 13:18
  • I don't think the complexity of the function is relevant here. Whack a few 0s on and it will still behave the same, i.e. processing time isn't an issue. This code implicitly identifies the end of all child `func_A` calls, since `results` will not be computed until all processes have returned. Beyond what I have here, you might want to look at blocking/non-blocking subprocess calls http://stackoverflow.com/questions/21936597/blocking-and-non-blocking-subprocess-calls. You could spawn blocking processes from the multiprocessing workers, or non-blocking ones with some flag for when they all complete. – roganjosh Nov 25 '16 at 13:26
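To sketch the follow-up question about driving a tcl shell (untested, and the bsub invocation, the script names step_a.tcl/step_b.tcl, and the STEP_A_DONE marker are all assumptions; your step-A script would have to puts such a marker itself), each multiprocessing worker could launch tclsh via subprocess, scan its stdout for the marker, acknowledge through the queue, and then send the step-B command once the event is set:

import subprocess

def run_tcl_job(process_number, queue, proceed):
    # Hypothetical launch command; adapt the bsub/tclsh invocation to
    # your cluster. bufsize=1 gives line-buffered reads of stdout.
    shell = subprocess.Popen(["bsub", "-I", "tclsh"],
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             bufsize=1)

    # Source the step-A script; it is assumed to print "STEP_A_DONE"
    # as its last line of output.
    shell.stdin.write("source step_a.tcl\n")
    shell.stdin.flush()
    for line in iter(shell.stdout.readline, ""):
        if "STEP_A_DONE" in line:
            break  # step A has finished

    queue.put((process_number, "A done"))  # acknowledge to the parent
    proceed.wait()  # block until the main script signals step B

    # step_b.tcl is expected to exit the tcl shell itself, so
    # communicate() returns once the shell terminates.
    shell.stdin.write("source step_b.tcl\n")
    shell.stdin.flush()
    shell.communicate()

Each worker would then be started just like func_A in the answers here, e.g. mp.Process(target=run_tcl_job, args=(x, queue, proceed)).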

In addition to roganjosh's answer, I would include some signaling to start step B after step A has finished:

import multiprocessing as mp
import time
import random
import sys

def func_A(process_number, queue, proceed):
    print "Process {} has started been created".format(process_number)

    print "Process {} has ended step A".format(process_number)
    sys.stdout.flush()
    queue.put((process_number, "done"))

    proceed.wait() #wait for the signal to do the second part
    print "Process {} has ended step B".format(process_number)
    sys.stdout.flush()

def multiproc_master():
    queue = mp.Queue()
    proceed = mp.Event()

    processes = [mp.Process(target=func_A, args=(x, queue, proceed)) for x in range(4)]
    for p in processes:
        p.start()

    # block=True waits until there is something available
    results = [queue.get(block=True) for p in processes]
    proceed.set() #set continue-flag
    for p in processes: # wait for all to finish (also on Windows)
        p.join()
    return results

if __name__ == '__main__':
    split_jobs = multiproc_master()
    print split_jobs
RaJa
  • RaJa: Can you explain why there are two joins in the code? – boppu Nov 25 '16 at 13:01
  • Well, that clearly was an error; you only need the last one. I copied the other solution and modified it a bit. I have edited my answer and removed the false join command. – RaJa Nov 25 '16 at 13:08
  • Do not use `.join()` on Windows. http://stackoverflow.com/questions/39896807/multiprocessing-process-does-not-join-when-putting-complex-dictionary-in-return/39899602#39899602 . At best it will cause the processes to execute sequentially, at worst you end up with zombie processes. – roganjosh Nov 25 '16 at 13:10
  • You might be right, but I have never run into problems and my processes clearly run in parallel. I am using Windows and Python 2.7/3. – RaJa Nov 25 '16 at 13:19
  • Roganjosh/RaJa: I got the idea of what you proposed. In func_A, you calculate the sum of squares over a randomly chosen range; that is a simple Python function you tried. In my case, I want to open a "tclsh" shell on a remote machine and source a tcl script. This script will take some time to run, for instance 1 hour. Should I use subprocess to start the tcl shell, and use pipes to read from stdout and stderr? How do I identify that step A has finished? Should I look for some specific word in stdout to identify that step A has finished? – boppu Nov 25 '16 at 13:21