How to make the python code with two for loop run faster(Is there a python way of doing Mathematica's Parallelize)?

Question

I am completely new to python or any such programming language. I have some experience with Mathematica. I have a mathematical problem which though Mathematica solves with her own 'Parallelize' methods but leaves the system quite exhausted after using all the cores! I can barely use the machine during the run. Hence, I was looking for some coding alternative and found python kind of easy to learn and implement. So without further ado, let me tell you the mathematical problem and issues with my python code. As the full code is too long, let me give an outline.

1. Numericall solve a differential equation of the form y''(t) + f(t)y(t)=0, to get y(t) for some range, say C <= t <= D

2.Next, Interpolate the numerical result for some desired range to get the function: w(t), say for A <= t <= B

3. Using w(t), to solve another differential equation of the form z''(t) + [ a + b W(t)] z(t) =0 for some range of a and b, for which I am using the loop.

4. Deine F = 1 + sol1[157], to make a list like {a, b, F}. So let me give a prototype loop as this take most of the computation time.

for q in np.linspace(0.0, 4.0, 100):
    for a in np.linspace(-2.0, 7.0, 100):
        print('Solving for q = {}, a = {}'.format(q,a))
        sol1 = odeint(fun, [1, 0], t, args=( a, q))[..., 0]
        print(t[157])
        F = 1 + sol1[157]                    
        f1.write("{}  {} {} \n".format(q, a, F))            
    f1.close()

Now, the real loop takes about 4 hrs and 30 minutes to complete (With some built-in functional form of w(t), it takes about 2 minute). When, I applied (without properly understanding what it does and how!) numba/autojit before the definition of fun in my code, the run time significantly improved and takes about 2 hrs and 30 minute. Also, writing two loops as itertools/product further reduces the run time by about 2 minutes only! However, Mathematica, when I let her use all the 4 cores, finishes the task within 30 minutes.

So, is there a way to improve the runtime in python?

Did you also try to solve for `w` and `z` simultaneously as a coupled system, so that the interpolation of `w` is removed? The bottleneck IMO is the call from Fortran `lsoda` to the python function `fun`, so optimization tips will largely deal with that function. — Lutz Lehmann, May 12 '17 at 21:55
It is actually not possible to solve w and z simultaneously. w must come from the previous equation and in the oscillatory region, I use it to get the behavior of z for different 'a' and 'b'. To give you a physical picture, w(t) can be a thought of as an external periodic force to a pendulum and z(t) is its displacement. — Archimedes, May 12 '17 at 22:26

score 2 · Accepted Answer · edited May 23 '17 at 12:26

To speed up python, you have three options:

deal with specific bottlenecks in the program (as suggested in @LutzL's comment)
try to speed up the code by compiling it into C using cython (or including C code using weave or similar techniques). Since the time-consuming computations in your case are not in python code proper but in scipy modules (at least I believe they are), this would not help you very much here.
implement multiprocessing as you suggested in your original question. This will speed up your code to up to X (slightly less than) times faster if you have X cores. Unfortunately this is rather complicated in python.

Implementing multiprocessing - example using the prototype loop from the original question

I assume that the computations you do inside the nested loops in your prototype code are actually independent from one another. Since your prototype code is incomplete, I am not sure this is the case, however. Otherwise it will, of course, not work. I will give an example using not your differential equation problem for the fun function but a prototype of the same signature (input and output variables).

import numpy as np
import scipy.integrate
import multiprocessing as mp

def fun(y, t, b, c):
    # replace this function with whatever function you want to work with
    #    (this one is the example function from the scipy docs for odeint)
    theta, omega = y
    dydt = [omega, -b*omega - c*np.sin(theta)]
    return dydt

#definitions of work thread and write thread functions

def run_thread(input_queue, output_queue):
    # run threads will pull tasks from the input_queue, push results into output_queue
    while True:
        try:
            queueitem = input_queue.get(block = False)
            if len(queueitem) == 3:
                a, q, t = queueitem
                sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=( a, q))[..., 0]
                F = 1 + sol1[157]
                output_queue.put((q, a, F))
        except Exception as e:
            print(str(e))
            print("Queue exhausted, terminating")
            break

def write_thread(queue):    
    # write thread will pull results from output_queue, write them to outputfile.txt
    f1 = open("outputfile.txt", "w")
    while True:
        try:
            queueitem = queue.get(block = False)
            if queueitem[0] == "TERMINATE":
                f1.close()
                break
            else:
                q, a, F = queueitem                
                print("{}  {} {} \n".format(q, a, F))            
                f1.write("{}  {} {} \n".format(q, a, F))            
        except:
            # necessary since it will throw an error whenever output_queue is empty
            pass

# define time point sequence            
t = np.linspace(0, 10, 201)

# prepare input and output Queues
mpM = mp.Manager()
input_queue = mpM.Queue()
output_queue = mpM.Queue()

# prepare tasks, collect them in input_queue
for q in np.linspace(0.0, 4.0, 100):
    for a in np.linspace(-2.0, 7.0, 100):
        # Your computations as commented here will now happen in run_threads as defined above and created below
        # print('Solving for q = {}, a = {}'.format(q,a))
        # sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=( a, q))[..., 0]
        # print(t[157])
        # F = 1 + sol1[157]    
        input_tupel = (a, q, t)
        input_queue.put(input_tupel)

# create threads
thread_number = mp.cpu_count()
procs_list = [mp.Process(target = run_thread , args = (input_queue, output_queue)) for i in range(thread_number)]         
write_proc = mp.Process(target = write_thread, args = (output_queue,))

# start threads
for proc in procs_list:
    proc.start()
write_proc.start()

# wait for run_threads to finish
for proc in procs_list:
    proc.join()

# terminate write_thread
output_queue.put(("TERMINATE",))
write_proc.join()

Explanation

We define the individual problems (or rather their parameters) before commencing computation; we collect them in an input Queue.
We define a function (run_thread) that is run in the threads. This function computes individual problems until there are none left in the input Queue; it pushes the results into an output Queue.
We start as many such threads as we have CPUs.
We start an additional thread (write_thread) for collecting the results from the output queue and writing them into a file.

Caveats

For smaller problems, you can run multiprocessing without Queues. However, if the number of individual computations is large, you will exceed the maximum number of threads the kernel will allow you after which the kernel kills your program.
There are differences between different operating systems for how multiprocessing works. The example above will work on Linux (perhaps also on other Unix like systems such as Mac and BSD), not on Windows. The reason is that Windows does not have a fork() system call. (I do not have access to a Windows, can therefore not try to implement it for Windows.)

If the bindings for the library he uses release the GIL, threads are an option as well. — spectras, May 12 '17 at 23:23
Thank you a lot. Your answer does solve the question I asked so I am accepting it as an answer but the problem still persists! It really 'Parallelize' the code, however, now the runtime is even greater. Maybe, the problem was not in the loop but in the interpolation as when I use a built-in function for w, it is a lot faster.So the question is **how to interpolate effectively in python?** Can you help me in this regard? — Archimedes, May 13 '17 at 10:09
@Archimedes It seems strange that the computation time should be longer after parallelizing it. In any case, in order to try to speed up certain parts of the computation (interpolating for instance, as you suggested), you may have to share the code of the computation. Also, I think, this might be better placed in a new question. — 0range, May 14 '17 at 12:19
@Orange Yes, it is strange indeed. May be there is something I missed in implementing the palatalizing method to my full code. As I am a novice in coding, I need some time to realize each step and to correctly implement the method you have shown. However, the code is running really fast after I have replaced interp1d with UnivariateSpline:-) — Archimedes, May 15 '17 at 11:25

How to make the python code with two for loop run faster(Is there a python way of doing Mathematica's Parallelize)?

1 Answers1

Linked