
I am trying to solve a non-linear optimization using the PyGMO package. The optimization class is defined separately and then called via a separate function, dyn_optimGMO. This optimization has to be run and its result saved for, say, 1000 random initial vectors over the nodes, defined by the variable inits (or init_val).

Using the timeit module I found that each iteration takes approximately 17 seconds to complete. This means it will take approximately 5 hours for 1000 iterations, which is far too long.

If I have to repeat this for, say, 20 perturbed nodes, the total goes up to 20,000 iterations, and the run time scales linearly with the figure above (roughly 100 hours).

I tried to address this by using the Python multiprocessing module to parallelize each set of 1000 iterations for each of the 20 perturbed nodes, but that doesn't help.

I also tried using Numba jit functions, but they don't recognize PyGMO modules and hence fail.

Is there any way to parallelize this code and make it faster for any number of iterations?

Please let me know if my question is clear enough; if not, I will add details as required.

import random  # random.uniform is used for the initial vectors below
import numpy as np
import pygmo as pg

matL = np.random.rand(300,300) ; node_len = 300

inits = []; results = []


perturb = {25:0} #setting a random node, say, node 25 to 0

class my_constrained_udp:
    
    def __init__(self):
        pass
    
    def fitness(self, x):
        matA = np.matrix(x)
        obj1 = matA.dot(matL).dot(matA.T)[0,0]  # scalar objective value
        ce1 = sum(init_val) - sum(x)  # equality constraint; uses the global init_val
        return [obj1, ce1]
   
    def get_nobj(self): # number of objectives (pygmo looks for get_nobj)
        return 1


    def get_nec(self): #no of equality constraints
        return 1   

 
    def get_nic(self): #no of in-equality constraints
        return 0                    


    def get_bounds(self): #lower and upper bounds: use this to perturb the node
        lowerB = np.array([0]*node_len); upperB = np.array([1]*node_len)
        if perturb:
            for k,v in perturb.items():
                lowerB[k] = v; upperB[k] = v
        return (lowerB,upperB)

  
    def gradient(self, x):
        return pg.estimate_gradient_h(lambda x: self.fitness(x), x)


def dyn_optimGMO(matL, node_len, init_val):  # init_val is the same array the fitness reads as a global
        
    if perturb:
        for k,v in perturb.items(): init_val[k] = v  #setting perturbations in initial values
    
    inner_algo = pg.nlopt("slsqp"); inner_algo.maxeval = 5
    
    algo = pg.algorithm(uda = pg.mbh(inner_algo, stop = 2, perturb = .2))
    #algo.set_verbosity(10) # in this case this correspond to logs each 1 call to slsqp
    pop = pg.population(prob = my_constrained_udp(), size = 1000 , seed=123)
    pop.problem.c_tol = [1E-6] * 1 # get_nec + get_nic = 1, so multiplied by 1
    pop = algo.evolve(pop) 
   
    res = pop.champion_x   
    return res

# running above optimization code for 1000 random initializations

for i in range(1000):
    init_val = np.array([random.uniform(0, 1) for k in range(node_len)])
    
    if perturb:
        for k,v in perturb.items(): init_val[k] = v  #setting perturbations in initial values
    
    res = dyn_optimGMO(matL, node_len, init_val)  # defined above
    
    inits.append(init_val); results.append(res)

EDIT 1:

As suggested below by @Ananda, I made the changes to the objective function, which reduced the running time by almost 7x. I have rewritten the code to run it over 1000 iterations using the Python multiprocessing module. Below is my new code, where I try to spawn processes to handle the iterations in parallel. Since my system has only 8 threads, I limited the Pool size to 5, because PyGMO uses internal parallelization as well and needs some threads of its own.

import random  # random.uniform is used for the initial vectors below
import numpy as np
import pygmo as pg


matL = np.random.rand(300,300) ; node_len = 300

perturb = {12:1} # assign your perturb ID here

def optimizationFN(var):

    results = []
    
    inits = var[0]; perturb = var[1]

    
    class my_constrained_udp:
        
        def fitness(self, x):
            obj1 = x[None,:] @ matL @ x[:,None] # @ is mat multiplication operator
            ce1 = np.sum(inits) - np.sum(x)                   
            return [obj1, ce1]
       
        def get_nobj(self): # number of objectives (pygmo looks for get_nobj)
            return 1
        
        def get_nec(self): #no of equality constraints
            return 1    
        
        def get_nic(self): #no of in-equality constraints
            return 0                    
        
        def get_bounds(self): #lower and upper bounds: use this to perturb the node
            lowerB = np.array([0]*node_len); upperB = np.array([1]*node_len)
            if perturb:
                for k,v in perturb.items():
                    lowerB[k] = v; upperB[k] = v
            return (lowerB,upperB)
        
        def gradient(self, x):
            return pg.estimate_gradient_h(lambda x: self.fitness(x), x)
    
    def dyn_optimGMO(matL, node_len ,inits):
        '''
        perturb should be a dict of node index and value as 0 or 1. Type(node_index) = int
        '''  
        if perturb:
            for k,v in perturb.items(): inits[k] = v  #setting perturbations in initial values
            
        inner_algo = pg.nlopt("slsqp"); inner_algo.maxeval = 5
        
        algo = pg.algorithm(uda = pg.mbh(inner_algo, stop = 2, perturb = .2))
        
        #algo.set_verbosity(10) # in this case this correspond to logs each 1 call to slsqp
        
        pop = pg.population(prob = my_constrained_udp(), size = 100 , seed=123)
        
        pop.problem.c_tol = [1E-6] * 1 # get_nec + get_nic = 1, so multiplied by 1
        pop = algo.evolve(pop) 
       
        res = pop.champion_x   
        return res
    
    
    if perturb:
        for k,v in perturb.items(): inits[k] = v  #setting perturbations in initial values
    
    res = dyn_optimGMO(matL, node_len, inits)  # defined above
    
    results.append(res)
    
    return results

import time

st = time.time()
    
#1000 random initialisations
initial_vals = []
for i in range(1000): initial_vals.append(np.array([random.uniform(0, 1) for k in range(node_len)]))
initial_vals = np.array(initial_vals)

inp_val = []
for i in range(len(initial_vals)): inp_val.append([initial_vals[i],perturb])

#running evaluations
#eqVal = optimizationFN(initial_vals,perturb=perturb)
from multiprocessing import Pool


myPool = Pool(8)

data = myPool.map(optimizationFN,inp_val)

myPool.close(); myPool.join()


print('Total Time: ',round(time.time()-st,4))

This executes the entire 1000 iterations in 1.13 hours.

Still, is there any other way to make it even faster?

Ashutosh
  • CUDA is completely irrelevant to your question and I have removed the tag – talonmies Dec 27 '20 at 04:39
  • The code has bugs from what I can tell, please post a working code. For me, it's throwing error at `for k, v in perturb.items(): init_val[k] = v` – Ananda Dec 27 '20 at 04:47
  • @Ananda done the edits. Now it should be running. – Ashutosh Dec 27 '20 at 04:57
  • You're asking for a code review - that's not really what StackOverflow is for. If you have a specific idea on how to improve this, and you've tried but run into trouble, please describe the specific problem and ask for advice on that. – Grismar Dec 27 '20 at 05:06
  • @Grismar I am not asking for a code review. I have specifically mentioned that I can run the above code but can't find a way to make it run fast or parallelize it further. – Ashutosh Dec 27 '20 at 05:24
  • You're asking for others to "find a way to make it run fast or parallelize it further" - i.e., a code review? What specific problem do you need solved? What did you try? What is the problem you're having with it? – Grismar Dec 27 '20 at 10:51

2 Answers


Before trying to parallelize, try to figure out what exactly the performance bottleneck is and fix that first.

If you profile your fitness function using line profiler,

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    28                                               @profile
    29                                               def fitness(self, x):
    30     96105    3577978.0     37.2      9.5          matA = np.matrix(x)
    31     96105    8548353.0     88.9     22.7          obj1 = matA.dot(matL).dot(matA.T)[0,0] #this is int value
    32     96105   25328835.0    263.6     67.4          ce1 = sum(init_val) - sum(x)
    33     96105     121800.0      1.3      0.3          return [obj1, ce1]
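
For reference, a listing like the one above comes from line_profiler: decorate the function with @profile and run the script through kernprof, for example

kernprof -l -v your_script.py

where your_script.py stands for whichever file holds the code.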

As you can see, most of the time is spent in the dot and sum calls, with a significant share also spent creating matA.

I would rewrite the function like so -

def fitness(self, x):

    obj1 = x[None, :] @ matL @ x[:, None]
    ce1 = np.sum(init_val) - np.sum(x)

    return [obj1, ce1]

If you profile this function, you can see:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                               @profile
    21                                               def fitness(self, x):
    22                                           
    23     77084    3151649.0     40.9     48.9          obj1 = x[None, :] @ matL @ x[:, None]
    24     77084    3214012.0     41.7     49.9          ce1 = np.sum(init_val) - np.sum(x)
    25                                           
    26     77084      79439.0      1.0      1.2          return [obj1, ce1]

The total time per hit for the full function went down from around 380 to 80.

The np.matrix class is no longer recommended and is slated for deprecation, and using the native Python sum instead of np.sum can hurt performance a lot.

On my machine, this takes it from about 33 seconds per iteration down to about 6 seconds per iteration, roughly a 5x gain in performance.
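
If you want to double-check the effect on your own machine, below is a small, self-contained timeit sketch comparing the two versions of the fitness function. It is only an illustration: the array sizes mirror the question's node_len = 300, the fitness_old/fitness_new names exist only for the comparison, and the exact timings will differ from machine to machine.

import timeit
import numpy as np

node_len = 300
matL = np.random.rand(node_len, node_len)
x = np.random.rand(node_len)
init_val = np.random.rand(node_len)

def fitness_old(x):
    matA = np.matrix(x)                      # np.matrix wrapper (pending deprecation)
    obj1 = matA.dot(matL).dot(matA.T)[0, 0]
    ce1 = sum(init_val) - sum(x)             # built-in sum iterates element by element
    return [obj1, ce1]

def fitness_new(x):
    obj1 = x[None, :] @ matL @ x[:, None]    # plain ndarray matrix multiplication
    ce1 = np.sum(init_val) - np.sum(x)       # vectorised reductions
    return [obj1, ce1]

print("old:", timeit.timeit(lambda: fitness_old(x), number=1000))
print("new:", timeit.timeit(lambda: fitness_new(x), number=1000))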

Ananda
  • @Naveen I am using line_profiler https://github.com/rkern/line_profiler – Ananda Dec 27 '20 at 11:34
  • This actually improves the running time a lot, almost 7x faster on my machine. LineProfiler is a great tool. However, can you also suggest a way to parallelize the above execution to make it even faster? – Ashutosh Dec 27 '20 at 18:23
  • I doubt it, if you want to push it even more than this, you would probably move away from Python and use C++ or something. Multiprocessing + making sure that things are properly vectorised is probably the best you can do in python as far as I am aware. – Ananda Dec 28 '20 at 03:13

Q : "Is there any way to parallelize this code and make it faster for any number of iterations?"

Yes. Given that the attempt to numba.jit() the code "from outside" failed (for the reasons the Numba compilation warnings point out), you can instead distribute parts of the batch of roughly 1k+ independent initialisations, compute them in parallel, and collect the results afterwards.

The benefit of this is potentially a ~1000x gain in performance, and it scales further.

If you're using a university cluster of about 1k+ nodes, your batch of 1k computations could yield results in about the same time a solo run would need for just the first of the 1k-long sequence (communication costs are negligible here; see Amdahl's argument).
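
As a rough sketch of that idea applied to the question's EDIT 1 code, the batch could be farmed out with dask.distributed instead of a local multiprocessing.Pool. The scheduler address below is a placeholder, and it is assumed that a Dask cluster is already running and that optimizationFN and inp_val from the question are picklable/importable on the workers:

from dask.distributed import Client

# One task per initial vector; each worker runs one full optimization.
client = Client("tcp://scheduler-address:8786")  # placeholder address of a running Dask scheduler

futures = client.map(optimizationFN, inp_val)    # submit the whole batch
data = client.gather(futures)                    # collect the champion_x results

client.close()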

user3666197
  • I haven't voted, but I wonder if the downvotes are for the reasons that people have been telling you for a while: this is grandiloquent writing that does not say very much, and features an excessive application of formatting that makes posts harder to read. – halfer Dec 28 '20 at 11:31
  • Would you mind explaining, @halfer, how you arrived at the conclusion that it "does not say very much"? It did. It a) showed why the numba-related expectations failed & will fail again, and b) directed the user to the one and only viable direction to indeed boost end-to-end performance (using distributed computing, where resources work on so-called "embarrassingly parallel" loads), which no other approach on hardware architectures present in 2020/Q4 is anywhere near capable of. Your previous pieces of advice were respected. Occam's razor leads to saying the maximum in the least amount of words, doesn't it? – user3666197 Dec 28 '20 at 12:32
  • I have made an edit to suggest how this answer could be made easier on the eye, and thus easier to read. Unnecessary words have been trimmed, which makes it easier to focus on the core point. – halfer Dec 28 '20 at 13:05