
At the company where I am interning, I was told about the use of multi-core programming, and it seems relevant to a project I am developing for my thesis (I'm not from this area, but I'm working on something that involves coding).

I want to know if this is possible:
I have a function that will be run 3 times for 3 different variables. Is it possible to run all 3 at the same time on different cores (since they don't need each other's information)? The calculation process is the same for all of them, so instead of running 1 variable at a time, I would like to run all 3 at once (performing all the calculations at the same time) and return the results at the end.

Some part of what I would like to optimize:

import numpy as np   # (Calculos and Ponto are project-specific helpers, not shown)
for v in [obj2_v1, obj2_v2, obj2_v3]:
    distancia_final_v,       \
    pontos_intersecao_final_v = calculo_vertice( obj1_normal,
                                                 obj1_v1,
                                                 obj1_v2,
                                                 obj1_v3,
                                                 obj2_normal,
                                                 v,
                                                 criterio
                                                 )

def calculo_vertice( obj1_normal,
                     obj1_v1,
                     obj1_v2,
                     obj1_v3,
                     obj2_normal,
                     obj2_v,
                     criterio
                     ):
    i = 0
    distancia_final_v = []
    pontos_intersecao_final_v = []

    while i < len(obj2_v):

        distancia_relevante_v = []
        pontos_intersecao_v   = []
        distancia_inicial     = 1000

        for x in range(len(obj1_v1)):

            planeNormal = np.array( [obj1_normal[x][0],
                                     obj1_normal[x][1],
                                     obj1_normal[x][2]
                                     ] )
            planePoint  = np.array( [    obj1_v1[x][0],
                                         obj1_v1[x][1],
                                         obj1_v1[x][2]
                                         ] )          # Any point on the plane
            rayDirection = np.array([obj2_normal[i][0],
                                     obj2_normal[i][1],
                                     obj2_normal[i][2]
                                     ] )              # Define a ray
            rayPoint     = np.array([     obj2_v[i][0],
                                          obj2_v[i][1],
                                          obj2_v[i][2]
                                          ] )         # Any point along the ray

            Psi = Calculos.line_plane_collision( planeNormal,
                                                 planePoint,
                                                 rayDirection,
                                                 rayPoint
                                                 )

            a   = Calculos.area_trianglo_3d( obj1_v1[x][0],
                                             obj1_v1[x][1],
                                             obj1_v1[x][2],
                                             obj1_v2[x][0],
                                             obj1_v2[x][1],
                                             obj1_v2[x][2],
                                             obj1_v3[x][0],
                                             obj1_v3[x][1],
                                             obj1_v3[x][2]
                                             )
            b   = Calculos.area_trianglo_3d( obj1_v1[x][0],
                                             obj1_v1[x][1],
                                             obj1_v1[x][2],
                                             obj1_v2[x][0],
                                             obj1_v2[x][1],
                                             obj1_v2[x][2],
                                             Psi[0][0],
                                             Psi[0][1],
                                             Psi[0][2]
                                             )
            c   = Calculos.area_trianglo_3d( obj1_v1[x][0],
                                             obj1_v1[x][1],
                                             obj1_v1[x][2], 
                                             obj1_v3[x][0],
                                             obj1_v3[x][1],
                                             obj1_v3[x][2],
                                             Psi[0][0],
                                             Psi[0][1],
                                             Psi[0][2]
                                             )
            d   = Calculos.area_trianglo_3d( obj1_v2[x][0],
                                             obj1_v2[x][1],
                                             obj1_v2[x][2],
                                             obj1_v3[x][0],
                                             obj1_v3[x][1],
                                             obj1_v3[x][2],
                                             Psi[0][0],
                                             Psi[0][1],
                                             Psi[0][2]
                                             )

            if float("{:.5f}".format(a)) == float("{:.5f}".format(b + c + d)):

                P1 = Ponto(    Psi[0][0],    Psi[0][1],    Psi[0][2] )
                P2 = Ponto( obj2_v[i][0], obj2_v[i][1], obj2_v[i][2] )

                distancia = Calculos.distancia_pontos( P1, P2 ) * 10

                if distancia < distancia_inicial and distancia < criterio:
                    distancia_inicial     = distancia
                    distancia_relevante_v = []
                    distancia_relevante_v.append( distancia_inicial )
                    pontos_intersecao_v   = []
                    pontos_intersecao_v.append( Psi )


        distancia_final_v.append( distancia_relevante_v )
        pontos_intersecao_final_v.append( pontos_intersecao_v )

        i += 1

    return distancia_final_v, pontos_intersecao_final_v

In this example of my code, I want to make the same process happen for obj2_v1, obj2_v2, obj2_v3.

Is it possible to make them happen at the same time?

Because I will be using a considerable amount of data, this would probably save me some processing time.

Pedro
  • yes, it is possible with modules like `threading` – pyOliv May 25 '20 at 14:49
  • Sounds like you want to do `multiprocessing` – TYZ May 25 '20 at 14:56
  • You've indicated that ***"I will be using a considerable amount of data"***, so the root cause of the problems here is inefficient instructions - you may start from performance blockers like this: ***`if float( "{:.5f}".format( a ) ) == float( "{:.5f}".format( b + c + d ) )`*** & avoid other obvious anti-patterns - like assigning an empty list ( `= []` ) on the very line above an instruction to `.append()` a new value to the just-assigned list instance. Multicore programming has nothing to do with this problem - as described, it's embarrassingly parallel; you may use the O/S GNU `parallel` & run 3 jobs, one per data set – user3666197 May 25 '20 at 17:17

3 Answers


It's possible, but use the Python multiprocessing lib, because the threading lib doesn't deliver parallel execution.

UPDATE

DON'T do something like this (thanks to @user3666197 for pointing out the error):

from multiprocessing.pool import ThreadPool

def calculo_vertice(obj1_normal,obj1_v1,obj1_v2,obj1_v3,obj2_normal,obj2_v,criterio):
      #your code
      return distancia_final_v,pontos_intersecao_final_v

pool = ThreadPool(processes=3)
async_result1 = pool.apply_async(calculo_vertice, ())  # your args here
async_result2 = pool.apply_async(calculo_vertice, ())  # your args here
async_result3 = pool.apply_async(calculo_vertice, ())  # your args here

result1 = async_result1.get()  # result1
result2 = async_result2.get()  # result2
result3 = async_result3.get()  # result3

Instead, something like this should do the job:

from multiprocessing import Process, Pipe

def calculo_vertice(obj1_normal,obj1_v1,obj1_v2,obj1_v3,obj2_normal,obj2_v,criterio, send_end):
      #your code
      send_end.send((distancia_final_v,pontos_intersecao_final_v))

numberOfWorkers = 3
jobs = []
pipeList = []

#Start the processes and build the job list
for i in range(numberOfWorkers):
    recv_end, send_end = Pipe(False)
    process = Process(target=calculo_vertice, args=(..., send_end))  # <... your args ...> go before send_end
    jobs.append(process)
    pipeList.append(recv_end)
    process.start()

#Wait for the workers and collect the results
#(note: with very large results, call recv() before join(), otherwise the
# sender can block on a full pipe)
for job in jobs: job.join()
resultList = [x.recv() for x in pipeList]
print (resultList)

REF.

https://docs.python.org/3/library/multiprocessing.html
https://stackoverflow.com/a/37737985/8738174

This code will start 3 worker processes, each running the function independently and sending its result back through a pipe. It's important to point out that you should have 3+ CPU cores for this; otherwise your system kernel will just switch between the processes and things won't really run in parallel.
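For completeness, the same fan-out can also be written with the standard library's concurrent.futures; this is a minimal sketch, assuming the original return-based calculo_vertice from the question, with args1/args2/args3 standing in for the three argument tuples:

from concurrent.futures import ProcessPoolExecutor

# args1, args2, args3 are placeholders for the three argument tuples
# (obj1_normal, obj1_v1, obj1_v2, obj1_v3, obj2_normal, obj2_vX, criterio);
# on spawn-based platforms run this under an `if __name__ == '__main__':` guard
with ProcessPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(calculo_vertice, *args)
               for args in (args1, args2, args3)]
    results = [f.result() for f in futures]  # three (distancia, pontos) pairs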

Henrique
  • With all due respect, could you explain what benefit you expect from a GIL-lock (still) re-**`[SERIAL]`**-ised flow of execution? The original "amount" of work will get executed from -3- threads, yet at no faster pace, as each thread, before it may execute a small amount of its own work, has first to grab the (still) central GIL-lock, and has to wait idle until it indeed gets it POSACK-ed -- i.e. your proposed strategy means that python will spend more work on GIL-thrashing, yet having to do the same "amount" of work, still with not more than -1- and only -1- CPU-core working at a time? – user3666197 May 25 '20 at 15:33
  • you are right, I'll update my answer using another approach. – Henrique May 25 '20 at 16:28

multiprocessing (using processes to avoid the GIL) is the easiest approach, but you're limited to relatively small performance improvements; the number of cores bounds the speedup, see Amdahl's law. there's also a bit of latency involved in starting/stopping workers, which means it's much better for things that take >10ms

in numeric-heavy code (like this seems to be) you really want to be moving as much of it as possible "inside numpy"; look at vectorisation and broadcasting. this can give speedups of >50x (just on a single core) while staying easier to understand and reason about. see the sketch below
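for instance, a hypothetical sketch of the idea (not the question's Calculos.line_plane_collision, just the same maths written over whole arrays): intersect one ray with all N planes in a single shot, so the inner for x in range(len(obj1_v1)) loop collapses into a few array operations:

import numpy as np

def line_plane_collision_batch(plane_normals, plane_points, ray_dir, ray_point):
    # plane_normals, plane_points: (N, 3) arrays; ray_dir, ray_point: (3,) arrays
    denom = plane_normals @ ray_dir                          # (N,) dot products n.d
    t = np.einsum('ij,ij->i', plane_normals,
                  plane_points - ray_point) / denom          # (N,) ray parameters
    return ray_point + t[:, None] * ray_dir                  # (N, 3) intersection points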

if your algorithm is difficult to express using numpy intrinsics then you could also look at using Cython. this allows you to write Python-like code that gets compiled down to C, and hence runs a lot faster. 50x is probably also a reasonable speedup, and this is still running on a single core

the numpy and Cython techniques can be combined with multiprocessing (i.e. using multiple cores) to give code that runs hundreds of times faster than naive implementations

Jupyter notebooks have friendly extensions (known affectionately as "magics") that make it easier to get started with this sort of performance work. the %timeit magic lets you easily time parts of the code, and the Cython extension means you can put everything into the same notebook. for example:
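a minimal sketch, assuming the question's variables already exist in the notebook (each snippet is its own cell; %%cython must start a cell and needs Cython installed):

%load_ext Cython   # enables the %%cython cell magic

%timeit calculo_vertice(obj1_normal, obj1_v1, obj1_v2, obj1_v3, obj2_normal, obj2_v1, criterio)

%%cython
# a cell starting with %%cython is compiled to C before it runs
def soma_rapida(double a, double b):
    return a + b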

Sam Mason
  • With all due respect, the speedups from the original Amdahl's Law are nonsensical in the realms of python computing - here you have to use a **revised, *overhead-strict*** re-formulation of the original, ***overhead-naive*** Amdahl's Law - for details see: https://stackoverflow.com/revisions/18374629/3 **+** https://stackoverflow.com/questions/60128189/poor-scaling-of-multiprocessing-pool-map-on-a-list-of-large-objects-how-to-ac/60427809#60427809 – user3666197 May 25 '20 at 16:47

Q : " Is it possible to make them happen at the same time? "

Yes.

The best results will be achieved if not adding any python at all ( the multiprocessing module is not necessary for launching 3 full copies ( yes, top-down fully replicated copies ) of the __main__ python process for this so embarrassingly independent processing ).

The reasons for this are explained in detail here and here.

A just-enough tool is GNU parallel :

$ parallel --jobs 3 python job-script.py {} ::: "v1" "v2" "v3"

For all performance-tweaking, read about more configuration details in man parallel.
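A minimal sketch of what such a job-script.py could look like follows - carregar_dados and the meu_projeto module are hypothetical stand-ins for however the project actually loads its arrays:

# job-script.py -- GNU parallel substitutes {} with "v1" / "v2" / "v3"
import sys
from meu_projeto import calculo_vertice, carregar_dados   # hypothetical helpers

dados = carregar_dados()                      # obj1/obj2 arrays + criterio
obj2_v = dados['obj2_' + sys.argv[1]]         # picks obj2_v1 / obj2_v2 / obj2_v3
print(calculo_vertice(dados['obj1_normal'], dados['obj1_v1'], dados['obj1_v2'],
                      dados['obj1_v3'], dados['obj2_normal'], obj2_v,
                      dados['criterio']))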


"Because I will be using a considerable amount of data..."


The Devil is hidden in the details :

The O/P code may be syntactically driving the python interpreter to results, precise ( approximate ) to some 5 decimal places, yet its core sin is its ultimately bad chance to demonstrate any reasonably achievable performance in doing that, the worse on a "considerable amount of data".

If they, "at the company", expect some "considerable amount of data", you should do at least some elementary research on what the processing is aimed at.

The worst part ( not to mention the decomposition of the once vectorised-ready numpy-arrays back into atomic "float" coordinate values ) is the point-inside-triangle test.

For a brief analysis of how to speed up this part ( the more so if going to pour a "considerable amount of data" onto it ), get inspired by this post and get the job done in a fraction of the time the O/P version above needs.

Indirectly testing point-inside-triangle membership by comparing a pair of re-float()-ed strings, built from the sums of triangle areas ( b + c + d ), is just one of the performance blockers you will find to remove - see the sketch below.
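A minimal sketch of that one fix, assuming the a, b, c, d areas from the O/P loop ( the atol value is an assumption, matching the 5 decimal places used above ):

import numpy as np

def ponto_dentro_triangulo(a, b, c, d, atol=1e-5):
    # numeric tolerance test instead of comparing "{:.5f}"-formatted strings
    return np.isclose(a, b + c + d, atol=atol)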

user3666197