
I am running a numerical simulation of 50,000 particles that crystallize in a box with periodic boundary conditions; several crystals form, and I try to find the biggest crystal in the box. The code works well, but it takes about 100 s on my computer. The rate-determining step is the function find_neighbors() below.

Is there a way to use parallel computing to shorten the computational time?

My computer has 12 cores. Maybe we can divide the array (center) into 12 parts?

The output of my function is a 2-D list.
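For example (with made-up indices), if particle 5 has neighbors 11 and 17, the output looks like this:

```python
# Illustration only, with made-up indices:
Pos_3d = [[] for _ in range(18)]  # one (possibly empty) list per particle
Pos_3d[5]  = [11, 17]             # particle 5 neighbors particles 11 and 17
Pos_3d[11] = [5]
Pos_3d[17] = [5]
```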

Here is the find_neighbors() function, written without parallel computing:

# Calculate if two particles are neighbors (Periodic Boundary Condition)
#              center is the coordinates of the particles
#              d      is the squared distance threshold for two particles to be neighbors
#              Maxima_Range and
#              Lattice_Range are for periodic boundary conditions
#              Pos_3d is a list.
# Suppose particle i has neighbors j,k,l.
#              Pos_3d[i] = [j,k,l]

import numpy as np

def find_neighbors(center,d,Maxima_Range,Lattice_Range):
    a,b = np.shape(center)          # get the dimensions of the array
    flag_3d = np.zeros(a)           # counts the number of neighbors of particle i
    Pos_3d = [[] for i in range(a)] # Pos_3d is a two-dimensional list:
    #                               # Pos_3d[i] holds the indices of all the neighbors of particle i

    diff_sqrt = d ** 0.5
    for i in range(a):
        if i % 6 == 5:              # for some reason, when i % 6 == 5,
            #                       #     it is the center coordinates,
            #                       #     this cannot be changed.
            for j in range(i+1,a):
                if j % 6 == 5:

                    diff_x = center[i][0]-center[j][0]
                    diff_y = center[i][1]-center[j][1]
                    diff_z = center[i][2]-center[j][2]

                    if (   diff_x <  diff_sqrt
                       and diff_y <  diff_sqrt
                       and diff_z <  diff_sqrt
                       and diff_x > -diff_sqrt
                       and diff_y > -diff_sqrt
                       and diff_z > -diff_sqrt ):
                           #        # if one of the x,y,z components is already
                           #        #    bigger than sqrt(d), we don't need
                           #        #    to calculate their distance in 3-D;
                           #        #    they are not neighbors.

                        diff = ( diff_x * diff_x
                               + diff_y * diff_y
                               + diff_z * diff_z
                                 )

                        if diff <= d: # without period boundary condition
                            flag_3d[i] +=1
                            flag_3d[j] +=1
                            Pos_3d[i].append(j)
                            Pos_3d[j].append(i)
                            continue
                            
                       
                        # With periodic boundary condition
                        # in x

                        do something..

    return Pos_3d
hmx0979
    Hello, it looks like you are using `numpy`. You should use the built-in functions of this module to compute vector calculations. If that is not enough, what part of the code could be executed in parallel? – pyOliv May 24 '20 at 04:13

1 Answer


Q : "Is there a way to use parallel computing to shorten the computational time?"

Yes, there is.

Yet, before anyone starts trying that, first repair your original code so as to avoid any and all avoidable overheads :

Best avoid lists and list-appending; prefer numpy's vectorisable, efficient data structures :

aVectorOfDIFFs = center[i] - center[j] # aVectorOfDIFFs.shape  ( 3, ) # _x, _y, _z
if d >= np.dot( aVectorOfDIFFs,        # diff == ( ( _x * _x )
                aVectorOfDIFFs.T       #         + ( _y * _y )
                ):                     #         + ( _z * _z ) )
    # makes sense to do anything ... with ( i, j )
    ...

Besides a smart top-down re-factoring of the code into the best-performing numpy-vectorised processing ( we can turn to that later, if interested in ultimate performance ), the plain tandem of the original for-loops is awfully inefficient: Python has been known for years to be sluggish in plain loop iterations, and that is where most of the time is spent.

It can run ( as-is ) ~ 20x faster :

>>> from zmq import Stopwatch; aClk = Stopwatch()

>>> def aPairOfNaiveFORs( a = 1E6 ):
...     intA = int( a )                        #           a gets fused into (int)
...     for            i in range(      intA ):# original, keep looping all i-s
...         if         i % 6 == 5:             #        if i-th "center":
...             for    j in range( i+1, intA ):#           keep looping all j-s
...                 if j % 6 == 5:             #           if j-th "center":
...                     # NOP,                 #              do some useful work
...                     # not important here   #
...                     # for the testing, the overhead comparison matters
...                     continue               #               loop next
...     return


>>> aClk.start(); _ = aPairOfNaiveFORs( 1E3 ); aClk.stop()
   14770 [us] for 1E3 ~  13+ [us] per one item of a[:1000]
   14248
   13430
   13252

>>> aClk.start(); _ = aPairOfNaiveFORs( 1E5 ); aClk.stop()
64559866 [us] for 1E5 ~ 645+ [us] per one item of a[:10000]

Now, avoid all the absurd loop iterations whose index does not meet 5 == <index> modulo 6 :

>>> def aPairOfLessNaiveFORs( a = 1E6 ):
...     intA = int( a )
...     for     i in range (  5, intA, 6 ):    # yes, start with the 5-th item of the first matching "center"
...         for j in range( i+6, intA, 6 ):    # yes, keep matching ( modulo 6 ) steps
...               continue

The test shows ~ 13x ~ 17x FASTER code ( just by avoiding 5/6 of the loop iterations ) :

>>> aClk.start(); _ = aPairOfLessNaiveFORs( 1E3 ); aClk.stop()
    1171 [us] for 1E3 ~  1 [us] per one item of a[:1000]
     976
    1157
    1181
    1185

>>> aClk.start(); _ = aPairOfLessNaiveFORs( 1E5 ); aClk.stop()
 3725167 [us] for 1E5 ~ 37 [us] per one item of a[:10000]

SUMMARY :

Just this principled engineering of overhead avoidance will shorten the looping for processing at scales of about 5E5 crystal centers ( i.e. 3E6 coordinates ) :

>>> aClk.start(); _ = aPairOfLessNaiveFORs( 5E5 ); aClk.stop()
 100104334 [us] for 5E5 ~ 17.6 x FASTER
>>> aClk.start(); _ =     aPairOfNaiveFORs( 5E5 ); aClk.stop()
1759420300 [us] for 5E5 ~ ^^^^^^^^^^^^^

Even then, the code need not get any "positive" boost from going into a parallel computing model ( remember the corrected Amdahl's Law, where the add-on overheads of doing so must not be omitted or ignored ).
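If, after these repairs, one still wants to try the 12-core idea from the question, a minimal multiprocessing sketch ( assuming, as above, a squared-distance threshold d and ignoring the periodic-boundary branch; the helper name _pairs_for_chunk is hypothetical ) could look like this :

```python
import numpy as np
from multiprocessing import Pool

def _pairs_for_chunk( args ):                 # hypothetical helper, top-level so it pickles
    i_list, center, d = args
    n, pairs = len( center ), []
    for i in i_list:                          # scan this worker's share of outer indices
        for j in range( i + 6, n, 6 ):
            diff = center[i] - center[j]
            if np.dot( diff, diff ) <= d:     # squared distance against threshold d
                pairs.append( ( i, j ) )
    return pairs

def find_neighbors_parallel( center, d, n_workers = 12 ):
    centre_idx = list( range( 5, len( center ), 6 ) )
    chunks = [ centre_idx[k::n_workers]              # round-robin split balances the
               for k in range( n_workers ) ]         #     triangular ( i, j ) workload
    with Pool( n_workers ) as pool:
        results = pool.map( _pairs_for_chunk,
                            [ ( c, center, d ) for c in chunks ] )
    Pos_3d = [ [] for _ in range( len( center ) ) ]
    for pairs in results:                     # merge per-worker pair lists
        for i, j in pairs:
            Pos_3d[i].append( j )
            Pos_3d[j].append( i )
    return Pos_3d
```

Each worker receives its own copy of center, so the serialisation cost for 50,000 × 3 floats is modest, but for the tiny per-pair work above the inter-process overheads can easily eat the whole speed-up — measure before trusting it.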

user3666197