I have written the following NumPy code in Python:

import numpy as np


def inbox_(points, polygon):
    """ Finding points in a region """

    ll = np.amin(polygon, axis=0)                        # lower limit
    ur = np.amax(polygon, axis=0)                        # upper limit

    in_idx = np.all(np.logical_and(ll <= points, points < ur), axis=1)        # points in the range [boolean]

    return in_idx


def operation_(r, gap, ends_ind):
    """ calculation formula which is applied on the points specified by inbox_ function """

    r_active = np.take(r, ends_ind)                       # taking values from "r" based on indices and shape (paired_values) of "ends_ind"
    r_sub = np.subtract.reduce(r_active, axis=1)          # subtracting each paired "r" determined by "ends_ind" [this line will be used in the final return formula]
    r_add = np.add.reduce(r_active, axis=1)               # adding each paired "r" determined by "ends_ind" [this line will be used in the final return formula]
    paired_cent_dis = np.sum((r_add, gap), axis=0)        # center distance of each pair of points

    return (np.power(gap, 2) * (np.power(paired_cent_dis, 2) + 5 * paired_cent_dis * r_add - 7 * np.power(r_sub, 2))) / (3 * paired_cent_dis)      # Formula



def elapses(r, pos, gap, ends_ind, elem_vert, contact_poss):
    if len(gap) > 0:
        elaps = np.empty([len(elem_vert), ], dtype=object)
        operate_ = operation_(r, gap, ends_ind)

        #elbav = np.empty([len(elem_vert), ], dtype=object)
        #con_num = 0

        for i, j in enumerate(elem_vert):                        # loop for each section (cell or region) of a mesh
            in_bool = inbox_(contact_poss, j)                    # getting boolean array for points within that section
            elaps[i] = np.sum(operate_[in_bool])                 # performing the calculations on those points and getting their sum for each section
            operate_ = operate_[np.invert(in_bool)]              # shrinking the arrays by deleting the points already processed, to speed up the next iterations
            contact_poss = contact_poss[np.invert(in_bool)]      # as above

            #con_num += sum(inbox_(contact_poss, j))
            #inba_bool = inbox_(pos, j)
            #elbav[i] = 4 * np.pi * np.sum(np.power(r[inba_bool], 3)) / 3
            #pos = pos[np.invert(inba_bool)]
            #r = r[np.invert(inba_bool)]

        return elaps


r = np.load('a.npy')
pos = np.load('b.npy')
gap = np.load('c.npy')
ends_ind = np.load('d.npy')
elem_vert = np.load('e.npy')
contact_poss = np.load('f.npy')

elapses(r, pos, gap, ends_ind, elem_vert, contact_poss)

# a --------r-------> parameter corresponding to each coordinate (point); here radius                                  (23605,)     <class 'numpy.ndarray'> <class 'numpy.float64'>
# b -------pos------> coordinates of the points                                                                        (23605, 3)   <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
# c -------gap------> if we consider the points as spheres with radii [r], it is the maximum overlap length of the spheres      (103832,)    <class 'numpy.ndarray'> <class 'numpy.float64'>
# d ----ends_ind----> indices of each pair of overlapping spheres                                                               (103832, 2)  <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.int64'>
# e ---elem_vert----> vertices of the mesh's sections or cells                                                         (2000, 8, 3) <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
# f --contact_poss--> a coordinate between the paired spheres                                                          (103832, 3)  <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>

This code will be called frequently from another code with big-data inputs, so speeding it up is essential. I have tried to use the jit decorator from the JAX and Numba libraries to accelerate the code, but I could not get them working properly to improve it. I have tested the code on Colab (for 3 data sets with loop counts of 20, 250, and 2000) for speed, and the results were:

11  (ms), 47   (ms), 6.62 (s)       (per loop) <-- without the commented code lines in the code
137 (ms), 1.66 (s) , 4    (min)     (per loop) <-- with the commented code lines in the code activated

What this code does is find some coordinates within a range and then perform some calculations on them.
I would greatly appreciate any answer that can speed up this code significantly (I believe it can). I would also be grateful for any experienced recommendations about speeding up the code by changing (substituting) the NumPy methods used, or by writing dedicated methods for the math operations.

Notes:

  • The proposed answers must be executable with Python 2 (working in both versions 2 and 3 would be excellent).
  • The commented code lines are unnecessary for the main aim and are written just for further evaluations. Any recommendation for handling these lines along with the proposed answers is appreciated (but not required).

Data sets for test:
small data set: https://drive.google.com/file/d/1CswjyoqS8ogLmLQa_oNTOj221chDcbK8/view?usp=sharing
medium data set: https://drive.google.com/file/d/14RJ0Ackx88NzQWloops5FagzuNQYDSrh/view?usp=sharing
large data set: https://drive.google.com/file/d/1dJnXpb3HiAGcRC9PPTwui9joNcij4E_E/view?usp=sharing

Ali_Sh
  • As Barmar said, fix the variable names, and ideally also provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) – dankal444 Dec 24 '21 at 09:38
  • I don't know numpy well, but the general guideline in pandas and numpy is to avoid explicit loops; try to make as much use as possible of their built-in methods that operate on a df/array as a whole, because these will be implemented in C and possibly make use of vector operations. – Barmar Dec 24 '21 at 15:08

1 Answer

First of all, the algorithm can be improved to be much more efficient. Indeed, a polygon can be directly assigned to each point; this is essentially a classification of the points by polygon. Once the classification is done, you can perform one or many reductions by key, where the key is the polygon ID.

This new algorithm consists of the following steps:

  • computing all the bounding boxes of the polygons;
  • classifying the points by polygons;
  • performing the reduction by key (where the key is the polygon ID), as sketched just below.
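
A minimal NumPy-only sketch of that last reduction step (assuming `pointPolygonId` holds the polygon index assigned to each point, with -1 meaning "no matching polygon", and `operate_` holds the per-point values; both names match the full code further below):

# Sketch of the reduction by key in pure NumPy: np.bincount with weights
# sums `operate_` per polygon ID, skipping the unmatched points (-1).
valid = pointPolygonId >= 0
elaps = np.bincount(pointPolygonId[valid], weights=operate_[valid],
                    minlength=nPolygons)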

This approach is much more efficient than iterating over all the points for each polygon and filtering the attribute arrays (e.g. operate_ and contact_poss). Indeed, filtering is an expensive operation, since it requires the target array (which may not fit in the CPU caches) to be fully read and then written back. Not to mention that this operation requires a temporary array to be allocated/deleted if it is not performed in place, and that the operation cannot benefit from SIMD instructions on most x86/x86-64 platforms (as it requires the new AVX-512 instruction set). It is also harder to parallelize, since each filtering step is too fast for threads to be useful, yet the steps need to be done sequentially.

Regarding the implementation of the algorithm, Numba can be used to speed up the overall computation a lot. The main benefit of using Numba is to drastically reduce the number of expensive temporary arrays created by NumPy in your current implementation. Note that you can specify the function types to Numba so that it compiles each function when it is defined. Assertions can be used to make the code more robust and to help the compiler know the size of a given dimension, so as to generate significantly faster code (Numba's JIT compiler can unroll the loops). Ternary operators can help the JIT compiler a bit to generate a faster, branch-less program.
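
As a small illustration of the last two points, here is a hypothetical toy function (not part of the solution below): the explicit signature makes Numba compile it at definition time rather than on the first call, and the ternary in the loop can compile to a branch-less select:

import numba as nb

# Toy example only: eager compilation via an explicit signature, plus a
# ternary in the hot loop that the JIT can turn into branch-less code.
@nb.njit('float64(float64[::1])')
def relu_sum(x):
    s = 0.0
    for i in range(x.size):
        v = x[i]
        s += v if v > 0.0 else 0.0
    return s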

Note that the classification can easily be parallelized using multiple threads. However, one needs to be very careful about constant propagation, since some critical constants (like the shape of the working arrays and the assertions) tend not to be propagated to the code executed by threads, while this propagation is critical to optimize the hot loops (e.g. vectorization, unrolling). Note also that creating many threads can be expensive on machines with many cores (from 0.1 ms to 10 ms). Thus, it is often better to use a parallel implementation only on big input data.
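
For reference, here is a hedged toy sketch (not the actual classification kernel) of what such a parallel loop looks like with Numba; in the full listing below it amounts to adding `parallel=True` to the decorator and replacing `range` with `nb.prange` in the point loop:

import numba as nb

# Toy example only: `parallel=True` plus `nb.prange` distributes the loop
# iterations across threads; Numba handles the scalar reduction on `s`.
@nb.njit(parallel=True)
def sum_squares(x):
    s = 0.0
    for i in nb.prange(x.size):
        s += x[i] * x[i]
    return s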

Here is the resulting implementation (working with both Python 2 and Python 3):

import numba as nb
import numpy as np


@nb.njit('float64[::1](float64[::1], float64[::1], int64[:,::1])')
def operation_(r, gap, ends_ind):
    """ Calculation formula applied to the points specified by the findMatchingPolygons_ function """

    nPoints = ends_ind.shape[0]
    assert ends_ind.shape[1] == 2
    assert gap.size == nPoints

    formula = np.empty(nPoints, dtype=np.float64)

    for i in range(nPoints):
        ind0, ind1 = ends_ind[i]
        r0, r1 = r[ind0], r[ind1]
        r_sub = r0 - r1
        r_add = r0 + r1
        cur_gap = gap[i]
        paired_cent_dis = r_add + cur_gap
        formula[i] = (cur_gap**2 * (paired_cent_dis**2 + 5 * paired_cent_dis * r_add - 7 * r_sub**2)) / (3 * paired_cent_dis)

    return formula


# Use `parallel=True` for a parallel implementation
@nb.njit('int32[::1](float64[:,::1], float64[:,:,::1])')
def findMatchingPolygons_(points, polygons):
    """ Attribute to all point a region """

    nPolygons = polygons.shape[0]
    nPolygonPoints = polygons.shape[1]
    nPoints = points.shape[0]
    assert points.shape[1] == 3
    assert polygons.shape[2] == 3

    # Compute the bounding boxes of all polygons

    ll = np.empty((nPolygons, 3), dtype=np.float64)
    ur = np.empty((nPolygons, 3), dtype=np.float64)

    for i in range(nPolygons):
        ll_x, ll_y, ll_z = polygons[i, 0]
        ur_x, ur_y, ur_z = polygons[i, 0]

        for j in range(1, nPolygonPoints):
            x, y, z = polygons[i, j]
            ll_x = x if x<ll_x else ll_x
            ll_y = y if y<ll_y else ll_y
            ll_z = z if z<ll_z else ll_z
            ur_x = x if x>ur_x else ur_x
            ur_y = y if y>ur_y else ur_y
            ur_z = z if z>ur_z else ur_z

        ll[i] = ll_x, ll_y, ll_z
        ur[i] = ur_x, ur_y, ur_z

    # Find for each point its corresponding polygon

    pointPolygonId = np.empty(nPoints, dtype=np.int32)

    # Use `nb.prange(nPoints)` for a parallel implementation
    for i in range(nPoints):
        x, y, z = points[i, 0], points[i, 1], points[i, 2]
        pointPolygonId[i] = -1

        for j in range(nPolygons):
            if ll[j, 0] <= x < ur[j, 0] and ll[j, 1] <= y < ur[j, 1] and ll[j, 2] <= z < ur[j, 2]:
                pointPolygonId[i] = j
                break

    return pointPolygonId


@nb.njit('float64[::1](float64[:,:,::1], float64[:,::1], float64[::1])')
def computeSections_(elem_vert, contact_poss, operate_):
    nPolygons = elem_vert.shape[0]
    elaps = np.zeros(nPolygons, dtype=np.float64)

    pointPolygonId = findMatchingPolygons_(contact_poss, elem_vert)

    for i, polygonId in enumerate(pointPolygonId):
        if polygonId >= 0:
            elaps[polygonId] += operate_[i]

    return elaps


def elapses(r, pos, gap, ends_ind, elem_vert, contact_poss):
    if len(gap) > 0:
        operate_ = operation_(r, gap, ends_ind)
        return computeSections_(elem_vert, contact_poss, operate_)


r = np.load('a.npy')
pos = np.load('b.npy')
gap = np.load('c.npy')
ends_ind = np.load('d.npy')
elem_vert = np.load('e.npy')
contact_poss = np.load('f.npy')

elapses(r, pos, gap, ends_ind, elem_vert, contact_poss)

Here are the results on an old 2-core machine (i7-3520M):

Small dataset:
 - Original version:                5.53 ms
 - Proposed version (sequential):   0.22 ms  (x25)
 - Proposed version (parallel):     0.20 ms  (x27)

Medium dataset:
 - Original version:               53.40 ms
 - Proposed version (sequential):   1.24 ms  (x43)
 - Proposed version (parallel):     0.62 ms  (x86)

Big dataset:
 - Original version:               5742 ms
 - Proposed version (sequential):   144 ms   (x40)
 - Proposed version (parallel):      67 ms   (x86)

Thus, the proposed implementation is up to 86 times faster than the original one.

Jérôme Richard
  • Thanks, it is much faster than mine. I have some questions. The outputs are the same, but why do I get `False` from `print(np.array_equal(original, parallel))`? Is the written code in sequential or parallel mode now? It used `nb.prange(nPoints)` in `findMatchingPolygons_` and not in `operation_`; however, it seems to have no effect in `operation_`. If it is not in parallel mode, how can I modify it? I'm a little confused due to my limited knowledge of numba. I have used `@nb.njit('int32[::1](float64[:,::1], float64[:,:,::1])', parallel=True)` as it is commented in the code, but it made things worse – Ali_Sh Dec 28 '21 at 05:51
  • There are many useful tips in the answer. Thanks for those again. Just for my understanding, where should `nb.prange(nPoints)` be used, as you said ``Use `parallel=True` nb.prange(nPoints) for a parallel implementation``? You have written this in the function. Should it be written elsewhere? Does this command switch automatically between sequential and parallel by adding and deleting `parallel=True`? So, do you recommend this method (using numba) as faster and more efficient than any parallelization and vectorization of my written code for this purpose? – Ali_Sh Dec 28 '21 at 06:10
  • `np.array_equal(original, parallel)` is `False` because there is a slight change in the floating-point result. You can check that the relative difference is good with `np.max(np.abs((original-parallel)/original))`. The result should be very close to the precision limit of 64-bit floats. These slight changes are due to the computation of the sum, which does not use the exact same algorithm (NumPy uses a pairwise sum while this algorithm uses a classical sum). Keep in mind that floating-point operations are not associative. If you want extreme accuracy, you can use a Kahan sum (expensive). – Jérôme Richard Dec 28 '21 at 14:06
  • For the code, it was supposed to be sequential, but I forgot to remove the `nb.prange` in `findMatchingPolygons_`. I fixed the answer (sequential code). If you test with a small input, a slower execution is expected. However, for a big input, you should get a speed-up if you work with at least 4 cores. I got some issues with the propagation of constants on my 2-core machine, as stated in the answer, but succeeded in overcoming this with small fixes. You may experience similar issues. This is sad, but it is a missed optimization of current mainstream compilers (e.g. GCC & Clang/LLVM-lite). – Jérôme Richard Dec 28 '21 at 14:16
  • I have checked the results for `np.max(np.abs((original-parallel)/original))` and it was around `2e-15` to `1e-16`, which is acceptable. Unfortunately, I am using **windows**, and *numba parallelization* is not applicable with *numba version 0.47.0 and Python 2.7*, giving this [error](https://github.com/numba/numba/issues/5009): `UnsupportedParforsError: Failed in nopython mode pipeline (step: nopython mode backend) The 'parallel' target is not currently supported on Windows operating systems when using Python 2.7, or on 32 bit hardware.` – Ali_Sh Dec 31 '21 at 11:25
  • Indeed, I got the same issue, but only with Python 2. I think you can only use the sequential implementation. Python 2.7 is now very old and deprecated: it was released 12 years ago and has been obsolete for a few years. Its support ended 2 years ago (even bugs and vulnerabilities are not fixed anymore). Numba does not support Python 2 anymore either... Many libraries now have only minimal support for it, and they have recently been massively shifting to Python 3. – Jérôme Richard Dec 31 '21 at 13:38
  • Did you test it on Windows or 32-bit systems (did you get this error on other operating systems or on 64-bit hardware) using Python 2.7? The error that I got seems to have happened because of one of them. – Ali_Sh Dec 31 '21 at 19:22
  • I have a 64-bit Windows and I installed a 64-bit Python (2.7.18 AMD64) but still had the same problem. IDK if Numba (retrieved by pip) is compiled for a 32-bit platform. – Jérôme Richard Jan 01 '22 at 10:41
  • Off-topic question: if, for just the `operation_` function, we want to return more than one parameter, e.g. if `coords = np.array([x, y, z])` and `formula[i] = coords[0], coords[1], coords[2], r, f1, f2`, it could be achieved by changing `float64[::1]` to `float64[:,::1]` and using `np.empty((nPoints, 6))` instead. Is it possible to create `formula[i] = coords, r, np.array([f1, f2])`? If so, what changes should I make? I would appreciate your help. – Ali_Sh Feb 07 '22 at 05:34
  • If I correctly understand your point, using a `float64[:,::1]` type should indeed work. For the `formula[i] = coords, r, np.array([f1, f2])` line, I do not think it will work, because AFAIK the Python pattern matching does not support that yet. Assuming `coords[0], coords[1], coords[2], r, f1, f2` are all scalars, you can just do `formula[i, :] = (coords[0], coords[1], coords[2], r, f1, f2)`. If they are not scalars, then I think I did not catch your question. – Jérôme Richard Feb 08 '22 at 17:57