
I have a function that performs a point-in-polygon test. It takes two 2D numpy arrays as input (a series of points, and a polygon) and returns an array of booleans (True where the corresponding point lies inside the polygon, False otherwise). The code is borrowed from this SO answer. An example below:

import numpy as np
import numba  # needed for numba.boolean below
from numba import jit
from numba.pycc import CC

cc = CC('nbspatial')

@cc.export('array_tracing2', 'b1[:](f8[:,:], f8[:,:])')
@jit(nopython=True, nogil=True)
def array_tracing2(xy, poly):
    D = np.empty(len(xy), dtype=numba.boolean)
    n = len(poly)
    for i in range(len(D)):  # loop over every input point
        inside = False
        p2x = 0.0
        p2y = 0.0
        xints = 0.0
        p1x,p1y = poly[0]
        x = xy[i][0]
        y = xy[i][1]
        for j in range(n + 1):  # separate index so the outer loop variable is not clobbered
            p2x, p2y = poly[j % n]
            if y > min(p1y,p2y):
                if y <= max(p1y,p2y):
                    if x <= max(p1x,p2x):
                        if p1y != p2y:
                            xints = (y-p1y)*(p2x-p1x)/(p2y-p1y)+p1x
                        if p1x == p2x or x <= xints:
                            inside = not inside
            p1x,p1y = p2x,p2y
        D[i] = inside
    return D


if __name__ == "__main__":
    cc.compile()

The code above can be compiled by running `python numba_src.py` and tested with:

import numpy as np
# regular polygon for testing
lenpoly = 10000
polygon = np.array([[np.sin(x)+0.5,np.cos(x)+0.5] for x in np.linspace(0,2*np.pi,lenpoly)[:-1]])

# random set of points to test
N = 100000
# build the points as a plain (N, 2) array up front to make debugging easier
pp = np.array([np.random.random(N), np.random.random(N)]).T


import nbspatial
nbspatial.array_tracing2(pp, polygon) 

I would like to parallelize the code above so that it makes use of all the available CPUs.

I tried to follow the example from the official Numba documentation, using `@njit`:

import numpy as np
import numba
from numba import njit

@njit(parallel=True)
def array_tracing3(xy, poly):
    D = np.empty(len(xy), dtype=numba.boolean)
    n = len(poly)
    for i in range(len(D)):
        inside = False
        p2x = 0.0
        p2y = 0.0
        xints = 0.0
        p1x,p1y = poly[0]
        x = xy[i][0]
        y = xy[i][1]
        for j in range(n + 1):
            p2x, p2y = poly[j % n]
            if y > min(p1y,p2y):
                if y <= max(p1y,p2y):
                    if x <= max(p1x,p2x):
                        if p1y != p2y:
                            xints = (y-p1y)*(p2x-p1x)/(p2y-p1y)+p1x
                        if p1x == p2x or x <= xints:
                            inside = not inside
            p1x,p1y = p2x,p2y
        D[i] = inside
    return D

The code above completed for N = 1,000,000 in 55 s, versus 1 min 33 s for the precompiled serial version, but the system monitor shows only one CPU running at 100%.

How can I make use of all the available CPUs and return the result in a single array of booleans?

epifanio
  • Please always include an English description of what the code is supposed to do, rather than just big blocks of code. We are humans, not compilers. – John Zwinck Sep 24 '18 at 01:43
  • Thanks for the comments, I added a bit of humanity in the description and a reference to a previous SO question. Hope it is better now. – epifanio Sep 24 '18 at 02:08
  • Which version of Numba are you using? I'll note that 0.40 which was just released recently includes a large rewrite of the threading-related parts, and versions before that had a number of bugs (maybe 0.40 has bugs too, but I surely would not suggest earlier versions for multithreaded code). – John Zwinck Sep 24 '18 at 02:40
  • I'm using git master built on Sept. 23rd 2018: `'0.41.0dev0+17.g29e951436'` – epifanio Sep 24 '18 at 02:42

1 Answer


Numba's `parallel=True` enables automatic parallelization only for certain operations; it does not parallelize plain `range()` loops on its own. Replace the outer `range()` loop over the points with `numba.prange` to enable multicore computation.

See: https://numba.pydata.org/numba-doc/dev/user/parallel.html
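
A minimal sketch of that change, assuming the same function body as in the question: the outer loop over the points is switched to `numba.prange` (the inner loop index is renamed to `j` so it does not clash with the parallel index):

import numpy as np
import numba
from numba import njit

@njit(parallel=True)
def array_tracing3(xy, poly):
    D = np.empty(len(xy), dtype=numba.boolean)
    n = len(poly)
    # prange marks this loop as safe to split across threads when parallel=True
    for i in numba.prange(len(D)):
        inside = False
        p2x = 0.0
        p2y = 0.0
        xints = 0.0
        p1x, p1y = poly[0]
        x = xy[i, 0]
        y = xy[i, 1]
        for j in range(n + 1):
            p2x, p2y = poly[j % n]
            if y > min(p1y, p2y):
                if y <= max(p1y, p2y):
                    if x <= max(p1x, p2x):
                        if p1y != p2y:
                            xints = (y - p1y) * (p2x - p1x) / (p2y - p1y) + p1x
                        if p1x == p2x or x <= xints:
                            inside = not inside
            p1x, p1y = p2x, p2y
        D[i] = inside
    return D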

John Zwinck
  • Thank you! that was it! -- by adding `parallel=True` and using `numba.prange` I got the code to make use of all the CPUs. But it works only in the interactive, non-precompiled version of the code. I made a notebook to reproduce the example. Any clue why the precompiled version doesn't run in parallel? https://gist.github.com/47340242f5be2de2c50577bf82c37143 – epifanio Sep 24 '18 at 03:16
  • @epifanio: `numba.prange` is just a function which calls `range`. It is detected by the JIT compiler. I guess the precompiler (`pycc`) doesn't recognize it. I guess you could file a feature request, but for now it seems like `prange` may not be expected to do anything special when precompiled (same as if you JIT compile with `parallel=False`). – John Zwinck Sep 24 '18 at 03:22
  • Thank you for the explanation! I updated the gist to include a version that runs the numba function on a single point (instead of an array) plus a simple function that loops over the array using the amazing `numba.prange` (see the sketch after these comments) – epifanio Sep 24 '18 at 03:51
  • A note: Numba's `parallel=True` seems to work fine when running the code interactively but fails if I try to pre-compile the code using `numba.pycc`. Open issue [here](https://github.com/numba/numba/issues/3336) – epifanio Jan 03 '19 at 13:42