
I have a function that operates on a 2D matrix of float64 (x, y). Basic concept: for each combination of rows (no. rows choose 2), count the number of positive values after subtraction (row1 - row2). In a 2D matrix of int64 (y, y), store this value at index [row1, row2] if the value is above a certain threshold, and at [row2, row1] if it is below.

I've implemented that and decorated it with @njit(parallel=False), and that works fine. @njit(parallel=True) seems to give no speedup, though. Trying to speed up the whole thing, I had a look at @guvectorize, which works as well. However, I'm not able to figure out how to use @guvectorize with target='parallel' effectively in this case either.

I had a look at numba guvectorize target='parallel' slower than target='cpu', where the solution was to use @vectorize instead, but I cannot transfer that solution to my problem, so I am now seeking help :)

Basic jitted and guvectorized implementation

import numpy as np
from numba import njit, guvectorize, prange
import timeit

@njit(parallel=False)
def check_pairs_sg(raw_data):
    # 2D array to be filled
    result = np.full((len(raw_data), len(raw_data)), -1)

    # Iterate over all possible gene combinations
    for r1 in range(0, len(raw_data)):
        for r2 in range(r1+1, len(raw_data)):
            diff = np.subtract(raw_data[:, r1], raw_data[:, r2])

            num_pos = len(np.where(diff > 0)[0])

            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos

    return result

@njit(parallel=True)
def check_pairs_multi(raw_data):
    # 2D array to be filled
    result = np.full((len(raw_data), len(raw_data)), -1)

    # Iterate over all possible gene combinations
    for r1 in range(0, len(raw_data)):
        for r2 in prange(r1+1, len(raw_data)):
            diff = np.subtract(raw_data[:, r1], raw_data[:, r2])

            num_pos = len(np.where(diff > 0)[0])

            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos

    return result

@guvectorize(["void(float64[:,:], int64[:,:])"],
             "(n,m)->(m,m)", target='cpu')
def check_pairs_guvec_sg(raw_data, result):
    for r1 in range(0, len(result)):
        for r2 in range(r1+1, len(result)):
            diff = np.subtract(raw_data[:, r1], raw_data[:, r2])

            num_pos = len(np.where(diff > 0)[0])

            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos

@guvectorize(["void(float64[:,:], int64[:,:])"],
             "(n,m)->(m,m)", target='parallel')
def check_pairs_guvec_multi(raw_data, result):
    for r1 in range(0, len(result)):
        for r2 in range(r1+1, len(result)):
            diff = np.subtract(raw_data[:, r1], raw_data[:, r2])

            num_pos = len(np.where(diff > 0)[0])

            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos

if __name__=="__main__":
    np.random.seed(404)
    a = np.random.random((512,512)).astype(np.float64)
    res = np.full((len(a), len(a)), -1)

and measured with

%timeit check_pairs_sg(a)
%timeit check_pairs_multi(a)
%timeit check_pairs_guvec_sg(a, res)
%timeit check_pairs_guvec_multi(a, res)

resulting in:

614 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
507 ms ± 6.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
622 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
671 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I can't wrap my head around how to implement this as a @vectorize function or a proper parallel @guvectorize in order to fill the resulting 2D array truly in parallel.

I guess this is my first step before trying to take this further to the GPU.

Any help is highly appreciated.

iR0Nic
  • "I guess this is my first step before trying to taking this further to gpu.". Are you sure that you can't get rid of the nested `for` loops first before even turning to `numba`? – roganjosh Mar 28 '19 at 13:11
  • It's not clear to me how I should run this to have a go at doing that because you have two `if __name__ == '__main__'` guards – roganjosh Mar 28 '19 at 13:13
  • Well, I could use ```itertools``` to yield one combination tuple after the other, so there is only one loop, but how would that help? PS: I removed the second main – iR0Nic Mar 28 '19 at 13:14
  • Well, it looks like you could also use `np.roll()` to offset by 1 instead of the inner loop but I can't tell because I don't know what the function is doing. – roganjosh Mar 28 '19 at 13:15
  • In fact, I'm pretty sure most of this could be vectorized, dropping some `for` loops and the `if` checks too. Jumping to numba and then to GPU before ensuring the numpy approach is appropriate is neglecting the problem. – roganjosh Mar 28 '19 at 13:16
  • Sorry if the example is limited and too vague. If you like you can have a look at the full code and its purpose at https://github.com/rfechtner/pypairs/blob/master/pypairs/tools/sandbag.py – iR0Nic Mar 28 '19 at 13:17
  • Please give an example of `raw_data` and I can have a go at vectorizing it (realistically I'm aware that this will just be randomly generated data but I'm curious what it should look like) – roganjosh Mar 28 '19 at 13:25
  • First off: thank you very much for your time, I really appreciate it! You could use ```np.random.uniform(low=0.5, high=13.3, size=(10,10))```. Or you can have a look at [test_sandbag](https://github.com/rfechtner/pypairs/blob/master/pypairs/tests/test_sandbag.py) – iR0Nic Mar 28 '19 at 13:32
  • Are you sure this code works? I get the same result every single time, regardless of the random input. I'm looking specifically at `check_pairs_sg` – roganjosh Mar 28 '19 at 13:46
  • ```np.random.seed(404); test = np.random.random((10,10)).astype(np.float64); res_a = check_pairs_sg(test); np.random.seed(343); test = np.random.random((10,10)).astype(np.float64); res_b = check_pairs_sg(test); np.allclose(res_a,res_b) ``` => False – iR0Nic Mar 28 '19 at 13:50
  • I haven't set a seed. _Without_ a seed it still gives the same answer, so the result is deterministic regardless of any random input. – roganjosh Mar 28 '19 at 13:52
  • That goes completely against the premise in your first paragraph. Why should my output _always_ be the same if the input numbers are random and changing? In that case, you might as well just return a fixed array and completely ignore the input to the function or anything in the function body – roganjosh Mar 28 '19 at 13:56
  • ```np.random.seed(404); test_a = np.random.random((10,10)).astype(np.float64); res_a = check_pairs_sg(test_a); test_b = np.random.random((10,10)).astype(np.float64); res_b = check_pairs_sg(test_b); print(np.allclose(test_a,test_b)) // False; print(np.allclose(res_a,res_b)) // False; ``` Sorry, I can't quite follow you; the algorithm is deterministic: same input -> same output; different random input -> different random output – iR0Nic Mar 28 '19 at 13:56
  • Ok, sorry, I'm missing something here because I can see what you're trying to illustrate but the output on my screen simply does not match it. I'm visually inspecting the outputs for a smaller sample and it's just always the same. Perhaps it's not possible for me to scale the approach back to something I can visually test against – roganjosh Mar 28 '19 at 13:58
  • https://repl.it/@RonFechtner/ElderlyInvolvedDesigns – iR0Nic Mar 28 '19 at 14:01
  • Your access pattern to the array is suboptimal. This has a high impact on performance (numpy arrays are C-ordered by default). Another problem is the creation of temporary arrays. – max9111 Mar 28 '19 at 16:36
  • @max9111 could you elaborate more? I understand that in C-ordered arrays the last index is the fastest-varying one. How would a Fortran-ordered array help? – iR0Nic Mar 28 '19 at 19:24

1 Answer


Think of other compiled languages when writing Numba code

For example, think of a more or less exact equivalent implementation of the lines

diff = np.subtract(raw_data[:, r1], raw_data[:, r2])
num_pos = len(np.where(diff > 0)[0])

in C++.

Pseudo Code

  • Allocate an array diff, loop over raw_data[i*size_dim_1+r1] (loop index is i)
  • Allocate a Boolean array, loop over the whole array diff and check if diff[i]>0
  • Loop over the Boolean array, get the indices where b_arr==True and save them via vector::push_back() to a vector.
  • Check the size of the vector
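
Spelled out step by step in Python (just a rough sketch of the same work, with made-up names), those two lines allocate several temporaries per pair only to read off a single integer at the end:

import numpy as np

def count_positive_with_temporaries(raw_data, r1, r2):
    col1 = raw_data[:, r1]     # strided (non-contiguous) column access
    col2 = raw_data[:, r2]
    diff = col1 - col2         # temporary float64 array
    mask = diff > 0            # temporary boolean array
    idx = np.where(mask)[0]    # temporary index array (the push_back step)
    return len(idx)            # only the length is ever used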

The main problems in your code are:

  • Creating temporary arrays for simple operations
  • Non-contiguous memory access

Optimizing the code

Removing temporary arrays and simplification

import numpy as np
import numba as nb

@nb.njit(parallel=False)
def check_pairs_simp(raw_data):
    # 2D array to be filled
    result = np.full((raw_data.shape[0],raw_data.shape[1]), -1)
    
    # Iterate over all possible gene combinations
    for r1 in range(0, raw_data.shape[1]):
        for r2 in range(r1+1, raw_data.shape[1]):
            num_pos=0
            for i in range(raw_data.shape[0]):
                if (raw_data[i,r1]>raw_data[i,r2]):
                    num_pos+=1
            
            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos
    
    return result

Removing temporary arrays and simplification + contiguous memory access

@nb.njit(parallel=False)
def check_pairs_simp_rev(raw_data_in):
    #Create a transposed array not just a view 
    raw_data=np.ascontiguousarray(raw_data_in.T)
    
    # 2D array to be filled
    result = np.full((raw_data.shape[0],raw_data.shape[1]), -1)
    
    # Iterate over all possible gene combinations
    for r1 in range(0, raw_data.shape[0]):
        for r2 in range(r1+1, raw_data.shape[0]):
            num_pos=0
            for i in range(raw_data.shape[1]):
                if (raw_data[r1,i]>raw_data[r2,i]):
                    num_pos+=1
            
            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos
    
    return result

Removing temporary arrays and simplification + contiguous memory access + parallelization

@nb.njit(parallel=True,fastmath=True)
def check_pairs_simp_rev_p(raw_data_in):
    #Create a transposed array not just a view 
    raw_data=np.ascontiguousarray(raw_data_in.T)
    
    # 2D array to be filled
    result = np.full((raw_data.shape[0],raw_data.shape[1]), -1)
    
    # Iterate over all possible gene combinations
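    # prange on the outer loop: each thread independently handles its own block of r1 values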
    for r1 in nb.prange(0, raw_data.shape[0]):
        for r2 in range(r1+1, raw_data.shape[0]):
            num_pos=0
            for i in range(raw_data.shape[1]):
                if (raw_data[r1,i]>raw_data[r2,i]):
                    num_pos+=1
            
            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos
    
    return result

Timings

%timeit check_pairs_sg(a)
488 ms ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit check_pairs_simp(a)
186 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit check_pairs_simp_rev(a)
12.1 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit check_pairs_simp_rev_p(a)
5.43 ms ± 49.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
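
As a quick sanity check (not part of the measurements above, just a suggested snippet assuming the functions from the question and this answer are defined), all variants should produce the same result matrix for the same input:

np.random.seed(404)
a = np.random.random((512, 512)).astype(np.float64)
reference = check_pairs_sg(a)
assert np.array_equal(reference, check_pairs_simp(a))
assert np.array_equal(reference, check_pairs_simp_rev(a))
assert np.array_equal(reference, check_pairs_simp_rev_p(a))
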
max9111
  • Instead of using `if (raw_data[r1,i]-raw_data[r2,i]) > 0` you could also use `if raw_data[r1,i] > raw_data[r2,i]` – MSeifert Mar 28 '19 at 19:42
  • @MSeifert Thanks for the comment. This gives another 10% speedup – max9111 Mar 28 '19 at 20:03
  • Hey max9111 thank you very much for the insights. Your answer not only helped to improve my code significantly but also made me understand a lot about numpy. Cheers – iR0Nic Mar 30 '19 at 08:58