
I have a function that operates on a 2D matrix of float64 (x, y). Basic concept: for each combination of rows (no. rows choose 2), count the number of positive values after subtraction (row1 - row2). In a 2D matrix of int64 (y, y), store this value at index [row1, row2] if the value is above a certain threshold, and at [row2, row1] if it is below.

I've implemented that and decorated it with @njit(parallel=False), and that works fine. @njit(parallel=True) seems to give no speedup, though. Trying to speed up the whole thing, I had a look at @guvectorize, which works as well. However, I'm not able to figure out how to use @guvectorize with target='parallel' effectively in this case either.

I had a look at numba guvectorize target='parallel' slower than target='cpu', where the solution was to use @vectorize instead, but I cannot transfer that solution to my problem, so I am now seeking help :)

Basic jitted and guvectorized implementation

import numpy as np
from numba import njit, guvectorize, prange
import timeit

@njit(parallel=False)
def check_pairs_sg(raw_data):
    # 2D array to be filled
    result = np.full((len(raw_data), len(raw_data)), -1)

    # Iterate over all possible gene combinations
    for r1 in range(0, len(raw_data)):
        for r2 in range(r1+1, len(raw_data)):
            diff = np.subtract(raw_data[:, r1], raw_data[:, r2])

            num_pos = len(np.where(diff > 0)[0])

            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos

    return result

@njit(parallel=True)
def check_pairs_multi(raw_data):
    # 2D array to be filled
    result = np.full((len(raw_data), len(raw_data)), -1)

    # Iterate over all possible gene combinations
    for r1 in range(0, len(raw_data)):
        for r2 in prange(r1+1, len(raw_data)):
            diff = np.subtract(raw_data[:, r1], raw_data[:, r2])

            num_pos = len(np.where(diff > 0)[0])

            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos

    return result

@guvectorize(["void(float64[:,:], int64[:,:])"],
             "(n,m)->(m,m)", target='cpu')
def check_pairs_guvec_sg(raw_data, result):
    for r1 in range(0, len(result)):
        for r2 in range(r1+1, len(result)):
            diff = np.subtract(raw_data[:, r1], raw_data[:, r2])

            num_pos = len(np.where(diff > 0)[0])

            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos

@guvectorize(["void(float64[:,:], int64[:,:])"],
             "(n,m)->(m,m)", target='parallel')
def check_pairs_guvec_multi(raw_data, result):
    for r1 in range(0, len(result)):
        for r2 in range(r1+1, len(result)):
            diff = np.subtract(raw_data[:, r1], raw_data[:, r2])

            num_pos = len(np.where(diff > 0)[0])

            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos

if __name__=="__main__":
    np.random.seed(404)
    a = np.random.random((512,512)).astype(np.float64)
    res = np.full((len(a), len(a)), -1)

and measured with

%timeit check_pairs_sg(a)
%timeit check_pairs_multi(a)
%timeit check_pairs_guvec_sg(a, res)
%timeit check_pairs_guvec_multi(a, res)

resulting in:

614 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
507 ms ± 6.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
622 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
671 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I can't wrap my head around how to implement this as a @vectorize function or a proper parallel @guvectorize in order to fill the resulting 2D array truly in parallel.

I guess this is my first step before trying to take this further to the GPU.

Any help is highly appreciated.

iR0Nic
  • "I guess this is my first step before trying to taking this further to gpu.". Are you sure that you can't get rid of the nested `for` loops first before even turning to `numba`? – roganjosh Mar 28 '19 at 13:11
  • It's not clear to me how I should run this to have a go at doing that because you have two `if __name__ == '__main__'` guards – roganjosh Mar 28 '19 at 13:13
  • Well, I could use ```itertools``` to yield one combination tuple after the other, so there is only one loop, but how would that help? PS: I removed the second main – iR0Nic Mar 28 '19 at 13:14
  • Well, it looks like you could also use `np.roll()` to offset by 1 instead of the inner loop but I can't tell because I don't know what the function is doing. – roganjosh Mar 28 '19 at 13:15
  • In fact, I'm pretty sure most of this could be vectorized, dropping some `for` loops and the `if` checks too. Jumping to numba and then to GPU before ensuring the numpy approach is appropriate is neglecting the problem. – roganjosh Mar 28 '19 at 13:16
  • Sorry if the example is limited and too vague. If you like you can have a look at the full code and its purpose at https://github.com/rfechtner/pypairs/blob/master/pypairs/tools/sandbag.py – iR0Nic Mar 28 '19 at 13:17
  • Please give an example of `raw_data` and I can have a go at vectorizing it (realistically I'm aware that this will just be randomly generated data but I'm curious what it should look like) – roganjosh Mar 28 '19 at 13:25
  • First off: thank you very much for your time, I really appreciate it! You could use ```np.random.uniform(low=0.5, high=13.3, size=(10,10))```. Or you can have a look at [test_sandbag](https://github.com/rfechtner/pypairs/blob/master/pypairs/tests/test_sandbag.py) – iR0Nic Mar 28 '19 at 13:32
  • Are you sure this code works? I get the same result every single time, regardless of the random input. I'm looking specifically at `check_pairs_sg` – roganjosh Mar 28 '19 at 13:46
  • ```np.random.seed(404); test = np.random.random((10,10)).astype(np.float64); res_a = check_pairs_sg(test); np.random.seed(343); test = np.random.random((10,10)).astype(np.float64); res_b = check_pairs_sg(test); np.allclose(res_a,res_b) ``` => False – iR0Nic Mar 28 '19 at 13:50
  • I haven't set a seed. _Without_ a seed it still gives the same answer, so the result is deterministic regardless of any random input. – roganjosh Mar 28 '19 at 13:52
  • That goes completely against the premise in your first paragraph. Why should my output _always_ be the same if the input numbers are random and changing? In that case, you might as well just return a fixed array and completely ignore the input to the function or anything in the function body – roganjosh Mar 28 '19 at 13:56
  • ```np.random.seed(404); test_a = np.random.random((10,10)).astype(np.float64); res_a = check_pairs_sg(test_a); test_b = np.random.random((10,10)).astype(np.float64); res_b = check_pairs_sg(test_b); print(np.allclose(test_a,test_b)) // False; print(np.allclose(res_a,res_b)) // False; ``` Sorry, I can't quite follow you; the algorithm is deterministic: same input -> same output; different random input -> different random output – iR0Nic Mar 28 '19 at 13:56
  • Ok, sorry, I'm missing something here because I can see what you're trying to illustrate but the output on my screen simply does not match it. I'm visually inspecting the outputs for a smaller sample and it's just always the same. Perhaps it's not possible for me to scale the approach back to something I can visually test against – roganjosh Mar 28 '19 at 13:58
  • https://repl.it/@RonFechtner/ElderlyInvolvedDesigns – iR0Nic Mar 28 '19 at 14:01
  • Your access pattern to the array is suboptimal. This has a high impact on performance (numpy arrays are C-ordered by default). Another problem is the creation of temporary arrays. – max9111 Mar 28 '19 at 16:36
  • @max9111 could you elaborate more? I understand that in C-ordered arrays the last index is the fastest-varying one. How would a Fortran-ordered array help? – iR0Nic Mar 28 '19 at 19:24

1 Answer


Think of other compiled languages when writing Numba code

For example, think of a more or less exact equivalent implementation of the lines

diff = np.subtract(raw_data[:, r1], raw_data[:, r2])
num_pos = len(np.where(diff > 0)[0])

in C++.

Pseudo Code

  • Allocate an array diff, loop over raw_data[i*size_dim_1+r1] (loop index is i)
  • Allocate a Boolean array, loop over the whole array diff and check if diff[i]>0
  • Loop over the Boolean array, get the indices where b_arr==True and save them via vector::push_back() to a vector.
  • Check the size of the vector
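
Spelled out step by step in Python (just a rough sketch of the same work, with made-up names), those two lines allocate several temporaries per pair only to read off a single integer at the end:

import numpy as np

def count_positive_with_temporaries(raw_data, r1, r2):
    col1 = raw_data[:, r1]     # strided (non-contiguous) column access
    col2 = raw_data[:, r2]
    diff = col1 - col2         # temporary float64 array
    mask = diff > 0            # temporary boolean array
    idx = np.where(mask)[0]    # temporary index array (the push_back step)
    return len(idx)            # only the length is ever used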

The main problems in your code are:

  • Creating temporary arrays for simple operations
  • Non-contiguous memory access

Optimizing the code

Removing temporary arrays and simplification

import numpy as np
import numba as nb

@nb.njit(parallel=False)
def check_pairs_simp(raw_data):
    # 2D array to be filled
    result = np.full((raw_data.shape[0],raw_data.shape[1]), -1)
    
    # Iterate over all possible gene combinations
    for r1 in range(0, raw_data.shape[1]):
        for r2 in range(r1+1, raw_data.shape[1]):
            num_pos=0
            for i in range(raw_data.shape[0]):
                if (raw_data[i,r1]>raw_data[i,r2]):
                    num_pos+=1
            
            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos
    
    return result

Removing temporary arrays and simplification + contiguous memory access

@nb.njit(parallel=False)
def check_pairs_simp_rev(raw_data_in):
    #Create a transposed array not just a view 
    raw_data=np.ascontiguousarray(raw_data_in.T)
    
    # 2D array to be filled
    result = np.full((raw_data.shape[0],raw_data.shape[1]), -1)
    
    # Iterate over all possible gene combinations
    for r1 in range(0, raw_data.shape[0]):
        for r2 in range(r1+1, raw_data.shape[0]):
            num_pos=0
            for i in range(raw_data.shape[1]):
                if (raw_data[r1,i]>raw_data[r2,i]):
                    num_pos+=1
            
            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos
    
    return result

Removing temporary arrays and simplification + contiguous memory access + parallelization

@nb.njit(parallel=True,fastmath=True)
def check_pairs_simp_rev_p(raw_data_in):
    #Create a transposed array not just a view 
    raw_data=np.ascontiguousarray(raw_data_in.T)
    
    # 2D array to be filled
    result = np.full((raw_data.shape[0],raw_data.shape[1]), -1)
    
    # Iterate over all possible gene combinations
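    # prange on the outer loop: each thread independently handles its own block of r1 values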
    for r1 in nb.prange(0, raw_data.shape[0]):
        for r2 in range(r1+1, raw_data.shape[0]):
            num_pos=0
            for i in range(raw_data.shape[1]):
                if (raw_data[r1,i]>raw_data[r2,i]):
                    num_pos+=1
            
            # Arbitrary check to illustrate
            if num_pos >= 5: 
               result[r1,r2] = num_pos
            else:
               result[r2,r1] = num_pos
    
    return result

Timings

%timeit check_pairs_sg(a)
488 ms ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit check_pairs_simp(a)
186 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit check_pairs_simp_rev(a)
12.1 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit check_pairs_simp_rev_p(a)
5.43 ms ± 49.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
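
As a quick sanity check (not part of the measurements above, just a suggested snippet assuming the functions from the question and this answer are defined), all variants should produce the same result matrix for the same input:

np.random.seed(404)
a = np.random.random((512, 512)).astype(np.float64)
reference = check_pairs_sg(a)
assert np.array_equal(reference, check_pairs_simp(a))
assert np.array_equal(reference, check_pairs_simp_rev(a))
assert np.array_equal(reference, check_pairs_simp_rev_p(a))
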
max9111
  • Instead of using `if (raw_data[r1,i]-raw_data[r2,i]) > 0` you could also use `if raw_data[r1,i] > raw_data[r2,i]` – MSeifert Mar 28 '19 at 19:42
  • @MSeifert Thanks for the comment. This gives another 10% speedup – max9111 Mar 28 '19 at 20:03
  • Hey max9111 thank you very much for the insights. Your answer not only helped to improve my code significantly but also made me understand a lot about numpy. Cheers – iR0Nic Mar 30 '19 at 08:58