
Using Python (3.7.7) and numpy (1.17.4), I am working with medium-sized 2d numpy arrays (from 5000x80 up to 200,000x120). For a given array, I want to calculate the Hadamard product between all possible unique pairs of column-vectors of that array.

I have:

    A           A
[a,b,c,d]   [a,b,c,d]
[1,2,3,4]   [1,2,3,4]
[4,5,6,7] * [4,5,6,7]
[7,8,9,1]   [7,8,9,1]

and I want to get:

[ab,  ac,  ad,  bc,  bd,  cd]
[ 2.,  3.,  4.,  6.,  8., 12.]
[20., 24., 28., 30., 35., 42.]
[56., 63.,  7., 72.,  8.,  9.]

I already have a solution from a colleague using np.kron, which I adapted a bit:

import numpy as np

def hadamard_kron(A: np.ndarray) -> tuple:
    """Return the Hadamard products of all unique pairs of rows of A
    (i.e. columns of the original array, since this is called on its transpose),
    plus indices signifying which rows constitute a given pair.
    """
    n = A.shape[0]
    ind1 = np.kron(np.arange(n).reshape((n, 1)), np.ones((n, 1))).squeeze().astype(int)
    ind2 = np.kron(np.ones((n, 1)), np.arange(n).reshape((n, 1))).squeeze().astype(int)
    xmat2 = np.kron(A, np.ones((n, 1))) * np.kron(np.ones((n, 1)), A)

    hadamard_A = xmat2[ind2 > ind1, :]
    ind1_ = ind1[ind1 < ind2]
    ind2_ = ind2[ind1 < ind2]
    return hadamard_A, ind1_, ind2_

hadamard_A, first_pair_members, second_pair_members = hadamard_kron(a.transpose())

Note that hadamard_A is what I want, but transposed (which is also what I want for further processing). Also, ind1_ (ind2_) gives the indices for the objects which feature as the first (second) element in the pair for which the hadamard product is calculated. I need those as well.

However, I feel this code is too inefficient: it takes too long, and since I call this function several times during my algorithm, I was wondering whether there is a cleverer solution. Am I overlooking some numpy/scipy tools I could cleverly combine for this task?
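For reference, here is the naive baseline I check correctness against (my own sketch; the name `hadamard_naive` is mine). It is slow but obviously right:

```python
import itertools
import numpy as np

def hadamard_naive(A):
    """Columnwise Hadamard products of all unique column pairs of A,
    plus the two arrays of pair-member indices."""
    cols = A.shape[1]
    pairs = list(itertools.combinations(range(cols), 2))  # all (i, j) with i < j
    prods = np.column_stack([A[:, i] * A[:, j] for i, j in pairs])
    ind1 = np.array([i for i, _ in pairs])  # first member of each pair
    ind2 = np.array([j for _, j in pairs])  # second member of each pair
    return prods, ind1, ind2
```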

Thanks all! :)

cheshire
  • Did either of the posted solutions work for you? – Divakar Sep 03 '20 at 13:12
  • Apologies for my late reply, still getting into Stackoverflow's flow :) App2 and numba_parallel are really good, posted another comment below though. – cheshire Sep 07 '20 at 12:21

2 Answers


Approach #1

Simplest one with np.triu_indices -

In [45]: a
Out[45]: 
array([[1, 2, 3, 4],
       [4, 5, 6, 7],
       [7, 8, 9, 1]])

In [46]: r,c = np.triu_indices(a.shape[1],1)

In [47]: a[:,c]*a[:,r]
Out[47]: 
array([[ 2,  3,  4,  6,  8, 12],
       [20, 24, 28, 30, 35, 42],
       [56, 63,  7, 72,  8,  9]])

Approach #2

Memory-efficient one for large arrays -

m,n = a.shape
s = np.r_[0,np.arange(n-1,-1,-1).cumsum()]
out = np.empty((m, n*(n-1)//2), dtype=a.dtype)
for i,(s0,s1) in enumerate(zip(s[:-1], s[1:])):
    out[:,s0:s1] = a[:,i,None] * a[:,i+1:]
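A quick sanity check (my sketch, on a small random input) that this chunked loop fills columns in the same pair order as the `np.triu_indices` version:

```python
import numpy as np

a = np.random.rand(5, 6)
m, n = a.shape

# chunked fill, one left-column i at a time
s = np.r_[0, np.arange(n - 1, -1, -1).cumsum()]
out = np.empty((m, n * (n - 1) // 2), dtype=a.dtype)
for i, (s0, s1) in enumerate(zip(s[:-1], s[1:])):
    out[:, s0:s1] = a[:, i, None] * a[:, i + 1:]

# reference via triu_indices: same values, same order
r, c = np.triu_indices(n, 1)
assert np.allclose(out, a[:, r] * a[:, c])
```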

Approach #3

Masking based one -

m,n = a.shape
mask = ~np.tri(n,dtype=bool)
m3D = np.broadcast_to(mask, (m,n,n))

b1 = np.broadcast_to(a[...,None], (m,n,n))
b2 = np.broadcast_to(a[:,None,:], (m,n,n))
out = (b1[m3D]* b2[m3D]).reshape(m,-1)
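The boolean mask here selects the strict upper triangle, and boolean fancy-indexing walks it in row-major order, so the pairs come out in the same `(i, j), i < j` order as `np.triu_indices`. A quick check of that claim:

```python
import numpy as np

n = 4
mask = ~np.tri(n, dtype=bool)   # True strictly above the diagonal
ii, jj = np.nonzero(mask)       # row-major walk over the mask
r, c = np.triu_indices(n, 1)

# both enumerate the pairs in the same order
assert (ii == r).all() and (jj == c).all()
```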

Approach #4

Extend approach #2 for a numba one -

from numba import njit

def numba_app(a):
    m,n = a.shape
    out = np.empty((m, n*(n-1)//2), dtype=a.dtype)
    return numba_func(a,out,m,n)

@njit
def numba_func(a,out,m,n):
    for p in range(m):
        I = 0
        for i in range(n):
            for j in range(i+1,n):
                out[p,I] = a[p,i] * a[p,j]
                I += 1
    return out

Then, leverage parallel processing (as pointed out in comments by @max9111), like so -

from numba import prange

def numba_app_parallel(a):
    m,n = a.shape
    out = np.empty((m, n*(n-1)//2), dtype=a.dtype)
    return numba_func_parallel(a,out,m,n)

@njit(parallel=True)
def numba_func_parallel(a,out,m,n):
    for p in prange(m):
        I = 0
        for i in range(n):
            for j in range(i+1,n):
                out[p,I] = a[p,i] * a[p,j]
                I += 1
    return out

Benchmarking

Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.

import benchit
in_ = [np.random.rand(5000, 80), np.random.rand(10000, 100), np.random.rand(20000, 120)]
funcs = [ehsan, app1, app2, app3, numba_app, numba_app_parallel]
t = benchit.timings(funcs, in_, indexby='shape')
t.rank()
t.plot(logx=False, save='timings.png')

(benchmark plot from timings.png: runtimes of all solutions across the three input shapes; image not reproduced)

Conclusion: the Numba ones seem to be doing pretty well, and app2 among the NumPy ones.

Divakar
  • In your Numba code you use a non-contiguous memory access pattern. If you make `for p in range(m):` the outer loop, the Numba solution should be approx. 3 times faster than app_2. If you parallelize the Numba function (or maybe use some intrinsic), another factor of 2 is possible in this write-to-memory-limited example. – max9111 Aug 26 '20 at 13:39
  • @max9111 Yup, that's what I was wondering about, why numba was doing so badly. Thanks! Edited for the memory-aligned accesses. On your second part, did you mean using `@njit(parallel=True)`? That didn't boost any perf. Or maybe you meant the datatype-specified syntax? That could be tried, I guess. – Divakar Aug 26 '20 at 13:59
  • Yes, and `for p in numba.prange(m):` on the outer loop. But the number of cores does not matter (it is always a factor of two with at least two cores). In C you would use a non-temporal store to get full write bandwidth with only one core https://stackoverflow.com/a/25835186/4045774 – max9111 Aug 26 '20 at 14:07
  • @max9111 `prange` didn't help, not in a noticeable way anyway. Maybe needs a better system config? – Divakar Aug 26 '20 at 14:08
  • @max9111 Yeah, I was using parallel without prange. That was the mistake. Thanks! Edited post with those numbers. – Divakar Aug 27 '20 at 06:03
  • Thanks a lot for your help, everyone; I hope I will write code that efficient at some point in my life! Using benchit (I like it), I can replicate that _numba app parallel_ is the fastest. However, if I return not just `out` but `out.transpose()`, _numba app parallel_ is slower than _app2_ (both being slower than not transposing and using numba parallel). Is there a numba-specific reason why transposing within the function definition makes the whole function so much slower? Edit: I need the result to be transposed. Then again, I could transpose the input and get the Hadamard products of all row pairs. – cheshire Sep 07 '20 at 12:18
  • @cheshire Yeah I think numba forces copy with that transpose. If your endgoal is the transposed version, simply get the single output off numba and transpose at Python/NumPy level. – Divakar Sep 09 '20 at 19:38

An approach equivalent to Divakar's first one:

r,c = np.triu_indices(A.shape[1],1)
np.einsum('ij,ik->ijk',A,A)[:,r,c]

output:

[[ 2  3  4  6  8 12]
 [20 24 28 30 35 42]
 [56 63  7 72  8  9]]
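Worth noting (a quick sketch to illustrate): the einsum call materializes the full (m, n, n) cube of per-row outer products before the triangle slice, so this trades memory for brevity:

```python
import numpy as np

A = np.array([[1, 2, 3, 4],
              [4, 5, 6, 7],
              [7, 8, 9, 1]])

r, c = np.triu_indices(A.shape[1], 1)
cube = np.einsum('ij,ik->ijk', A, A)   # full (m, n, n): cube[p, i, j] = A[p, i] * A[p, j]
out = cube[:, r, c]                    # keep only the unique (i < j) pairs
```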
Ehsan