
Toy example

I have two arrays with different shapes, for example:

import numpy as np
matrix = np.arange(5*6*7*8).reshape(5, 6, 7, 8)
vector = np.arange(1, 20, 2)

What I want to do is multiply each element of 'matrix' by one of the elements of 'vector' and then sum over the last two axes. For that, I have an array with the same shape as 'matrix' that tells me which index to use, for example:

Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size
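
Spelling out the reduction, the goal is to compute, for every pair (i, j):

res[i, j] = sum over k, l of matrix[i, j, k, l] * vector[Idx[i, j, k, l]]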

I know that one solution would be to do:

matVec = vector[Idx]
res = np.sum(matrix*matVec, axis=(2, 3))

or even:

res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)

Problem

However, my problem is that my arrays are big, and the construction of matVec takes both time and memory. So is there a way to bypass that and still achieve the same result?
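
For what it's worth, here is a sketch of one way to sidestep matVec entirely (my own illustration, not part of the original post): since vector is small, the coefficient of each vector entry can be accumulated per (i, j) cell with np.bincount, leaving only a small matrix-vector product at the end. Whether this wins in practice depends on the sizes involved (it still builds a keys array as large as Idx), and np.bincount only takes real weights, so complex data would need its real and imaginary parts accumulated separately. Continuing from the toy example above:

ni, nj = matrix.shape[:2]
nv = vector.size
# shift the indices so that the bincount keys are unique per (i, j) cell
keys = Idx.reshape(ni*nj, -1) + nv*np.arange(ni*nj)[:, None]
# weights[c, v] = sum of the matrix entries in cell c that point to vector[v]
weights = np.bincount(keys.ravel(), weights=matrix.astype(float).ravel(),
                      minlength=ni*nj*nv).reshape(ni*nj, nv)
res = (weights @ vector).reshape(ni, nj)  # numerically matches the vector[Idx] route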

More realistic example

Here is a more realistic example of what I'm actually doing:

import numpy as np

order = 20
dim = 23

listOrder = np.arange(-order, order+1, 1)
N, P = np.meshgrid(listOrder, listOrder)
K = np.arange(-2*dim+1, 2*dim+1, 1)
X = np.arange(-2*dim, 2*dim, 1)

# broadcast N, P, K and X to the common 4D shape (len(listOrder), len(listOrder), len(K), len(X))
tN = np.einsum('..., p, x -> ...px', N, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))
tP = np.einsum('..., p, x -> ...px', P, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))
tK = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), K, np.ones(X.shape, dtype=int))
tX = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), np.ones(K.shape, dtype=int), X)
tL = tK + tX

mini, maxi = -4*dim+1, 4*dim-1
NmPp2L = np.arange(2*mini-2*order, 2*maxi+2*order+1)
Idx = (2*tL+tN-tP) - NmPp2L[0]

np.random.seed(0)
matrix = (np.random.rand(Idx.size) + 1j*np.random.rand(Idx.size)).reshape(Idx.shape)
vector = np.random.rand(np.max(Idx)+1) + 1j*np.random.rand(np.max(Idx)+1)

res = np.sum(matrix*vector[Idx], axis=(2, 3))
  • I think operation vector[Idx] should not take considerable time unless your memory is full. – Okapi575 Sep 13 '22 at 12:39
  • The thing is, I do this operation a number of times, which as a whole takes time. I ran a test for 100 s, of which 10 s were spent on this operation alone. Also, memory is not a problem for the moment, but it will become one at some point, which is another reason I would like to avoid this operation, as it creates a large array. – Görg Sep 13 '22 at 12:57
  • How big are the values in practice? If `Idx` is very random and `vector` is very large, then memory accesses can be pretty slow. I doubt the provided code can be an issue. Please update it to reflect the actual problem you have on your machine (so as to make it *reproducible*). – Jérôme Richard Sep 13 '22 at 13:11
  • @Okapi575 I beg to differ. Since that is [advanced indexing](https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing), it will create a copy of the original array, and for sufficiently large inputs, this can be a decent contributor to the overall time. – norok2 Sep 13 '22 at 13:31

1 Answer


For larger data arrays

import numpy as np

matrix = np.arange(50*60*70*80).reshape(50, 60, 70, 80)
vector = np.arange(1, 50, 2)
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size

parallel numba speeds up the computation and avoids creating matVec.

On a 4-core Intel Xeon Platinum 8259CL

matVec = vector[Idx]
# %timeit 48.4 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
# %timeit 26.9 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

against an unoptimized numba implementation

import numba as nb

@nb.njit(parallel=True)
def func(matrix, idx, vector):
    res = np.zeros((matrix.shape[0],matrix.shape[1]), dtype=matrix.dtype)
    for i in nb.prange(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            for k in range(matrix.shape[2]):
                for l in range(matrix.shape[3]):
                    res[i,j] += matrix[i,j,k,l] * vector[idx[i,j,k,l]]
    return res

func(matrix, Idx, vector)  # compile run
func(matrix, Idx, vector)
# %timeit 21.7 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# (48.4 + 26.9) / 21.7 = ~3.47x speed up

np.testing.assert_equal(func(matrix, Idx, vector), np.einsum('ijkl, ijkl -> ij', matrix, matVec))

Performance and further optimizations

The above Numba code should be memory-bound when dealing with complex numbers. Indeed, matrix and Idx are big and must be completely read from the relatively slow RAM. matrix has a size of 41*41*92*92*8*2 = 217.10 MiB, and Idx a size of either 41*41*92*92*8 = 108.55 MiB or 41*41*92*92*4 = 54.28 MiB depending on the target platform (it should be of type int32 on most x86-64 Windows platforms and int64 on most x86-64 Linux platforms). This is also why vector[Idx] was slow: NumPy needs to write a big array to memory (not to mention that writing data is typically about twice as slow as reading it on x86-64 platforms in this case).
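
These figures are easy to check on the realistic example above (the number for Idx depends on the platform's default integer type):

print(matrix.nbytes / 2**20)  # 41*41*92*92 complex128 values -> ~217.10 MiB
print(Idx.nbytes / 2**20)     # ~108.55 MiB (int64) or ~54.28 MiB (int32)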

Assuming the code is memory-bound, the only way to speed it up is to reduce the amount of data read from RAM. This can be achieved by storing Idx in a uint16 array instead of the default np.int_ datatype (which is 2 to 4 times bigger). This is possible since vector.size is small. On Linux with an i5-9600KF processor and 38-40 GiB/s RAM, this optimization resulted in a ~29% speed-up while the code remained mainly memory-bound. The implementation is nearly optimal on this platform.
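
A minimal sketch of that dtype change, reusing the func defined above (the cast is valid here because np.max(Idx) is far below 2**16):

assert np.max(Idx) < 2**16         # uint16 can represent every index
Idx16 = Idx.astype(np.uint16)      # one-off conversion, 2-4x less data to stream
res = func(matrix, Idx16, vector)  # Numba compiles a new specialization for uint16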

  • What would you do to optimize the `numba` implementation? – norok2 Sep 13 '22 at 13:27
  • @norok2 I suppose you could try triggering compilation with AVX calls for the trailing axis. But that almost never works for me. I usually only implement the trivial (parallel) loop solution. – Michael Szczesny Sep 13 '22 at 13:54
  • Good implementation! I added a note on how to speed it up. One should care about the NUMA policy on computing servers, as the default policy is generally to allocate pages locally on first touch. Since NumPy fills arrays mostly serially, the Numba code will use the RAM of only 1 NUMA node, while computing servers generally have multiple NUMA nodes. `numactl --interleave=XXX` can be used to mitigate this problem (see [this](https://stackoverflow.com/questions/62604334)), where `XXX` can be `0,1` for a platform with 2 NUMA nodes (see `numactl --show`). – Jérôme Richard Sep 13 '22 at 18:48
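
To illustrate the first-touch point from the last comment, here is a sketch of my own (the helper name is hypothetical): fill a large array from a parallel Numba loop so that its pages are faulted in by several threads rather than serially by NumPy.

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def first_touch(flat, value):
    # each thread writes (and therefore first-touches) its own chunk of pages
    for i in nb.prange(flat.size):
        flat[i] = value

buf = np.empty(10**8)          # virtual allocation only, pages not yet touched
first_touch(buf.ravel(), 0.0)  # pages now spread across the touching threads' nodes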