
I would like to perform addition and multiplication on top of array transposes.

Given an array A, I want to compute sum += p(A) * factor(p)

where p is a permutation/transpose and factor(p) is the prefactor associated with p. For example, if A is a 2-dimensional array, the target is

0.1*A + 0.1*transpose(A,(1,0))
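
A minimal sketch of this 2-D target in NumPy (purely illustrative; the real case below is 5-dimensional):

import numpy as np

A = np.random.random((3, 3))
target = 0.1 * A + 0.1 * np.transpose(A, (1, 0))   # 0.1*A + 0.1*A.T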

I found in my Python code that, when higher-dimensional permutations are applied, the time for the array addition is far larger than for the transpose. Perhaps numpy.transpose only manipulates pointers/strides in C. Is there any method to optimize the timing of the array-addition part? numpy.add does not help much. Should I somehow sum only the part of the transpose that actually moves data, and handle the rest with multiplication? For example, for the permutation (0,1,3,2), the (0,1) part just multiplies a common factor on top of the previous array. Or should I use Cython to improve the performance?

Here is my Python code

import numpy as np
import time
import itertools as it

ref_list = [0, 1, 2, 3, 4]
p = it.permutations(ref_list)
transpose_list = tuple(p)
print(type(transpose_list),type(transpose_list[0]),transpose_list[0])


n_loop = 2
na = nb = nc = nd = ne = 20
A = np.random.random((na,nb,nc,nd,ne))
sum_A = np.zeros((na,nb,nc,nd,ne))
factor_list = [i*0.1 for i in range(120)]  # one prefactor per permutation (5! = 120)

time_transpose = 0
time_add = 0
time_multiply = 0

for n in range(n_loop):
    for m, t in enumerate(transpose_list):
        start = time.time()
        B = np.transpose(A, transpose_list[m])   # transpose (returns a strided view)
        finish = time.time()
        time_transpose += finish - start

        start = time.time()
        B_p = B * factor_list[m]                 # multiply by the prefactor
        finish = time.time()
        time_multiply += finish - start

        start = time.time()
        sum_A += B_p                             # accumulate
        finish = time.time()
        time_add += finish - start

print(time_transpose, time_multiply, time_add, time_multiply/time_transpose, time_add/time_transpose)  

The output is

0.004961967468261719 1.1218750476837158 3.7830252647399902 226.09480107630213 762.404285988852

The addition takes roughly 700 times longer than the transpose.

I tried to use the Numba transpose from How to avoid huge overhead of single-threaded NumPy's transpose?

by adding


import numba as nb

@nb.njit('void(float64[:,::1], float64[:,::1])', parallel=True)
def transpose(mat, out):
    # blocked 2-D transpose: writes mat.T into the preallocated array `out`
    blockSize, tileSize = 256, 32  # To be tuned
    n, m = mat.shape
    assert blockSize % tileSize == 0
    for tmp in nb.prange((m+blockSize-1)//blockSize):
        i = tmp * blockSize
        for j in range(0, n, blockSize):
            tiMin, tiMax = i, min(i+blockSize, m)
            tjMin, tjMax = j, min(j+blockSize, n)
            for ti in range(tiMin, tiMax, tileSize):
                for tj in range(tjMin, tjMax, tileSize):
                    out[ti:ti+tileSize, tj:tj+tileSize] = mat[tj:tj+tileSize, ti:ti+tileSize].T

and use

B = transpose(A, transpose_list[m] )

received

Traceback (most recent call last):
  File "transpose_test_v2.py", line 46, in <module>
    B = transpose(A, transpose_list[m] )
  File "/home/.../lib/python3.8/site-packages/numba/core/dispatcher.py", line 717, in _explain_matching_error
    raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(float64, 6d, C), UniTuple(int64 x 6)

or using B = nb.transpose(A, transpose_list[m] ) and got

    B = nb.transpose(A, transpose_list[m] )
AttributeError: 'int' object has no attribute 'transpose'
Geositta
2 Answers


np.transpose does not physically transpose the array in memory. It changes the strides of the array so that the resulting view is (virtually) transposed. The problem is that the memory layout of this view is very inefficient, so operations on it are slow. You can make a copy of the transposed view, or call np.ascontiguousarray on it, to perform the transposition eagerly. Be aware that the current NumPy implementation of copying transposed arrays is rather inefficient.
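
For instance, a minimal sketch of the idea using the variable names from the question:

# eagerly materialize the permuted data in contiguous memory, so the
# subsequent multiply/add run over a cache-friendly layout
B = np.ascontiguousarray(np.transpose(A, transpose_list[m]))
# or equivalently:
# B = np.transpose(A, transpose_list[m]).copy()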

Here are results on my machine:

Initial code:
0.0030066967010498047 1.2316536903381348 1.971841812133789 409.6368249940528 655.8166679882642

With a copy of the transposed array:
2.483657121658325 1.221949577331543 0.6028566360473633 0.4919960837893974 0.2427294133277298

Besides this, preallocating the arrays and using the `out` parameter of np.add and np.multiply should help to make the multiply and the add faster. For even better performance, you can use a parallel Numba code with explicit loops. Optimizing the transposition itself in your case is pretty hard due to the high number of dimensions.
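
For example, something along these lines (a sketch only; the preallocated buffer `B_p` reuses the name from the question):

# preallocate the work buffer once, outside the loops
B_p = np.empty_like(A)

for n in range(n_loop):
    for m, t in enumerate(transpose_list):
        B = np.transpose(A, t)                    # still just a view
        np.multiply(B, factor_list[m], out=B_p)   # writes into the preallocated buffer
        np.add(sum_A, B_p, out=sum_A)             # in-place accumulation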


Update:

Here is a faster Numba implementation that mostly removes the cost of the add/multiply and reduces the cost of the transposition by using multiple threads:

# Same setup as in the question (this snippet assumes `import numba`)

@numba.njit('(float64[:,:,:,:,::1], float64[:,:,:,:,::1], float64, int_, int_, int_, int_, int_)', parallel=True, fastmath=True)
def computeRound(A, sum_A, factor, i0, i1, i2, i3, i4):
    size = 20
    B = np.transpose(A, (i0, i1, i2, i3, i4))
    for j0 in numba.prange(size):  # parallel loop over the first output axis
        for j1 in range(size):
            for j2 in range(size):
                for j3 in range(size):
                    for j4 in range(size):
                        sum_A[j0, j1, j2, j3, j4] += B[j0, j1, j2, j3, j4] * factor

for n in range(n_loop):
    for m, t in enumerate(transpose_list):
        i0, i1, i2, i3, i4 = transpose_list[m]
        computeRound(A, sum_A, factor_list[m], i0, i1, i2, i3, i4)

This code is 4 times faster than the initial one. It is almost entirely bound by the poor memory-access pattern of the transposition. However, the transposition is hard to optimize. One solution is to apply the transpositions in an order that maximizes cache locality relative to the previous transposition (fast, but especially hard to implement). A simpler (but slower) option is to manually handle the (i0, i1, i2, i4, i3) case at the same time as the (i0, i1, i2, i3, i4) case, using the previously-provided optimized Numba transposition code.
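
One possible way to exploit this pairing (a sketch of the idea only; `computeRoundPair` and `perm_index` are illustrative names, and folding the pair into a single kernel is just one interpretation of the suggestion above) is to handle the permutation that merely swaps the last two axes in the same kernel, since its transposed view is the current B with the last two indices exchanged:

@numba.njit('(float64[:,:,:,:,::1], float64[:,:,:,:,::1], float64, float64, int_, int_, int_, int_, int_)', parallel=True, fastmath=True)
def computeRoundPair(A, sum_A, factor, factor_swapped, i0, i1, i2, i3, i4):
    size = 20
    B = np.transpose(A, (i0, i1, i2, i3, i4))
    for j0 in numba.prange(size):
        for j1 in range(size):
            for j2 in range(size):
                for j3 in range(size):
                    for j4 in range(size):
                        v = B[j0, j1, j2, j3, j4]
                        # contribution of the permutation (i0, i1, i2, i3, i4)
                        sum_A[j0, j1, j2, j3, j4] += v * factor
                        # contribution of (i0, i1, i2, i4, i3): its transposed
                        # view is B with the last two indices exchanged
                        sum_A[j0, j1, j2, j4, j3] += v * factor_swapped

perm_index = {t: m for m, t in enumerate(transpose_list)}
for n in range(n_loop):
    for m, t in enumerate(transpose_list):
        i0, i1, i2, i3, i4 = t
        if i3 > i4:
            continue  # already handled together with its (i0, i1, i2, i4, i3) partner
        m_swapped = perm_index[(i0, i1, i2, i4, i3)]
        computeRoundPair(A, sum_A, factor_list[m], factor_list[m_swapped], i0, i1, i2, i3, i4)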

Jérôme Richard
  • Thanks a lot. I added `A = np.ascontiguousarray(A, dtype=np.float32)` above the loops and `B = np.ascontiguousarray(B, dtype=np.float32)` right after the transpose, and I got output similar to yours. The addition time went down, but the `B = np.ascontiguousarray(B, dtype=np.float32)` line takes a similar (slightly smaller) amount of time, so the total time does not decrease much. Any better method? Would loading C/Fortran help to optimize the addition/multiplication part? `np.add(sum_A, B_p, out=sum_A)` does not help :( – Geositta Feb 20 '22 at 14:51
  • This is normal: `ascontiguousarray` is not meant to be much faster, as that is not the purpose of using it. The point was just to show that your measurements are biased by the way the transposition works. The second part of the answer points out some ways to get faster results. As for the transposition, the Numba solution I pointed out (see the answer's link) should make it faster. Note that recent versions of Numba optimize the copy of transposed arrays, as opposed to NumPy, so it should be faster to use it with a small effort. Still, transpositions are very expensive operations (like matrix multiplications). – Jérôme Richard Feb 20 '22 at 15:08
  • Thanks a lot. I tried to use `numba` and did not get it to work :( Edited into the OP. – Geositta Feb 20 '22 at 15:34
  • It is totally normal that the Numba code does not work in your case. The Numba code is meant to work on a 2D array while you work on a 5D array (see the explicit signature and the linked question). You need to adapt the code, not just copy-paste it. This is clearly not a simple thing to do, though, because of the high number of dimensions. – Jérôme Richard Feb 20 '22 at 15:37
  • @Geositta I implemented a significantly faster version using Numba. Not a silver bullet, but better. I also provided tips to optimize this further. – Jérôme Richard Feb 20 '22 at 16:19
  • Thank you so much! It got faster by a factor of 4, and by a factor of 3 if I choose `parallel=False`! I will read your code to get a better understanding. – Geositta Feb 20 '22 at 17:28
  • Thanks again. By the way, I tried a hybrid approach: I kept the `for j0 in numba.prange(size):` loop and used `np.add(sum_A[j0, :, :, :], B[j0, :, :, :] * factor, out = sum_A[j0, :, :, :])`, and got a `During: lowering "$130call_function_kw.55 = call $44load_attr.3($70binary_subscr.19, $100binary_multiply.37, func=$44load_attr.3, args=[Var($70binary_subscr.19, tran ...` error message. Is it possible to use `np.add` inside Numba? If this comment is more suitable as a separate question, just let me know. Thanks – Geositta Feb 25 '22 at 06:13
  • I think the `out` parameter is not supported yet. Keep in mind that Numba likes basic loops (like the processor). Anyway, `B[j0, :, :, :] * factor` will create a pretty big temporary array, which slows down the computation a lot (especially on machines with many cores, since the RAM will be saturated by only a few cores) and requires significantly more memory. Thus, this is not a good idea. – Jérôme Richard Feb 25 '22 at 22:51

Your error message looks like it comes from the `(m+blockSize-1)//blockSize` part, where the division is giving you a double-precision float. Instead of dividing by blockSize (256), you can right-shift by 8 bits and keep everything an integer.
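
That is, something like the following (a minimal sketch; `m` and `blockSize` stand for the values used in the question's code):

blockSize = 256
m = 1000
# for non-negative integers, right-shifting by 8 bits is the same as dividing by 256
nBlocks = (m + blockSize - 1) >> 8   # instead of (m + blockSize - 1) // blockSize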

huseyin tugrul buyukisik