By default, np.einsum does not optimize the computation, since the cost of performing such optimizations can sometimes be bigger than the computation itself (especially for small arrays). You can solve this using:
np.einsum("i, ijk -> jk", B, A, optimize='optimal')
With this, np.einsum internally uses a BLAS implementation (like OpenBLAS, the default on most systems, including mine), which is far more efficient: it should run in parallel and use faster SIMD instructions (especially on x86, with AVX instead of SSE).
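If you want to see what contraction path einsum actually picks, np.einsum_path can report it. Here is a quick optional check (a minimal sketch, using the shapes from the question):
import numpy as np

A = np.random.uniform(size=(48, 1000000, 2))
B = np.random.uniform(size=48)

# Second element is a human-readable report of the chosen path and estimated speedup
path, info = np.einsum_path("i, ijk -> jk", B, A, optimize='optimal')
print(info)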
This operation is mostly memory bound, and the above solution succeeds in reaching a memory throughput of ~30 GiB/s while the peak is about ~40 GiB/s. This is very good (for generic NumPy code) but still sub-optimal.
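As a rough way to reproduce that throughput number, you can divide the bytes that have to travel through memory by the measured time (a crude single-run estimate; the traffic is dominated by reading A once):
import numpy as np
import time

A = np.random.uniform(size=(48, 1000000, 2))
B = np.random.uniform(size=48)

start = time.perf_counter()
C = np.einsum("i, ijk -> jk", B, A, optimize='optimal')
elapsed = time.perf_counter() - start

# Bytes read from A plus bytes written to the (small) output C
bytes_moved = A.nbytes + C.nbytes
print(f"~{bytes_moved / elapsed / 1024**3:.1f} GiB/s")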
Note that the solution of @MichaelSzczesny gives similar performance results (with 64-bit floats).
Defeating OpenBLAS
While OpenBLAS is sub-optimal here, writing faster code is far from easy, as such libraries are aggressively optimized to reach high performance (AFAIK, they make use of architecture-specific kernels and even some hand-written assembly code). That being said, they are written for generic cases and assume the computation is mostly compute bound, which is not the case here.
A solution to defeat OpenBLAS is to write a parallel Numba code solving your specific case, where: B has a small fixed size, the last dimension of A is always 2, and A is relatively big. Numba is a Python module that can compile specific Python code to native binaries using a just-in-time compiler.
The main idea is to virtually split A into blocks along the second dimension and then perform a cache-friendly weighted sum along the first dimension (which is not contiguously stored in memory). The code uses assertions and compile-time constants so as to help the compiler unroll and vectorize loops more aggressively. The output matrix is passed as a parameter of the function so as not to pay the cost of expensive page faults. Here is the final code:
import numpy as np
import numba as nb

block_size = 1024
B_size = 48

@nb.njit(fastmath=True, inline='never')
def compute_block(A, B, out, iStart, iEnd, complete):
    n = A.shape[1]

    # Check everything is Ok and help the compiler
    # to optimize the loops more aggressively.
    assert A.shape[0] == B_size
    assert A.shape[2] == 2
    assert out.shape[1] == 2
    assert B.size == B_size
    assert out.shape[0] == n

    for i in range(iStart, iEnd):
        out[i, 0] = out[i, 1] = 0.0

    for j in range(B_size):
        factor = B[j]

        # Help the compiler to perform (better) loop unrolling
        if complete:
            for i in range(iStart, iStart+block_size):
                out[i, 0] += A[j, i, 0] * factor
                out[i, 1] += A[j, i, 1] * factor
        else:
            for i in range(iStart, iEnd):
                out[i, 0] += A[j, i, 0] * factor
                out[i, 1] += A[j, i, 1] * factor

@nb.njit('void(float64[:,:,::1], float64[::1], float64[:,::1])', parallel=True)
def compute(A, B, out):
    n = A.shape[1]

    # Parallel computation of many blocks
    for bi in nb.prange(n // block_size):
        iStart = bi * block_size
        iEnd = iStart + block_size
        compute_block(A, B, out, iStart, iEnd, True)

    # Remaining part
    bi = n // block_size
    iStart = bi * block_size
    if iStart < n:
        compute_block(A, B, out, iStart, n, False)

# Example of use:
A = np.random.uniform(size=(48, 1000000, 2))
B = np.random.uniform(size=(48))
C = np.empty((A.shape[1], 2), dtype=np.float64)
compute(A, B, C)
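If you want to quickly check the result against NumPy (a small sanity check, not part of the benchmark; it reuses the A, B and C arrays defined above):
ref = np.einsum("i, ijk -> jk", B, A, optimize='optimal')
assert np.allclose(C, ref)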
Benchmark
Here are results on my i5-9600KF processor with a ~40 GiB/s DRAM:
np.einsum("i, ijk -> jk", B, A): 54.4 ms
np.einsum("i, ijk -> jk", B, A, optimize='optimal'): 25.2 ms
compute(A, B, C): 19.5 ms
Theoretical optimal time (lower bound): 18.6 ms
The final implementation succeeds in reaching a memory throughput of 38.2 GiB/s, which is very close to the optimal on my machine.
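For what it is worth, the lower bound can be estimated by hand: the kernel has to read A once and write C once, so the best possible time is roughly the total traffic divided by the peak DRAM bandwidth (a back-of-the-envelope sketch, assuming the output block stays in cache between accumulations):
traffic = 48 * 1_000_000 * 2 * 8 + 1_000_000 * 2 * 8  # bytes read from A + bytes written to C
peak_bandwidth = 40 * 1024**3                          # ~40 GiB/s, machine dependent
print(traffic / peak_bandwidth)                        # ~0.018 s, i.e. close to the 18.6 ms above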
Note that, as pointed out by @MichaelSzczesny, using 32-bit floats results in about 2 times faster timings, since 32-bit floats take 2 times less space in memory and the operation is memory bound. 32-bit floats are less precise though.
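If that precision is acceptable, the switch is straightforward for the einsum version (a quick sketch, reusing A and B from above; for the Numba version, the float64 types in the eager signature would also need to be replaced with float32):
A32 = A.astype(np.float32)
B32 = B.astype(np.float32)
C32 = np.einsum("i, ijk -> jk", B32, A32, optimize='optimal')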