I have the following benchmarking script:
import numpy as np
import timeit
import matplotlib.pyplot as plt
n1, n2, n3, n4, n5, m = (101, 33, 1, 32, 2, 32)
def make_arrays(aOrder, bOrder, cOrder):
    a = np.random.randn(n1, m) + 1j * np.random.randn(n1, m)
    b = np.random.randn(n2, m) + 1j * np.random.randn(n2, m)
    c = np.random.randn(n1, n2, n3, n4, n5) + 1j * np.random.randn(n1, n2, n3, n4, n5)
    return (
        np.array(a, order=aOrder),
        np.array(b, order=bOrder),
        np.array(c, order=cOrder),
    )
# used in B()
blockSize = 84
resA = []
resB = []
resC = []
sizes = np.unique(np.exp(np.linspace(2, 6, 8)).astype(np.int64))
numTrials = 10
# overrides m from above
for m in sizes:
    a, b, c = make_arrays("F", "F", "F")
    # precomputed contraction path (note: not reused below, A() just asks for "greedy" again)
    path = np.einsum_path(
        a,
        [0, 5],
        b,
        [1, 5],
        c,
        [0, 1, Ellipsis],
        [Ellipsis, 5],
        optimize="greedy",
    )

    # A: a single einsum over the full m axis
    def A():
        np.einsum(
            a,
            [0, 5],
            b,
            [1, 5],
            c,
            [0, 1, 2, 3, 4],
            [5, 2, 3, 4],
            optimize="greedy",
            order="F",
        )
        # print("einsum\n", res.flags)
    # B: split the m axis into blocks, contract each block, then concatenate
    def B():
        numBlocks = int(a.shape[1] // blockSize)
        np.concatenate(
            tuple(
                np.einsum(
                    c,
                    [1, 2, Ellipsis],
                    a[:, kk * blockSize : (kk + 1) * blockSize],
                    [1, 0],
                    b[:, kk * blockSize : (kk + 1) * blockSize],
                    [2, 0],
                    [0, Ellipsis],
                    optimize="greedy",
                    order="F",
                )
                for kk in range(numBlocks)
            )
            + (
                # remainder: the columns not covered by the full blocks
                np.einsum(
                    c,
                    [1, 2, Ellipsis],
                    a[:, numBlocks * blockSize :],
                    [1, 0],
                    b[:, numBlocks * blockSize :],
                    [2, 0],
                    [0, Ellipsis],
                    optimize="greedy",
                    order="F",
                ),
            ),
            axis=0,
        )
    # C: two chained pairwise einsums (contract c with a first, then with b)
    def C():
        tmp = np.einsum(a, [0, 5], c, [0, 1, 2, 3, 4], [1, 2, 3, 4, 5], order="F")
        np.einsum(b, [1, 5], tmp, [1, 2, 3, 4, 5], [2, 3, 4, 5], order="F")

    # warm-up calls before timing
    A()
    B()
    C()
    measured = np.mean(timeit.repeat(A, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resA.append(measured)
    measured = np.mean(timeit.repeat(B, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resB.append(measured)
    measured = np.median(timeit.repeat(C, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resC.append(measured)
plt.figure()
plt.semilogy(sizes, resA, label="A")
plt.semilogy(sizes, resB, label="B")
plt.semilogy(sizes, resC, label="C")
plt.legend()
plt.show()
I think one only needs numpy and matplotlib to run it.
Approach A is my naive way of using einsum, since I expected it to work well. After this code had been sitting in my codebase for some time, I noticed that beyond a certain size of m the computation gets very slow. Hence, I went ahead and implemented B, which produces satisfactory results across the board, but for small m it loses against A. Also, no matter how I toss and turn this in terms of the memory layout of the input and output arrays, I see no noticeable difference in the qualitative behavior.
Just to retain my sanity, I went ahead and tried an even more naive way, C, which is super slow, as expected.
On a very powerful AMD EPYC 7343 16-Core Processor with the numpy from Intel Python, i.e. using MKL, where I have forced MKL to use only one core for the computation, I get the following result:
Essentially I divide the runtime by the problem size m to get an estimate for the cost of a single "slice" of the problem.
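In case it matters: pinning MKL to a single thread can be done roughly like this (just a sketch, the script above does not include it; it assumes either that the environment variable is set before numpy is imported, or that threadpoolctl is installed):

import os
os.environ["MKL_NUM_THREADS"] = "1"  # has to happen before numpy is imported

# or, at runtime, clamp the BLAS thread pool around the timed calls
from threadpoolctl import threadpool_limits
with threadpool_limits(limits=1, user_api="blas"):
    A()  # any of the benchmark functions above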
To rule out any dependence on the CPU or MKL, I used my MacBook Air with the M2 chip, first with a vanilla numpy, and then with a numpy linked against the Accelerate BLAS library to make use of the GPU, or whatever, I don't really care. Then I get:
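(To double-check which BLAS backend a given numpy build actually uses, np.show_config() prints the build information; the exact output format varies between numpy versions:)

import numpy as np

np.show_config()  # lists the BLAS/LAPACK libraries, e.g. MKL or Accelerate/vecLib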
So my question is: what the effing heck is wrong with A?
As a kind of off-topic side note: I am also running the same code on CuPy from time to time, and there this behavior is not visible; the corresponding version of A scales nicely, at least for the exact same problem sizes as used above.
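For completeness, the CuPy counterpart of A is essentially a drop-in replacement. A rough sketch using the subscript-string form (not byte-for-byte what I run; in particular I have not checked whether cupy.einsum accepts the order keyword, so it is left out):

import cupy as cp

a_gpu, b_gpu, c_gpu = cp.asarray(a), cp.asarray(b), cp.asarray(c)
# same contraction as A: a -> [0, 5], b -> [1, 5], c -> [0, 1, 2, 3, 4], out -> [5, 2, 3, 4]
res_gpu = cp.einsum("ax,bx,abcde->xcde", a_gpu, b_gpu, c_gpu, optimize=True)
cp.cuda.Stream.null.synchronize()  # wait for the kernel so timings are meaningful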
Another off-topic side note: if I use opt_einsum for the contraction, basically with the code of A, I get something similar to B from above, but slightly slower.
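The opt_einsum variant looks roughly like this (a sketch; opt_einsum.contract mirrors the np.einsum interface, so the interleaved list form used in A should work as well):

import opt_einsum as oe

# same contraction as A, written with subscript strings
res = oe.contract("ax,bx,abcde->xcde", a, b, c, optimize="greedy")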