
I have the following benchmarking script:

import numpy as np
import timeit
import matplotlib.pyplot as plt

n1, n2, n3, n4, n5, m = (101, 33, 1, 32, 2, 32)


def make_arrays(aOrder, bOrder, cOrder):
    a = np.random.randn(n1, m) + 1j * np.random.randn(n1, m)
    b = np.random.randn(n2, m) + 1j * np.random.randn(n2, m)
    c = np.random.randn(n1, n2, n3, n4, n5) + 1j * np.random.randn(n1, n2, n3, n4, n5)
    return (
        np.array(a, order=aOrder),
        np.array(b, order=bOrder),
        np.array(c, order=cOrder),
    )


# used in B()
blockSize = 84

resA = []
resB = []
resC = []

sizes = np.unique(np.exp(np.linspace(2, 6, 8)).astype(np.int64))
numTrials = 10

# overrides m from above
for m in sizes:
    a, b, c = make_arrays("F", "F", "F")

    path = np.einsum_path(
        a,
        [0, 5],
        b,
        [1, 5],
        c,
        [0, 1, Ellipsis],
        [Ellipsis, 5],
        optimize="greedy",
    )

    def A():
        np.einsum(
            a,
            [0, 5],
            b,
            [1, 5],
            c,
            [0, 1, 2, 3, 4],
            [5, 2, 3, 4],
            optimize="greedy",
            order="F",
        )
        # print("einsum\n", res.flags)

    def B():
        numBlocks = int(a.shape[1] // blockSize)
        np.concatenate(
            tuple(
                np.einsum(
                    c,
                    [1, 2, Ellipsis],
                    a[:, kk * blockSize : (kk + 1) * blockSize],
                    [1, 0],
                    b[:, kk * blockSize : (kk + 1) * blockSize],
                    [2, 0],
                    [0, Ellipsis],
                    optimize="greedy",
                    order="F",
                )
                for kk in range(numBlocks)
            )
            + (
                np.einsum(
                    c,
                    [1, 2, Ellipsis],
                    a[:, numBlocks * blockSize :],
                    [1, 0],
                    b[:, numBlocks * blockSize :],
                    [2, 0],
                    [0, Ellipsis],
                    optimize="greedy",
                    order="F",
                ),
            ),
            axis=0,
        )

    def C():
        tmp = np.einsum(a, [0, 5], c, [0, 1, 2, 3, 4], [1, 2, 3, 4, 5], order="F")
        np.einsum(b, [1, 5], tmp, [1, 2, 3, 4, 5], [2, 3, 4, 5], order="F")

    A()
    B()
    C()

    measured = np.mean(timeit.repeat(A, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resA.append(measured)

    measured = np.mean(timeit.repeat(B, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resB.append(measured)

    measured = np.median(timeit.repeat(C, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resC.append(measured)

plt.figure()
plt.semilogy(sizes, resA, label="A")
plt.semilogy(sizes, resB, label="B")
plt.semilogy(sizes, resC, label="C")

plt.legend()
plt.show()

I think one only needs numpy and matplotlib to run it.

Approach A is my naive way of using einsum, since I expected it to work well. After this code had been sitting in my codebase for a while, I noticed that beyond a certain size of m the computation just gets very slow. Hence, I went ahead and implemented B, which produces satisfactory results across the board, but for small m it loses against A. Also, no matter how I toss and turn the memory layout of the input and output arrays, I see no noticeable difference in the qualitative behavior.

Just to retain my sanity, I went ahead and tried an even more naive variant, C, which is extremely slow, as expected.

On a very powerful AMD EPYC 7343 16-core processor with numpy from Intel Python, i.e. using MKL, where I have forced MKL to use only one core for the computation, I get the following result:

[Plot: runtime per slice vs. m for A, B and C on the EPYC 7343 with single-threaded MKL]

Essentially I divide the runtime by the problem size m to get an estimate of the cost of a single "slice" of the problem.

To rule out anything specific to that CPU or to MKL, I first used my MacBook Air with the M2 chip with a vanilla numpy:

[Plot: runtime per slice vs. m for A, B and C on the M2 with vanilla numpy]

and then also with a numpy build linked against the Accelerate BLAS library to make use of the GPU, or whatever, I don't really care. Then I get

[Plot: runtime per slice vs. m for A, B and C on the M2 with Accelerate-linked numpy]

So my question is: what the effing heck is wrong with A?

As a kind of off-topic sidenote: I am also running the same code with cupy from time to time, and there this behavior is not visible: the corresponding version of A scales nicely, at least for the exact same problem sizes as used above.
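
A minimal sketch of a cupy variant of A, assuming cupy is installed and reusing a, b, c from the script above (the string subscripts "af,bf,abcde->fcde" are equivalent to the interleaved index lists used in A):

import cupy as cp

a_gpu, b_gpu, c_gpu = cp.asarray(a), cp.asarray(b), cp.asarray(c)
res_gpu = cp.einsum("af,bf,abcde->fcde", a_gpu, b_gpu, c_gpu)
cp.cuda.Stream.null.synchronize()  # kernels launch asynchronously; synchronize before timing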

Another off-topic sidenote: If I use opt_einsum for the contraction, basically with the code of A, I get something similar to B from above, but slightly slower.
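
For completeness, a minimal sketch of that opt_einsum variant, assuming opt_einsum is installed and reusing a, b, c from the script above:

import opt_einsum as oe

res = oe.contract("af,bf,abcde->fcde", a, b, c)

# or build the contraction expression once and reuse it for arrays of the same shapes
expr = oe.contract_expression("af,bf,abcde->fcde", a.shape, b.shape, c.shape)
res = expr(a, b, c)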

  • As originally written, `einsum` constructed an `nditer` problem with an iteration space involving all dimensions, which could easily get way too big. Now with the `path` and `greedy` breakdown, it becomes a sequence of calls to `matmul`. And as other SO answers have noted, if the memory use is too large, breaking the `matmul` into batches can be a time saver. There's a tradeoff between a 'few' Python-level iterations and manipulating very large arrays. – hpaulj Jul 17 '23 at 00:17
  • @hpaulj: Is the "too large" problem slow because the working set exceeds L2 or L3 cache size or something? The time per unit of work jumping up like that to a plateau indicates that something new became the bottleneck, and exceeding a cache size is always a prime suspect when we're talking about making data larger. (I don't actually know what `einsum` calculates, so I'm just idly curious.) – Peter Cordes Jul 17 '23 at 00:28
  • @PeterCordes, others can comment about the cache issue better than I can. – hpaulj Jul 17 '23 at 00:37
  • For the record, EPYC 7343 is a 16c/32t Zen 3 with a total of 128 MiB of L3 cache. But that's 4 CCDs each with their own 32 MiB of L3 cache shared by the 4 cores per CCD. (https://www.techpowerup.com/cpu-specs/epyc-7343.c2388 / https://en.wikichip.org/wiki/amd/microarchitectures/zen_3#Architecture). Since the OP is testing single-threaded code, the total L3 cache it can benefit from is 32 MiB. (Zen 3 can have up to 8 cores per CCD, but this chip has less so there's more L3 per core, at the expense of smaller groups of cores that can communicate with the lowest latency.) – Peter Cordes Jul 17 '23 at 00:49
  • I understand that cache misses might be a problem but usually they show in such plots as different slopes depending on the problem size. Also I suspect that the caches of the M2 and the EPYC should not be of the same size... – Labello Jul 17 '23 at 01:19
  • Right, M2 (non-Pro non-Ultra) has a 16MiB L2 cache (shared by the four big cores: https://www.notebookcheck.net/Apple-M2-Processor-Benchmarks-and-Specs.632312.0.html / https://en.wikipedia.org/wiki/Apple_M2), only half the size of the L3 cache for one CCD of your Zen 3. So maybe an algorithmic crossover point, like matmul O(n^3) dominating whatever effects mattered more at small sizes. – Peter Cordes Jul 17 '23 at 02:12
  • It seems that `A` runs slow when `m` is greater than `n1`. If `n1` is set to 200, `A` will run faster than `B` until `m=200`. I do not know what this means as I have no knowledge of einsum. But this doesn't sound like a CPU cache related issue. – ken Jul 17 '23 at 06:40
  • Ah nice! I never looked into that. Interesting. @ken – Labello Jul 17 '23 at 11:06
  • That `A` jump might be due to a change in the `greedy` path. I tested `optimize='greedy'` with `m=300, n1=101`, and found that it changed the path to the same thing as `optimize=False`. My interpretation is that it no longer tries to split the calculation into two `matmul` calls. My `matmul` equivalent does not see an unexpected jump in evaluation time. So either use a fixed path (a sketch of this follows after these comments), or explore using `matmul` directly. `einsum` is doing too much "optimization" under the covers to be predictable. – hpaulj Jul 17 '23 at 19:44
  • Thanks. I marked your response below as an accepted answer. :) Also this might explain why opt_einsum is behaving as it does, since I expect it to conduct calculations the way you do, which puts it somewhere in the range of `B`. – Labello Jul 18 '23 at 12:18
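
A minimal sketch of the fixed-path suggestion from the comments, reusing the variable names of the question's script (`A_fixed` is a hypothetical name; the path is computed once with `np.einsum_path` and then passed to `einsum`, so the strategy cannot silently change as m grows):

# compute the contraction path once for the current shapes
path, path_info = np.einsum_path(
    a, [0, 5], b, [1, 5], c, [0, 1, 2, 3, 4], [5, 2, 3, 4], optimize="greedy"
)

def A_fixed():
    # same contraction as A, but with the precomputed path instead of "greedy"
    np.einsum(
        a, [0, 5], b, [1, 5], c, [0, 1, 2, 3, 4], [5, 2, 3, 4],
        optimize=path, order="F",
    )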

1 Answer


Just to help me picture the action, the first try, with m=100, is:

        np.einsum(
    ...:             a,
    ...:             [0, 5],
    ...:             b,
    ...:             [1, 5],
    ...:             c,
    ...:             [0, 1, 2, 3, 4],
    ...:             [5, 2, 3, 4],
    ...:             #optimize="greedy",
    ...:         ).shape
Out[15]: (100, 1, 32, 2)

The default greedy optimization shows the same thing:

In [17]:         print(np.einsum_path(
    ...:             a,
    ...:             [0, 5],
    ...:             b,
    ...:             [1, 5],
    ...:             c,
    ...:             [0, 1, 2, 3, 4],
    ...:             [5, 2, 3, 4],
    ...:             optimize="greedy",
    ...:         )[1])
  Complete contraction:  af,bf,abcde->fcde
         Naive scaling:  6
     Optimized scaling:  6
      Naive FLOP count:  6.399e+07
  Optimized FLOP count:  4.308e+07
   Theoretical speedup:  1.485
  Largest intermediate:  2.112e+05 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   6             abcde,af->cedbf                           bf,cedbf->fcde
   5              cedbf,bf->fcde                               fcde->fcde

With optimize=False (none):

  Complete contraction:  af,bf,abcde->fcde
         Naive scaling:  6
     Optimized scaling:  6
      Naive FLOP count:  6.399e+07
  Optimized FLOP count:  6.399e+07
   Theoretical speedup:  1.000
  Largest intermediate:  6.400e+03 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   6           abcde,bf,af->fcde                               fcde->fcde

I picture this as a double matmul with `f` as the 'batch' dimension, `a` and `b` the two sum-of-product dimensions, and `cde` reducible to 1 dimension ('going along for the ride').

With optimize=False, the timeit is

242 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

with greedy:

21.5 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The equivalent double matmul times about the same:

In [41]: timeit res=b.T[:,None,:]@(a.T@c.reshape(c.shape[0],-1)).reshape(m,c.shape[1],-1
    ...: )
22.3 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here `res.shape` is (100, 1, 64).
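
As a quick sanity check (a sketch, with a, b, c built as in the question and m = 100), the double matmul reproduces the einsum result once the trailing cde axes are flattened:

ref = np.einsum(a, [0, 5], b, [1, 5], c, [0, 1, 2, 3, 4], [5, 2, 3, 4])
res = b.T[:, None, :] @ (a.T @ c.reshape(c.shape[0], -1)).reshape(m, c.shape[1], -1)
print(np.allclose(ref.reshape(m, 1, -1), res))  # expected: True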

I wonder if this is a more straightforward matmul dimension layout?

'cde1ab,fb1->cdefa1; cdefa1,fa1->cdef11'

In [96]: c1 = c.transpose((2,3,4,0,1))[...,None,:,:]
In [97]: a1 = a.T[...,None]; b1 = b.T[...,None]
In [98]: res3 = ((c1@b1)[...,None,:,0]@a1)[...,0,0]
In [99]: res3.shape
Out[99]: (1, 32, 2, 100)

Larger `m`

If I change a and b to m=300, I get a different einsum_path

In [106]: print(np.einsum_path(
     ...:     abig,
     ...:     [0, 5],
     ...:     bbig,
     ...:     [1, 5],
     ...:     c,
     ...:     [0, 1, 2, 3, 4],
     ...:     [5, 2, 3, 4],
     ...:     optimize="greedy",
     ...: )[1])
  Complete contraction:  af,bf,abcde->fcde
         Naive scaling:  6
     Optimized scaling:  6
      Naive FLOP count:  1.920e+08
  Optimized FLOP count:  1.920e+08
   Theoretical speedup:  1.000
  Largest intermediate:  1.920e+04 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   6           abcde,bf,af->fcde  

In [111]: %%timeit
     ...:         np.einsum(
     ...:             abig,
     ...:             [0, 5],
     ...:             bbig,
     ...:             [1, 5],
     ...:             c,
     ...:             [0, 1, 2, 3, 4],
     ...:             [5, 2, 3, 4],
     ...:             optimize=True,
     ...:         )
     ...: 
464 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The matmul equivalent only jumps 3x, i.e. linearly in m:

In [112]: a11 = abig.T[...,None]; b11 = bbig.T[...,None]
In [113]: timeit ((c1@b11)[...,None,:,0]@a11)[...,0,0].shape
74.9 ms ± 637 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [114]: a11.shape,b11.shape
Out[114]: ((300, 101, 1), (300, 33, 1))
hpaulj
  • Could you sum up what you wrote? I have trouble seeing the bigger picture here. – dankal444 Jul 17 '23 at 10:34
  • Same for me. I wonder what the take home message is now. – Labello Jul 17 '23 at 12:24
  • I was trying to understand your problem, stripping it down to more basic `matmul` like operations. Your `c` is really just a 3d array; also I'm more used to using the 'ijk' einsum notation. As for the speed issue, I wasn't interested in tying up my machine on some very large time consuming calculations. If it contributes nothing to your - or our collective - understanding, I can delete it. – hpaulj Jul 17 '23 at 15:36
  • For `matmul` purposes I'm moving the `m` (batch) dimension to the start. The canonical `matmul` calculation is, in einsum notation, `kmn,knp->kmp`, where `k` is the batch dimension and `n` is the sum-of-products dimension. That's what's passed to the BLAS routines for maximum speed. In general, the easier it is to pass arrays to the BLAS code, the faster the calculation will be (a tiny standalone illustration of this layout follows after these comments). – hpaulj Jul 17 '23 at 15:42
  • I tested `a` and `b` arrays that are (n,300) in size, (`m>n1`). The 'greedy' path no longer splits the calculation into two steps, and timing is in the ball park of `optimization=False`. My `matmul` just increases 3x from the (n,100) case. – hpaulj Jul 17 '23 at 19:33
  • @hpaulj: I did some research but could not find anything helpful regarding the optimal memory layout of the array that is fed into batched GEMM. Does the underlying BLAS call assume C-style or Fortran-style layout? – Labello Jul 18 '23 at 00:00
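
A tiny standalone illustration of the `kmn,knp->kmp` batched-matmul layout mentioned in the comments (toy shapes, unrelated to the question's arrays):

import numpy as np

k, m, n, p = 4, 5, 6, 7            # k: batch dimension, n: summed dimension
x = np.random.randn(k, m, n)
y = np.random.randn(k, n, p)
out = x @ y                        # matmul broadcasts over the leading batch axis
assert np.allclose(out, np.einsum("kmn,knp->kmp", x, y))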