
Given two arrays A and B with 4 indices each, A[i,j,n_i,m] and B[i,j,n_i,m], I need to take the tensor product of them over the first two indices while keeping the remaining two. For this task I use einsum. Moreover, I need to average the result over the last index, for which I just apply .mean() to that axis.

C=np.einsum('ij...,kl...->ikjl...',A[:,:,:,:],B[:,:,:,:]).reshape(1024,1024,len(t_list),Mmax).mean(axis=3)   

While this works, I would like to know if there is a more efficient way of doing this, because right now it is quite slow. I am afraid my bottleneck is the size of the arrays: while the original A and B have shape (32,32,1000,10000), the resulting C has shape (1024,1024,1000), which is quite costly (roughly 8 GB, assuming float64 entries).

However, using einsum and then averaging using .mean(axis=3) is still much faster than looping over Mmax, creating the arrays and then taking the average.

I should mention that what I am really after is the reduced matrix, i.e. the partial traces of this matrix:

import numpy as np
from qutip import Qobj

dim = [2]*(2*5)  # ten two-level subsystems, so C is 2^10 x 2^10
for n_i, n in enumerate(t_list):
    C_s = C[:, :, n_i]
    rho = Qobj(C_s, dims=[dim, dim]).ptrace([0, 1])

If possible, I would like to do this operation without having to allocate C directly, which I think is what makes the whole process slow.

J.Agusti
  • Why are you using `A[:,:,:,:]` and `B[:,:,:,:]`? Do you actually need to copy both matrices before passing them to `np.einsum`? – David Apr 27 '23 at 15:56
  • @David Both A and B are the results of a calculation I do. Basically, they are the solution to some stochastic differential equations, that is why I have 4 indices: 2 for the 2D matrices, 1 for the temporal dimension, and 1 for the number of trajectories. Unfortunately, what I need to do now is the tensor product of those matrices, and for that I use einsum. – J.Agusti Apr 27 '23 at 16:16
  • But why do you copy them, i.e. `A[:,:,:,:]` instead of simply `A`? There's nothing here to indicate why that would be necessary and that will definitely slow things down (maybe not by too much, I don't know). – David Apr 27 '23 at 16:19
  • @David Oh, oh no way.... I actually tried and you are right, I get quite a reduction in my computation time... I thought that A and A[:,:,:,:] were exactly the same objects! Thanks, this small correction helped me a lot! – J.Agusti Apr 27 '23 at 16:32
  • @J.Agusti, as is probably obvious from my answer, I found your question very interesting. I have since been working on a package to do this sort of operation automatically: https://pypi.org/project/einsum-pipe/. I'd be very interested if you'd take a look at it and try using it in whatever you're working on. While I like einsum, I rarely have the chance to use it properly. If you do, I'd love you to try this. Any feedback would be welcome on GitHub! – David May 15 '23 at 02:43

1 Answer


As mentioned in the comments, you were copying the input arrays, which is unnecessary. In addition, as you suggested, it is possible to calculate the partial trace of the outer product directly using np.einsum, without allocating C. It won't help code readability, though...

Looking at the implementation of ptrace and the arguments you passed to it (maybe next time try to distil an external function like this down to its NumPy equivalent yourself - it's probably easier to understand with some context), it appears you will end up calling _ptrace_dense in QuTiP's source, and specifically the else branch of that function:

rhomat = np.trace(Q.full()
    .reshape(rd + rd)
    .transpose(qtrace + [nd + q for q in qtrace] +
                sel + [nd + q for q in sel])
    .reshape([np.prod(dtrace, dtype=np.int32),
            np.prod(dtrace, dtype=np.int32),
            np.prod(dkeep, dtype=np.int32),
            np.prod(dkeep, dtype=np.int32)]))

As you can see, this does some complex reshaping/transposition, then takes the np.trace of the result. You can take advantage of the fact that the trace of an outer product is actually an inner product, both of which are exactly what np.einsum was made for. This removes a large number of multiplications and requires much less memory.
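As a minimal sketch of that identity (my own illustration, with made-up vectors):

import numpy as np

a = np.random.rand(5)
b = np.random.rand(5)
# tr(a b^T) = sum_i a_i * b_i = <a, b>
assert np.isclose(np.trace(np.outer(a, b)), np.inner(a, b))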

You can also sum the final dimension within your np.einsum and divide the output by the length of the final dimension to get rid of your .mean(axis=3).
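A minimal sketch of that equivalence (made-up shapes, not your data):

import numpy as np

arr = np.random.rand(3, 4, 10)
# Summing over the last axis inside einsum and dividing by its length
# reproduces .mean(axis=-1)
assert np.allclose(arr.mean(axis=-1), np.einsum('ijq->ij', arr) / arr.shape[-1])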

In order to do it all in one np.einsum step, you'll have to first reshape one of your input arrays. Ultimately it boils down to:

A = A.reshape([4, 8, 4, 8, len(t_list), Mmax])

X = np.einsum('ikok...q,ll...q->io...', A, B).reshape((4, 4, len(t_list))) / Mmax

X now holds the stacked results of your for loop, i.e. all the ptrace results on the final axis.
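If you want to double-check this against QuTiP, a quick sanity loop (assuming C, t_list and dim are defined as in your question) would be:

for n_i in range(len(t_list)):
    ref = Qobj(C[:, :, n_i], dims=[dim, dim]).ptrace([0, 1]).full()
    assert np.allclose(X[:, :, n_i], ref)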

This transformation isn't particularly intuitive, at least to me. However, the steps to get here are fairly easy, if long-winded.

I'll work through what I did in the specific case: you have two input arrays with shape (32, 32) and want the partial trace over the first two subsets/qubits/whatever. Broadcasting this operation across your other dimensions should be obvious (see the broadcasting notes in the np.einsum documentation if it isn't).
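As a tiny illustration of that ellipsis broadcasting (my own example, with made-up shapes):

import numpy as np

A0 = np.random.rand(3, 3, 7)
B0 = np.random.rand(3, 3, 7)
# The '...' axes are carried through unchanged
C0 = np.einsum('ij...,kl...->ikjl...', A0, B0)
assert C0.shape == (3, 3, 3, 3, 7)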

Summarising what you're trying to do, it boils down to:

C = np.einsum('ij,kl->ikjl', A, B)
# You reshape this output in your code, but it has no effect, because
# ptrace reshapes it again internally anyway
X = C.reshape([2]*20)  # view the 1024x1024 matrix as twenty axes of size 2
# Move the kept axes (0 and 1 on the row side, 10 and 11 on the column side) to the end
Y = X.transpose(list(range(2, 10)) + list(range(12, 20)) + [0, 1, 10, 11])
Z = Y.reshape([256, 256, 4, 4])
output = np.trace(Z)

This can be simplified significantly using the axis1 and axis2 arguments of np.trace (not sure why the QuTiP library doesn't do this):

C = np.einsum('ij,kl->ikjl',A,B)
X = C.reshape([4, 256, 4, 256])
output = np.trace(X, axis1=1, axis2=3)

Replacing np.trace with np.einsum:

C = np.einsum('ij,kl->ikjl', A, B)
X = C.reshape([4, 256, 4, 256])
output = np.einsum('ijkj->ik', X)

Two slightly non-intuitive things that may help explain the next few steps:

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
X = np.einsum('ij,kl->ijkl', A, B)  # Note this is ijkl NOT ikjl
Y = np.einsum('i,j->ij', A.flatten(), B.flatten())
assert np.allclose(X.flatten(), Y.flatten())

and (though not as important):

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
X = np.einsum('ij,kl->ikjl', A, B)
Y = np.einsum('ij,kl->ijkl', A, B).transpose((0, 2, 1, 3)) # Or .swapaxes(1, 2)
assert np.allclose(X.flatten(), Y.flatten())

This first point means that:

A = np.random.rand(32, 32)
B = np.random.rand(32, 32)
C = np.einsum('ij,kl->ikjl', A, B)
A = A.reshape((4, 8, 4, 8))
B = B.reshape((4, 8, 4, 8))
D = np.einsum('ij kl,mn op->ij mn kl op', A, B)
# Spaces have no meaning inside the subscripts; they just help map the dimension
# groups to the previous names, i.e. i=ij, j=kl, k=mn and l=op
assert np.allclose(C.flatten(), D.flatten())

This is helpful because, in order to merge the two np.einsums, you need to be able to split A and B (well, maybe not B, more on that later) into 32x32 for the first np.einsum and 4x256 for the second, as previously shown. As such, A's and B's shapes have to be (4, 8, 4, 8), since 4*8=32, satisfying the first requirement, and 8*4*8=256, satisfying the second. That changes the previous simplification of the problem to:

A = np.random.rand(4, 8, 4, 8)
B = np.random.rand(4, 8, 4, 8)
C = np.einsum('ij kl,mn op->ij mn kl op', A, B)
output = np.einsum('i jmn k jmn->ik', C).reshape((4, 4))

Now all you have to do is merge the two np.einsums. This is easy, since ijmnklop from the first output maps directly to ijmnkjmn of the second input. Therefore:

A = np.random.rand(4, 8, 4, 8)
B = np.random.rand(4, 8, 4, 8)
output = np.einsum('i j kj,m n mn->ik', A, B).reshape((4, 4))

It should be obvious here that B actually doesn't need 4 dimensions, only two. So this finally comes to:

A = np.random.rand(4, 8, 4, 8)
B = np.random.rand(32, 32)
output = np.einsum('i j kj,l l->ik', A, B).reshape((4, 4))

Obviously instead of using np.random like me, reshape your data to the correct shape.
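To convince yourself the simplification is sound, here is an end-to-end check (my own snippet, comparing the merged einsum with the reshape-and-trace pipeline above):

import numpy as np

A2 = np.random.rand(32, 32)
B2 = np.random.rand(32, 32)
# The long-winded pipeline: outer product, reshape, partial trace
C2 = np.einsum('ij,kl->ikjl', A2, B2).reshape([4, 256, 4, 256])
ref = np.trace(C2, axis1=1, axis2=3)
# The merged single einsum
out = np.einsum('ijkj,ll->ik', A2.reshape(4, 8, 4, 8), B2)
assert np.allclose(out, ref)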

This isn't a particularly satisfying explanation as the output code seems completely unrelated to the input code. However, I hope the steps help you understand how you can work through it and generalise it to other np.einsum cases.

Regarding your specific problem of generalising to other subsets, you can reshape A to 2x2x...x2 and select arbitrary indices from there. You don't have to modify B, given the previous (unexpected) simplification. For example, if you wanted 0 and 3 instead of 0 and 1:

A = A.reshape((2,)*10)
B = B.reshape((32, 32))
output = np.einsum('i jk l m o jk p m,z z->i l o p', A, B).reshape((4, 4))

Notice how all dimensions you want to "keep" are named with unique letters and repeated in the output, whereas all dimensions you want to trace are repeated in the input and don't appear in the output. B is simply traced, regardless of which dimensions are chosen.

Finally, you can write a function to generate the subscripts automatically:

from typing import List

def get_subscripts(keep_indices: List[int], length: int, start_at='b', trace_char='a') -> str:
    start = ord(start_at)
    # Row and column halves share labels by default, so they get traced
    in_subs = list(range(start, start + length))*2
    out_subs = [start + k for k in keep_indices]
    for i, k in enumerate(keep_indices):
        # Give each kept column axis a fresh label so it survives into the output
        in_subs[length + k] = start + length + i
        out_subs.append(start + length + i)
    return ''.join(chr(x) for x in in_subs) + ',' + trace_char*2 + '->' + ''.join(chr(x) for x in out_subs)

subsets = [0, 1, 3]
A = A.reshape((2,)*10)
B = B.reshape((32, 32))
output = np.einsum(get_subscripts(subsets, 5), A, B).reshape((2**len(subsets),)*2)
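For reference, the string generated for these subsets comes out as:

print(get_subscripts(subsets, 5))
# 'bcdefghdif,aa->bceghi'

Reading it the same way as before: d (qubit 2) and f (qubit 4) are repeated across the row and column halves, so they are traced out, while the kept qubits get fresh column labels (g, h, i) that appear in the output.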

Fun problem!

Edit: If you want to generalise it to any subsets, you will need to reshape both arrays and modify the subscripts function slightly:

def get_subscripts(keep_indices: List[int], length: int, start_at='a') -> str:
    start = ord(start_at)
    assert all(k < 2*length for k in keep_indices)
    in_subs_1 = list(range(start, start + length))*2
    in_subs_2 = list(range(start + length, start + 2*length))*2
    out_subs = [start + k for k in keep_indices]
    for i, k in enumerate(keep_indices):
        next_sub = start + 2*length + i
        if k < length:
            in_subs_1[length + k] = next_sub
        else:
            in_subs_2[k] = next_sub
        out_subs.append(next_sub)
    return ''.join(chr(x) for x in in_subs_1) + ',' + ''.join(chr(x) for x in in_subs_2) + '->' + ''.join(chr(x) for x in out_subs)

A = A.reshape((2,)*10)
B = B.reshape((2,)*10)

output = np.einsum(get_subscripts(subsets, 5), A, B).reshape((2**len(subsets),)*2)

Basically, just apply the same logic used for the first array to the second. The only other change is to adjust some of the offsets.
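For example, for a subset spanning both subsystems, the generated string comes out as:

print(get_subscripts([0, 9], 5))
# 'abcdekbcde,fghijfghil->ajkl'

Here qubit 0 lives in A and qubit 9 in B, so each array keeps one free row/column pair (k and l on the column side) while all its other axes are traced.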

David
  • Thanks for the detailed answer. I am trying to understand all the small details. First of all, just trying your code gets me an error: 'dimensions in operand 1 for collapsing index 'l' don't match (4 != 8)'. Then, I would like to understand how you are performing the partial trace in your einsum. To take the tensor product, the indices expand as 'ij...,kl...->ikjl...'. I do not fully see this in your answer, unfortunately. – J.Agusti Apr 29 '23 at 10:48
  • Regarding your error, it sounds like B doesn't have the shape you said it had in your question, (32, 32, ...). It sounds like you reshaped it to the same shape as A, which is incorrect. Regarding the partial trace, it should be fairly obvious that trace is an inner product, but if it's not, there's a proof request on the maths SE: https://math.stackexchange.com/questions/476802/how-do-you-prove-that-trbt-a-is-a-inner-product. Regarding the tensor product, I fully get your confusion. If you can confirm that this does indeed get you the answer you need, I'll add an explanation of how I got here. – David Apr 29 '23 at 13:48
  • I tested it and this is exactly what I need. However, there is a small detail. We have assumed here that I perform the partial trace between two subsets. In general, this is not what I will need. Specifically, given the C matrix I defined in my original post, I need to perform this: https://scicomp.stackexchange.com/questions/30052/calculate-partial-trace-of-an-outer-product-in-python (first answer, with the einsum approach). – J.Agusti Apr 29 '23 at 15:33
  • Obviously when you try to optimise something it becomes less flexible. However, if you want to do the partial trace of the **first** n values, it's fairly simple to adapt this answer: just reshape `A` to (2**n, 2**(5-n), 2**n, 2**(5-n)) (in your case, given your original matrix is 2**5=32, i.e. 32x32). It should be simple to generalise this for other sizes, **if you only want to do the first n**. If you want to choose arbitrary indices, it's more complicated. I'll update my answer and indicate where you can do this, but it'll not be simple to follow. – David Apr 29 '23 at 17:09
  • David, thanks again for the effort of teaching me this. It works perfectly. However, after playing with it, I realized that there are things I cannot do. I think we have all this time assumed we take partitions on the subsystem A and trace it, for example, subset=[0,1,3], or subset=[1,4]. I am now trying to extend it to take partitions on both systems A and B. For example, subset=[0,6] or [2,9]. In this case, it is no longer true that taking the inner product is independent on B. Is it enough to just extend the indices in the einsum to include this process? – J.Agusti May 01 '23 at 10:24
  • @J.Agusti I didn't make any assumptions as I actually have very little knowledge of quantum computing (I'm trying to change this!). My answer was purely a "methodical" transformation of a series of einsum operations (which was why the question piqued my interest). Your comment actually clarifies why B could be simplified so much, which puzzled me. However, yes, I think you could just apply the same logic to B. Particularly if you just modify `get_subscripts`, remove the `trace_char` argument, and repeat the logic I implemented for A, I think it should work. Obviously now you have to reshape B. – David May 02 '23 at 01:22
  • I have been trying for a while to modify get_subscripts to get the string needed, but I have failed miserably all this time... Could you help me understand how I would, assuming A = A.reshape((2,)*10) and B = B.reshape((2,)*10), get a string such that, let's say, I have subset=[0,9]? – J.Agusti May 02 '23 at 17:44
  • @J.Agusti I updated my answer. If I were you I'd make sure I actually understood the rest of the process, that last bit shouldn't have been too difficult to implement. `get_subscripts` could probably be cleaned up a bit, but it does what it needs to so that's fine by me. – David May 03 '23 at 03:02