
I have two square matrices, A and B, of the same shape. I would like to generate a third square matrix C of the same shape such that, for example,

C[7,4] = A[0,0]*B[7,4] + A[1,1]*B[6,3] + A[2,2]*B[5,2] + A[3,3]*B[4,1] + A[4,4]*B[3,0]

Another example:

C[5,9] = A[0,0]*B[5,9] + A[1,1]*B[4,8] + A[2,2]*B[3,7] + A[3,3]*B[2,6] + A[4,4]*B[1,5] + A[5,5]*B[0,4]

Both indices of B decrease by 1 with each successive term in the sum, stopping once either index reaches 0. In every term, the indices of A and B sum to the index being populated in C, and the indices of A are restricted to its diagonal. In general, C[i,j] = A[0,0]*B[i,j] + A[1,1]*B[i-1,j-1] + ... + A[m,m]*B[i-m,j-m], where m = min(i,j).

I would ideally like to do this vectorized with NumPy, in the fastest way possible.

hertz
  • Not a full answer but maybe a starting point: I think the `k`-th diagonal for `N * N` matrices can be calculated in one go as `np.cumsum(np.diag(A)[N-abs(k)::-1] * np.diag(B, k))`. See [np.diag](https://numpy.org/doc/stable/reference/generated/numpy.diag.html) for the index `k`. The trouble is vectorizing it further (since each diagonal has a different size) and writing it back (since there is no index `k` in `np.diag_indices`) – Homer512 Aug 07 '23 at 00:02
  • In real life, is this performed multiple times? If so, what commonalities apply between iterations: shape? A value? B value? This is important to produce a solution that takes advantage of preprocessing. – Reinderien Aug 07 '23 at 15:47
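Building on Homer512's comment above: each diagonal of C is a truncated convolution of np.diag(A) with the matching diagonal of B (the quoted cumsum expression computes a running sum of products rather than that convolution), so the per-diagonal idea can be sketched with np.convolve instead. The write-back uses explicit index arrays, since np.diag_indices takes no offset k. A sketch, not a fully vectorized solution:

import numpy as np

def diagdot_per_diagonal(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Each diagonal of C (offset k) is the convolution of diag(A) with
    # diag(B, k), truncated to that diagonal's length.
    n = A.shape[0]
    a = np.diag(A)
    C = np.empty_like(B)
    for k in range(-(n - 1), n):
        d = np.diag(B, k)                      # diagonal of B at offset k
        conv = np.convolve(a, d)[:d.size]      # truncated convolution
        rows = np.arange(d.size) + max(-k, 0)  # row indices along diagonal k
        C[rows, rows + k] = conv
    return C

This still loops over the 2n-1 diagonals in Python, but each diagonal is handled in a single vectorized call.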

2 Answers


One way to get a fully vectorized version (but at the cost of sometimes doing useless multiplications by 0) is to create a 3D array of the subdiagonals of B:

import numpy as np

# Example
n = 10
A = np.arange(n * n).reshape(n, n)
B = np.random.uniform(-1, 1, (n, n))

# C computation
Adiag = np.diag(A)[None, None, :]              # shape (1, 1, n), for broadcasting
Bpadded = np.pad(B, ((n - 1, 0), (n - 1, 0)))  # zeros absorb out-of-range terms
s1, s2 = Bpadded.strides
Bdiags = np.lib.stride_tricks.as_strided(
    Bpadded[n - 1:, n - 1:], shape=(n, n, n), strides=(s1, s2, -s1 - s2)
)                                              # Bdiags[i, j, k] == B[i-k, j-k], or 0
C = (Bdiags * Adiag).sum(axis=2)

# Verification
# it works for your examples
C[7,4],  A[0,0]*B[7,4] + A[1,1]*B[6,3] + A[2,2]*B[5,2] + A[3,3]*B[4,1] + A[4,4]*B[3,0]
# (-47.11186336152282, -47.11186336152282)
C[5,9], A[0,0]*B[5,9] + A[1,1]*B[4,8] + A[2,2]*B[3,7] + A[3,3]*B[2,6] + A[4,4]*B[1,5] + A[5,5]*B[0,4]
# (85.27133136528813, 85.27133136528813)
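A fuller check than two spot values, against a brute-force triple loop (my addition; it reuses n, A, B and C from the snippet above and is essentially the diagdot_slow reference from the other answer):

Cref = np.zeros_like(C)
for i in range(n):
    for j in range(n):
        for t in range(1 + min(i, j)):
            Cref[i, j] += A[t, t] * B[i - t, j - t]
assert np.allclose(C, Cref)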

Some explanation: Adiag is easy. It is just the diagonal of A, reshaped into a fake 3D array whose only size-n axis is axis 2 (just for later broadcasting).
Bpadded is just a copy of B, with enough zeros padded around it.

And the tricky part is Bdiags. It is a 3D array whose data are those of Bpadded (no memory copy; it is the same memory address), but with different strides. Since in an array, arr[i1,i2,i3] is the data at address addressOf(arr[0,0,0]) + s1*i1 + s2*i2 + s3*i3, in my case Bdiags[i1,i2,i3] is the data at address addressOf(Bpadded[n-1,n-1]) + i1*s1 + i2*s2 + i3*(-s1-s2), i.e. addressOf(Bpadded[n-1,n-1]) + (i1-i3)*s1 + (i2-i3)*s2. So it is Bpadded[n-1+i1-i3, n-1+i2-i3], which is B[i1-i3, i2-i3], except that it reads a padding zero when i1-i3 or i2-i3 is negative.

So Bdiags[:, :, k] is just what you want to multiply by A[k,k].
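To make the address arithmetic concrete, a quick spot check (my addition, reusing n, Bpadded and Bdiags from the snippet above):

# Spot-check: Bdiags[i1, i2, i3] == Bpadded[n-1 + i1-i3, n-1 + i2-i3]
for i1, i2, i3 in [(7, 4, 2), (5, 9, 5), (3, 3, 3)]:
    assert Bdiags[i1, i2, i3] == Bpadded[n - 1 + i1 - i3, n - 1 + i2 - i3]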

The rest is simple broadcasting, multiplication, and summation.
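If the temporary (n, n, n) product array is a concern, the multiply-and-sum can also be fused with np.einsum (an untested micro-optimisation, not part of the timings discussed below):

# Multiply and reduce in one pass over the strided view
C = np.einsum('k,ijk->ij', np.diag(A), Bdiags)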

chrslg
  • Very neat! However it seems to be slower than @Reinderien's answer – hertz Aug 07 '23 at 15:58
  • @hertz Not on my computer, and for the sizes I've tested. But indeed, the gap is shrinking, not growing, with matrix size. So, I guess, after a while (contrary to what I claimed at first, but I edited before anyone could see it) big n are not in my favor. My first reasoning still holds: the cost specific to this method is at most a factor of 2 on the number of operations, and that does not grow with size. But it is Reinderien's method-specific cost that becomes proportionally less and less important, because only the outer for loop is not vectorized. And the bigger the data (...) – chrslg Aug 08 '23 at 21:02
  • (...) the more negligible the cost of that outer loop is. For 3 nested loops `for i in range(n): for j in range(n): for k in range(n)`, the cost is O(n) + O(n²) + O(n³), which is obviously O(n³), but the O(n) part still counts for small n if each of its iterations is 1000 times more expensive, which is what happens once the two inner loops are vectorized (`for i in range(n): vectorizedOpOnAll(j, k)`). But the bigger n is, the less the non-vectorized `for i` and its O(1000n) matter next to the O(n³) of the vectorized part. – chrslg Aug 08 '23 at 21:05
  • So, it is a bit that situation. My method is O(2n³). Reinderien's faster method is O(1000n + n³) (my notation makes no sense strictly speaking, since O(2n³) and O(1000n + n³) are both aliases of O(n³), but I think you get it). So in the end (I mean, when n is huge), the 1000x cost of what is not vectorized doesn't really matter, and the 2x cost of my `0*0+0*0` computations is what is left to matter. – chrslg Aug 08 '23 at 21:08
  • Hence why Reinderien asked you "which size?" and "can we factor preprocessing?": depending on the size, it is obviously not the same method that is the fastest. – chrslg Aug 08 '23 at 21:11
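For anyone reproducing these timing comparisons, a minimal harness sketch (my addition; it assumes the two approaches have been wrapped as functions, here hypothetically named diagdot_strided for this answer and diagdot_roll from the answer below):

import timeit
import numpy as np

n = 500
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

for fn in (diagdot_strided, diagdot_roll):
    t = timeit.timeit(lambda: fn(A, B), number=10)
    print(f'{fn.__name__}: {t / 10:.4f} s per call')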

The easy pass is vectorisation over the innermost loop with a dot product. You can vectorise one more dimension by using a rolled matrix product. The original naive version is shown alongside the other two methods:

import numpy as np
from numpy.random import default_rng


def diagdot_slow(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    C = np.zeros_like(A)
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            trace_length = 1 + min(i, j)
            for t in range(trace_length):
                C[i, j] += A[t, t] * B[i - t, j - t]
    return C


# https://stackoverflow.com/a/56175538/313768
def indep_roll(arr: np.ndarray, shifts: np.ndarray, axis: int = 1) -> np.ndarray:
    arr = np.swapaxes(arr, axis, -1)
    all_idcs = np.ogrid[[slice(0, n) for n in arr.shape]]

    # Convert to positive shifts without mutating the caller's array
    shifts = shifts % arr.shape[-1]
    all_idcs[-1] = all_idcs[-1] - shifts[:, np.newaxis]

    result = arr[tuple(all_idcs)]
    arr = np.swapaxes(result, -1, axis)
    return arr


def diagdot_inner_dot(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    Adiag = A.diagonal()
    C = np.empty_like(A)
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            trace_length = 1 + min(i, j)
            Bdiag = B.diagonal(offset=j-i)[trace_length-1::-1]
            C[i, j] = Adiag[:trace_length].dot(Bdiag)
    return C


def diagdot_roll(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    Adiag = A.diagonal()
    n = A.shape[0]
    C = np.empty_like(A)
    shifts = np.arange(0, -n, -1)
    # After rolling: Brolly[i, j] == B[(i + j) % n, j] and Brollx[i, j] == B[i, (j + i) % n]
    Brolly = indep_roll(arr=B, shifts=shifts, axis=0)
    Brollx = indep_roll(arr=B, shifts=shifts, axis=1)

    for trace_len in range(1, n+1):
        Afactor = Adiag[:trace_len]
        short_dim = slice(trace_len - n - 1, -n - 1, -1)  # indices trace_len-1 .. 0, reversed
        long_dim = n - trace_len + 1

        # Vertical segment; include the primary diagonal
        Bfactory = Brolly[:long_dim, short_dim]
        C[trace_len - 1:, trace_len - 1] = Bfactory @ Afactor

        # Horizontal segment; exclude the primary diagonal
        Bfactorx = Brollx[short_dim, 1: long_dim]
        C[trace_len - 1, trace_len:] = Afactor @ Bfactorx

    return C


def test() -> None:
    rand = default_rng(seed=0)
    n = 20
    A = rand.random((n, n))
    B = rand.random((n, n))
    Cref = diagdot_slow(A, B)

    for method in (diagdot_inner_dot, diagdot_roll):
        C = method(A, B)
        assert np.allclose(C, Cref)


if __name__ == '__main__':
    test()

Vectorising the last dimension is non-trivial because it's not rectangular.
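For completeness, a fully vectorised variant is possible if you accept the zero-padding trade-off used in the other answer. A sketch (not benchmarked) using sliding_window_view as a safer alternative to raw as_strided:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def diagdot_windowed(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    n = A.shape[0]
    Bpad = np.pad(B, ((n - 1, 0), (n - 1, 0)))  # zeros absorb out-of-range terms
    w = sliding_window_view(Bpad, (n, n))       # w[i, j, p, q] == Bpad[i+p, j+q]
    d = w.diagonal(axis1=2, axis2=3)            # d[i, j, m] == B[i+m-(n-1), j+m-(n-1)] or 0
    # C[i, j] = sum_t A[t, t] * B[i-t, j-t], substituting t = n-1-m
    return d @ A.diagonal()[::-1]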

Reinderien
  • I had a similar approach so far, but it looks like you beat me by a couple of minutes; no point in sharing it now. I'll have to see if I can think of how to vectorize it, though I'm not sure it's possible. – jared Aug 07 '23 at 01:15
  • I'm not sure if vectorising the last dimension is possible as each 'diagonal' at an index has a different length. – hertz Aug 07 '23 at 15:24
  • It's possible with a lot of index manipulation, which may end up being more costly than the alternative. – Reinderien Aug 07 '23 at 15:25
  • Having done a speed comparison, your answer seems to be quicker than @chrslg's answer – hertz Aug 07 '23 at 15:57
  • That's nice to hear, but: for what input size and which method? And there will be potentially a better solution, depending on your response to the pre-processing comment. – Reinderien Aug 07 '23 at 16:00