One way to have a fully vectorized version (but at the cost of doing sometimes some useless multiplication with 0) is to create a matrix of the subdiagonals
import numpy as np
# Example
n=10
A=np.arange(n*n).reshape(n,n)
B=np.random.uniform(-1,1,(n,n))
# C computation
Adiag=np.diag(A)[None,None,:]
Bpadded=np.pad(B, ((n-1,0),(n-1,0)))
s1,s2=Bpadded.strides
Bdiags = np.lib.stride_tricks.as_strided(Bpadded[n-1:,n-1:], shape=(n,n,n), strides=(s1,s2,-s1-s2))
C = (Bdiags*Adiag).sum(axis=2)
# Verification
# it works for your examples
C[7,4], A[0,0]*B[7,4] + A[1,1]*B[6,3] + A[2,2]*B[5,2] + A[3,3]*B[4,1] + A[4,4]*B[3,0]
# (-47.11186336152282, -47.11186336152282)
C[5,9], A[0,0]*B[5,9] + A[1,1]*B[4,8] + A[2,2]*B[3,7] + A[3,3]*B[2,6] + A[4,4]*B[1,5] + A[5,5]*B[0,4]
# (85.27133136528813, 85.27133136528813)
Some explanation:
Adiags is easy. It is just the diagonal of A, but reshaped in fake 3d, with only axis 2 being size n (just for later broadcasting)
Bpadded is just a copy of B, but ensuring that there are enough 0 around
And the tricky part is Bdiags. It is a 3d matrix, whose data are the one of Bpadded (no memory copy. It is the same memory address). But with different strides. Since in an array, arr[i1,i2,i3]
is the data at address addressof(arr[0,0,0]) + strides1*i1 + strides2*i2 + strides3*i3
, in my case, Bdiags[i1,i2,i3]
is therefore data at address addressOf(Bpadded[n-1,n-1])+i1*strides1+i2*strides2+i3*(-strides1-strides2)
aka adressOf(Bpadded[n-1,n-1])+(i1-i3)*strides1 + (i2-i3)*strides2
. So it is Bpadded[n-1+i1-i3, n-1+i2-i3]
. Which is B[i1-i3,i2-i3]
but for the 0 when i1-i3 or i2-i3 are negative.
So Bdiags is just what you want to multiply with A[k,k]
Rest is simple broadcasting/multiplication/sum.