Following @DSM's comment, I'm assuming your C[k] == i is supposed to be B[k] == i. If that's the case, does your loop version look something like this?
Nested Loop Version
import numpy as np

N = 5
M = 2

A = np.zeros((M,N))
B = np.random.randint(M, size=N) # contains indices for A
C = np.random.rand(N,N)

for i in range(M):
    for j in range(N):
        for k in range(N):
            if B[k] == i:
                A[i,j] += C[j,k]
There's more than one way to vectorize this problem. I'm going to show my thought process below, but there are more efficient ways to do it (e.g. @DSM's version that recognizes the matrix multiplication inherent in the problem).
For the sake of explanation, here's a walk-through of one approach.
Vectorizing the Inner Loop
Let's start by re-writing the inner k loop:
for i in range(M):
    for j in range(N):
        A[i,j] = C[j, B == i].sum()
It might be easier to think of this as C[j][B == i].sum(). We're just selecting the j-th row of C, selecting only the elements in that row where B is equal to i, and summing them.
Vectorizing the Outer-most Loop
Next let's break down the outer i loop. Now we're going to get to the point where readability will start to suffer, unfortunately...
i = np.arange(M)[:,np.newaxis]
mask = (B == i).astype(int)
for j in range(N):
    A[:,j] = (C[j] * mask).sum(axis=-1)
There are a couple different tricks here. In this case, we're iterating over the columns of A. Each column of A is the sum of a subset of the corresponding row of C. The subset of the row of C is determined by where B is equal to the row index i.

To get around iterating through i, we're making a 2D array where B == i by adding a new axis to i. (Have a look at the documentation for numpy broadcasting if you're confused by this.) In other words:
B:
    array([1, 1, 1, 1, 0])

i:
    array([[0],
           [1]])

B == i:
    array([[False, False, False, False,  True],
           [ True,  True,  True,  True, False]], dtype=bool)
What we want is to take two (M) filtered sums of C[j], one for each row in B == i. This will give us a two-element vector corresponding to the j-th column in A.

We can't do this by indexing C directly because the result won't maintain its shape, as each row may have a different number of elements. We'll get around this by multiplying the B == i mask by the current row of C, resulting in zeros where B == i is False, and the value in the current row of C where it's True.
To do this, we need to turn the boolean array B == i into integers:

mask = (B == i).astype(int):
    array([[0, 0, 0, 0, 1],
           [1, 1, 1, 1, 0]])
So when we multiply it by the current row of C:

C[j]:
    array([ 0.19844887,  0.44858679,  0.35370919,  0.84074259,  0.74513377])

C[j] * mask:
    array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.74513377],
           [ 0.19844887,  0.44858679,  0.35370919,  0.84074259,  0.        ]])
Then we can sum over each row to get the current column of A (this will be broadcast to a column when it's assigned to A[:,j]):

(C[j] * mask).sum(axis=-1):
    array([ 0.74513377,  1.84148744])
Fully Vectorized Version
Finally, breaking down the last loop, we can apply the exact same principle to add a third dimension for the loop over j:
i = np.arange(M)[:,np.newaxis,np.newaxis]
mask = (B == i).astype(int)
A = (C * mask).sum(axis=-1)
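If you want to convince yourself that this matches the loop version, a quick check along these lines (re-using M, N, B, and C from the top; A_loops and A_vec are just local names for the comparison) should pass:

# Sanity check: the fully vectorized result equals the triple-loop result.
A_loops = np.zeros((M, N))
for ii in range(M):
    for j in range(N):
        for k in range(N):
            if B[k] == ii:
                A_loops[ii, j] += C[j, k]

i = np.arange(M)[:, np.newaxis, np.newaxis]
A_vec = ((B == i).astype(int) * C).sum(axis=-1)
assert np.allclose(A_loops, A_vec)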
@DSM's vectorized version
As @DSM suggested, you could also do:
A = (B == np.arange(M)[:,np.newaxis]).dot(C.T)
This is by far the fastest solution for most sizes of M and N, and arguably the most elegant (much more elegant than my solutions, anyway).
Let's break it down a bit.
The B == np.arange(M)[:,np.newaxis] is exactly equivalent to B == i in the "Vectorizing the Outer-most Loop" section above.

The key is in recognizing that all of the j and k loops are equivalent to matrix multiplication. dot will cast the boolean B == i array to the same dtype as C behind-the-scenes, so we don't need to worry about explicitly casting it to a different type.

After that, we're just performing matrix multiplication on the transpose of C (a 5x5 array) and the "mask" 0-and-1 array above, yielding a 2x5 array.
dot will take advantage of any optimized BLAS libraries you have installed (e.g. ATLAS, MKL), so it's very fast.
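As an aside, the einsum entry in the timings below isn't spelled out in this answer; I'd assume it's something roughly like this sketch (the name einsum_version and the exact formulation are my guess, not the benchmarked code):

def einsum_version(B, C, M):
    # Hypothetical reconstruction: A[i, j] = sum_k mask[i, k] * C[j, k]
    mask = (B == np.arange(M)[:, np.newaxis]).astype(C.dtype)
    return np.einsum('ik,jk->ij', mask, C)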
Timings
For small M's and N's, the differences are less apparent (~6x between looping and DSM's version):
M, N = 2, 5
%timeit loops(B,C,M)
10000 loops, best of 3: 83 us per loop
%timeit k_vectorized(B,C,M)
10000 loops, best of 3: 106 us per loop
%timeit vectorized(B,C,M)
10000 loops, best of 3: 23.7 us per loop
%timeit askewchan(B,C,M)
10000 loops, best of 3: 42.7 us per loop
%timeit einsum(B,C,M)
100000 loops, best of 3: 15.2 us per loop
%timeit dsm(B,C,M)
100000 loops, best of 3: 13.9 us per loop
However, once M and N start to grow, the difference becomes very significant (~600x) (note the units!):
M, N = 50, 20
%timeit loops(B,C,M)
10 loops, best of 3: 50.3 ms per loop
%timeit k_vectorized(B,C,M)
100 loops, best of 3: 10.5 ms per loop
%timeit ik_vectorized(B,C,M)
1000 loops, best of 3: 963 us per loop
%timeit vectorized(B,C,M)
1000 loops, best of 3: 247 us per loop
%timeit askewchan(B,C,M)
1000 loops, best of 3: 493 us per loop
%timeit einsum(B,C,M)
10000 loops, best of 3: 134 us per loop
%timeit dsm(B,C,M)
10000 loops, best of 3: 80.2 us per loop
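(For completeness: the names above refer to the functions being benchmarked. I'm assuming loops, vectorized, and dsm wrap the nested-loop, fully vectorized, and @DSM versions shown in this answer, roughly as sketched below; k_vectorized, ik_vectorized, askewchan, and einsum correspond to the intermediate steps and to other answers.)

# Rough reconstruction of the three variants shown in this answer, wrapped as
# functions so the %timeit calls above can be reproduced.
def loops(B, C, M):
    N = C.shape[0]
    A = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(N):
                if B[k] == i:
                    A[i, j] += C[j, k]
    return A

def vectorized(B, C, M):
    i = np.arange(M)[:, np.newaxis, np.newaxis]
    return ((B == i).astype(int) * C).sum(axis=-1)

def dsm(B, C, M):
    return (B == np.arange(M)[:, np.newaxis]).dot(C.T)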