
How can I transform 100 8-element vectors into 10 16-element vectors using 1000 different (8,16) weight matrices? Each of the 10 output vectors is the sum of 100 dot products:

import numpy as np

A = np.random.randn(100, 8)
W = np.random.randn(1000, 8, 16)
B = []

for i in range(10):
  acc = np.zeros((1, 16))          # accumulator for output block i
  for j in range(100):
    acc += np.dot(A[j], W[i*100 + j])
  B.append(acc)
B = np.asarray(B)     # B.shape = (10, 1, 16) -- note the extra singleton axis

Is there a function in NumPy or TensorFlow for that? I looked at dot, tensordot, einsum, and matmul in NumPy, and I'm still not sure which one is the right option.

EDIT: I just realized that I actually want to produce an intermediate result before summing dot products: (100,8)x(10,100,8,16) -> (10,100,16).

I'm guessing this could be done by reshaping (100,8) to (1,100,1,8) and (1000,8,16) to (10,100,8,16), then doing np.einsum('ijkl,ijlm->ijm', A, W), but I'm not sure if it will broadcast the 1 to 10 correctly.

Per @Divakar's comment, np.einsum('jk,ijkl->ijl', A, W.reshape(10,100,8,16)) does the trick.
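
For reference, the intermediate result and the summed result line up like this (a quick check with the arrays defined above):

inter = np.einsum('jk,ijkl->ijl', A, W.reshape(10, 100, 8, 16))   # (10, 100, 16)
# summing the 100 dot products in each block recovers the final output
assert np.allclose(inter.sum(axis=1), B.squeeze())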


2 Answers


In one line, it's

B1 = np.einsum('ij,ikjl', A, np.reshape(W, (100, 10, 8, 16), order='F'))

Test it with np.allclose(B.squeeze(), B1); the .squeeze() is needed because your B has an extra dimension of size 1.

Explanation: your W has an ugly shape; its first dimension of size 1000 should really be separated into 10 blocks of size 100 (as you in effect do with the index arithmetic in your loop). This is what reshape is for. The Fortran-style order is needed because we want to take elements out of W by changing the first index the fastest.

After that it's a straightforward Einstein summation: summing over j gives the matrix multiplication, and summing over i adds up the 100 results.
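
Concretely (a quick check, using the W from the question), the Fortran-order reshape puts the original W[i*100 + j] at index [j, i] of the reshaped array:

Wf = np.reshape(W, (100, 10, 8, 16), order='F')
# weight matrix number i*100 + j lands at block position [j, i]
assert np.allclose(Wf[5, 3], W[3*100 + 5])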

  • Wasn't expecting this, but this is faster than `tensordot`. Also, you can avoid the Fortran thing with: `np.einsum('jk,ijkl', A, np.reshape(W, (10, 100, 8, 16)))`. Added timings in my post. – Divakar Nov 16 '17 at 04:41
  • @Divakar, why doesn't this work: `np.einsum('ij,ikjl', A, np.reshape(W, (100, 10, 8, 16)))`? It produces the correct output shape, but wrong values... – MichaelSB Nov 16 '17 at 19:23
  • @MichaelSB Because we need to split the first axis of `W` such that the later axis is of length 100, which gets sum reduced against the first axis of A. – Divakar Nov 16 '17 at 19:47
  • @MichaelSB By default, C-style, the last index changes fastest. So reshaping 1000 elements into (100, 10) gets you 100 blocks of size 10, instead of 10 blocks of size 100. I worked around this by using Fortran style, Divakar by changing the order of axes. For a concrete example, compare `np.reshape(np.arange(6), (2, 3))`, `np.reshape(np.arange(6), (2, 3), order='F')`, and `np.reshape(np.arange(6), (3, 2))` (written out after these comments). –  Nov 16 '17 at 19:59
  • @Divakar and 6' white male, thank you, I think I'm starting to get a clue. However, I just realized that I need an intermediate calculation saved as well. I edited my question. Do you mind taking a look? – MichaelSB Nov 16 '17 at 20:57
  • @MichaelSB `np.einsum('jk,ijkl->ijl', A, np.reshape(W, (10, 100, 8, 16)))` I think. – Divakar Nov 16 '17 at 21:00
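
To spell out the reshape-order comparison from the comments (a small demo; outputs shown as comments):

a = np.arange(6)
np.reshape(a, (2, 3))             # [[0 1 2]
                                  #  [3 4 5]]  -- last index varies fastest
np.reshape(a, (2, 3), order='F')  # [[0 2 4]
                                  #  [1 3 5]]  -- first index varies fastest
np.reshape(a, (3, 2))             # [[0 1]
                                  #  [2 3]
                                  #  [4 5]]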

You can use tensor-based multiplication with np.tensordot -

def tensordot_app(A, W):
    # A: (100, 8), W: (1000, 8, 16)
    m, n, r = W.shape
    Wr = W.reshape(-1, A.shape[0], n, r)           # (10, 100, 8, 16)
    # contract A's axes (0, 1) against Wr's axes (1, 2)
    return np.tensordot(A, Wr, axes=((0,1),(1,2)))
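
In shape terms, axes=((0,1),(1,2)) contracts A's (100, 8) against the matching axes of Wr, leaving Wr's free axes (10, 16). A quick check against the loop result from the question (assuming the B computed there):

out = tensordot_app(A, W)              # (10, 16)
assert np.allclose(out, B.squeeze())   # squeeze drops B's singleton axis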

Related post to understand tensordot.

Runtime test -

In [62]: A = np.random.randn(100,8)
    ...: W = np.random.randn(1000,8,16)

In [63]: %%timeit
    ...: B = []
    ...: for i in range(10):
    ...:   acc = np.zeros((1,16))
    ...:   for j in range(100):
    ...:     acc += np.dot(A[j], W[i*100+j])
    ...:   B.append(acc)
    ...: B = np.asarray(B)     # B.shape = (10, 1, 16)
1000 loops, best of 3: 1.81 ms per loop

# Other post's einsum soln
In [64]: %timeit np.einsum('ij,ikjl',A,np.reshape(W,(100,10,8,16), order='F'))
10000 loops, best of 3: 83.4 µs per loop

# Other post's einsum soln without fortran re-ordering
In [65]: %timeit np.einsum('jk,ijkl', A, np.reshape(W, (10, 100, 8, 16)))
10000 loops, best of 3: 83.3 µs per loop

In [66]: %timeit tensordot_app(A, W)
10000 loops, best of 3: 193 µs per loop