
I am trying to perform a large linear-algebra computation to transform a generic covariance matrix KK_l_obs (shape (NL, NL)) into a map of covariance matrices in a reduced space, Kmap_PC (shape (q, q, X, Y)).

Information about how to construct Kmap_PC for each spatial location is held in the arrays a_map, I0, and k_l_th. The first two have shape (X, Y), and the third has shape (nl, nl). The transformation between the observed and reduced space is handled by the eigenvectors E (shape (q, nl)). Note that NL > nl.

A spatial element (x, y) of Kmap_PC is calculated as:

Kmap_PC[..., x, y] = E.dot(
    KK_l_obs[I0[x, y] : I0[x, y] + nl,
             I0[x, y] : I0[x, y] + nl] / a_map[x, y]
    + k_l_th).dot(E.T)

The bit inside the first dot product could in principle be computed directly with np.einsum, but the intermediate array would take up hundreds of GB of memory. What I am doing now is looping over the spatial indices of Kmap_PC, which is pretty slow. I could also distribute the calculation using MPI (which would probably give a speedup of 3-4x, since I have 16 cores available).
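
For scale, here is a rough estimate of the fully vectorized (X, Y, nl, nl) intermediate at the real-problem dimensions mentioned in the comments, which I'm taking to be X = Y = 70 and nl = 3000:

print(70 * 70 * 3000 * 3000 * 8 / 1e9)  # float64 -> ~353 GB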

I'm wondering:

(a) if I can do the computation more efficiently, perhaps by explicitly breaking it down into groups of spatial elements (a sketch of this follows the code snippet); and

(b) if I can reduce the memory overhead of those calculations.

Code snippet

import numpy as np
np.random.seed(1)

X = 10
Y = 10
NL = 3000
nl = 1000
q = 7

a_map = 5. * np.random.rand(X, Y)
E = np.random.randn(q, nl)

# construct constant component
m1_ = .05 * np.random.rand(nl, nl)
k_l_th = m1_.dot(m1_)

# construct variable component
m2_ = np.random.rand(NL, NL)
KK_l_obs = m2_.dot(m2_.T)

# where to start in big cov
I0 = np.random.randint(0, NL - nl, (X, Y))

# the slow way
def looping():
    K_PC = np.empty((q, q, X, Y))
    inds = np.ndindex((X, Y))

    for si in inds:
        I0_ = I0[si[0], si[1]]
        K_PC[..., si[0], si[1]] = E.dot(
            KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] / a_map[si[0], si[1]] + k_l_th).dot(E.T)

    return K_PC

def veccalc():
    # (nl, nl) index grids for the sliding windows
    I, J = np.meshgrid(np.arange(nl), np.arange(nl))

    # gather all windows at once: shape (X, Y, nl, nl)
    K_s = KK_l_obs[I0[..., None, None] + J, I0[..., None, None] + I]
    K_s = K_s / a_map[..., None, None] + k_l_th[None, None, ...]
    print(K_s.nbytes)  # size of the large intermediate

    K_PC = E @ K_s @ E.T                        # (X, Y, q, q)
    K_PC = np.moveaxis(K_PC, [0, 1], [-2, -1])  # -> (q, q, X, Y)

    return K_PC
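
Regarding (a), a chunked middle ground between looping() and veccalc() would process one row of spatial elements at a time, so the intermediate is only (Y, nl, nl) rather than (X, Y, nl, nl). A sketch, using the setup above (the function name and the choice of chunking axis are arbitrary):

def chunked():
    K_PC = np.empty((q, q, X, Y))
    for x in range(X):
        # stack this row's Y windows into one (Y, nl, nl) array
        K_s = np.stack([KK_l_obs[i0:i0 + nl, i0:i0 + nl]
                        for i0 in I0[x]])
        K_s = K_s / a_map[x, :, None, None] + k_l_th
        # (q, nl) @ (Y, nl, nl) @ (nl, q) -> (Y, q, q)
        K_PC[..., x, :] = np.moveaxis(E @ K_s @ E.T, 0, -1)
    return K_PC
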
DathosPachy
  • The subject line is misleading, sounding like you were creating an array from multiple `aranges` or something like that. Rather this is a large `dot` product question, `E.dot(A).dot(E.T)`. I'd like to see the `einsum` expression, and a small test case that I could run with simple copy-n-paste. It's hard to grasp the calculation just by reading your description. – hpaulj Jan 30 '17 at 18:04
  • Just added an example with a looped implementation, and relatively small data dimensions. Working on `einsum`-based example now – DathosPachy Jan 30 '17 at 18:46
  • So with these numbers you do 100 double dot products involving `(7,1000)@(1000,1000)@(1000,7) => (7,7)` . If I could do the `I0` mapping (handling both the indexing and memory size), the big problem would be `(7,1000)@(10,10,1000,1000)@(1000,7) -> (10,10,7,7) ` – hpaulj Jan 30 '17 at 20:47
  • I have handled the `I0` mapping. Basically, the issue is that as `X, Y` approach 70 or so, and as `nl` and `NL` approach 3000 & 4000 (which is closer to what my real problem is), the intermediate matrix `K_s` gets very large. – DathosPachy Jan 30 '17 at 22:05

2 Answers


Tweak #1

One very simple performance tweak that's often overlooked in NumPy is replacing division with multiplication. For scalar-to-scalar divisions, or divisions between equal-shaped arrays, this makes little difference. But with NumPy's implicit broadcasting, divisions between arrays of different shapes, or between an array and a scalar, can get a noticeable boost from multiplying by the reciprocal instead. Thus, for the stated problem, we would pre-compute the reciprocal of a_map and use it for multiplication in place of division.

So, at the start, do:

r_a_map = 1.0/a_map

Then, within the nested loops, use it as:

KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] * r_a_map[si[0], si[1]]
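
To see the effect of tweak #1 in isolation, here is a standalone timing sketch (illustrative; the array size and exact numbers are my own choices, not from the question):

import timeit
import numpy as np

A = np.random.rand(1000, 1000)
s = 3.7
r = 1.0 / s  # pre-computed reciprocal

t_div = timeit.timeit(lambda: A / s, number=200)
t_mul = timeit.timeit(lambda: A * r, number=200)
print(t_div, t_mul)  # multiplication is typically the faster of the two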

Tweak #2

We can use the distributive property of multiplication over addition there:

A*(B + C) = A*B + A*C

Thus, k_l_th, which is added in every iteration but stays constant, can be taken out of the loop. Its effective contribution to each output element is E.dot(k_l_th).dot(E.T), which we compute once and add to K_PC after the nested loops.
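
A quick standalone check of that rearrangement (a sketch with made-up small shapes, independent of the question's variables):

import numpy as np
np.random.seed(0)

E_ = np.random.randn(3, 5)
K_ = np.random.randn(5, 5)
C_ = np.random.randn(5, 5)   # plays the role of k_l_th
a_ = 2.5

lhs = E_.dot(K_ / a_ + C_).dot(E_.T)
rhs = E_.dot(K_ / a_).dot(E_.T) + E_.dot(C_).dot(E_.T)
print(np.allclose(lhs, rhs))  # True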


Finalizing and benchmarking

Using tweaks #1 and #2, we end up with a modified approach, like so -

def original_mod_app():
    r_a_map = 1.0/a_map
    K_PC = np.empty((q, q, X, Y))
    inds = np.ndindex((X, Y))
    for si in inds:
        I0_ = I0[si[0], si[1]]
        K_PC[..., si[0], si[1]] = E.dot(
            KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] * \
            r_a_map[si[0], si[1]]).dot(E.T)
    return K_PC + E.dot(k_l_th).dot(E.T)[:,:,None,None]

Runtime test with the same sample setup as used in the question (original_app here is the question's looping function) -

In [458]: %timeit original_app()
1 loops, best of 3: 1.4 s per loop

In [459]: %timeit original_mod_app()
1 loops, best of 3: 677 ms per loop

In [460]: np.allclose(original_app(), original_mod_app())
Out[460]: True

So, we are getting a speedup of 2x+ there.

Divakar
  • Would it be possible/advantageous to pull the multiplication by `r_a_map` out at the end of the loop as well? – DathosPachy Jan 30 '17 at 22:07
  • 1
    @DathosPachy I have tried that and I have a fully vectorized version at my end with it, but its slower, so not uploading that one :) – Divakar Jan 30 '17 at 22:10
  • Accepting this answer, since it gave a pretty substantial performance improvement. – DathosPachy Jan 31 '17 at 18:59

On a relatively modest machine (4 GB memory), a matmul calculation on the whole 10x10x1000x1000 space works.

def looping2(n=2):
    ktemp = np.empty((n,n,nl,nl))
    for i,j in np.ndindex(ktemp.shape[:2]):
        I0_ = I0[i, j]
        temp = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
        temp = temp / a_map[i,j] + k_l_th
        ktemp[i,j,...] = temp
    K_PC = E @ ktemp @ E.T      
    return K_PC

K = looping()
k4 = looping2(n=X)
np.allclose(k4, K.transpose(2, 3, 0, 1))  # True

I haven't tried to vectorize the I0 mapping. My focus is on generalizing the double dot product.

The equivalent einsum is:

K_PC = np.einsum('ij,...jk,lk->il...', E, ktemp, E) 

That produces ValueError: iterator is too large for n=7.

But with the latest NumPy version,

K_PC = np.einsum('ij,...jk,lk->il...', E, ktemp, E, optimize='optimal')

does work for the full 7x7x10x10 output.
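
To inspect which contraction order the optimizer actually chooses, np.einsum_path (added alongside the optimize keyword) reports the plan:

path, report = np.einsum_path('ij,...jk,lk->il...', E, ktemp, E,
                              optimize='optimal')
print(report)  # chosen contraction order plus FLOP and memory estimates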

Timings aren't promising: 2.2 s for the original looping, 3.9 s for the big matmul (or einsum). (I get the same 2x speedup with original_mod_app.)

============

time for constructing a (10,10,1000,1000) array (iteratively):

In [31]: %%timeit 
    ...:     ktemp = np.empty((n,n,nl,nl))
    ...:     for i,j in np.ndindex(ktemp.shape[:2]):
    ...:         I0_ = I0[i, j]
    ...:         temp = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
    ...:         ktemp[i,j,...] = temp
    ...:     
1 loop, best of 3: 749 ms per loop

time for reducing that to (10,10,7,7) with @ (longer than the construction):

In [32]: timeit E @ ktemp @ E.T
1 loop, best of 3: 1.17 s per loop

time for the same two operations, but with the reduction inside the loop:

In [33]: %%timeit 
    ...:     ktemp = np.empty((n,n,q,q))
    ...:     for i,j in np.ndindex(ktemp.shape[:2]):
    ...:         I0_ = I0[i, j]
    ...:         temp = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
    ...:         ktemp[i,j,...] = E @ temp @ E.T

1 loop, best of 3: 858 ms per loop

Performing the dot product within the loop reduces the size of the subarrays that are saved to ktemp, making up for the calculation cost. The dot operation on the big array is, by itself, more expensive than your loop. Even if we could 'vectorize' KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl], it wouldn't make up for the cost of handling that big array.
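
To put rough numbers on that, compare the bytes staged by the two strategies (using the snippet's dimensions):

print(10 * 10 * 1000 * 1000 * 8)  # full (10, 10, 1000, 1000) float64 intermediate: 800 MB
print(10 * 10 * 7 * 7 * 8)        # reduced (10, 10, 7, 7) float64 blocks: ~39 kB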

hpaulj
  • I also profiled my code snippets and figured out that the vectorized example didn't speed things up... – DathosPachy Jan 30 '17 at 22:11
  • I've seen other cases where a modest number of iterations over smaller dot products is faster than one big calculation. If the iteration count is small relative to the total number of calculations, the iteration overhead is small. I suspect memory management issues slow down the big calculations. – hpaulj Jan 30 '17 at 22:51
  • So with your loop we do a bit more calculating to create a (10,10,7,7) array, while I tried to make a (10,10,1000,1000) and then reduce it. – hpaulj Jan 30 '17 at 23:26