So I am trying to compute the Kronecker product of two matrices, each of arbitrary dimension (I use square matrices of the same dimension just for the examples).

Initially I tried using np.kron:

import time
import numpy as np

a = np.random.random((60,60))
b = np.random.random((60,60))

start = time.time()
a = np.kron(a,b)
end = time.time()
print(end - start)

Output: 0.160096406936645

To try to get a speedup, I used tensordot:

a = np.random.random((60,60))
b = np.random.random((60,60))

start = time.time()
a = np.tensordot(a,b,axes=0)
a = np.transpose(a,(0,2,1,3))
a = np.reshape(a,(3600,3600))
end = time.time()
print(end - start)

Output: 0.11808371543884277

After searching the web a bit, I found that (at least to my understanding) NumPy has to make an extra copy when it reshapes an array that has been transposed, because the transposed array is no longer contiguous in memory.
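
A quick way to see this copy in action (plain NumPy; transposing alone only creates a view, while reshaping the transposed array allocates a new buffer):

t = np.tensordot(np.random.random((60,60)), np.random.random((60,60)), axes=0)
tt = np.transpose(t, (0,2,1,3))    # a view; no data is moved yet
print(tt.flags['C_CONTIGUOUS'])    # False: memory order no longer matches the shape
r = np.reshape(tt, (3600,3600))    # this step makes the full copy
print(np.shares_memory(t, r))      # False: r is a new buffer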

So I then tried the following (this code obviously does not give the Kronecker product of a and b, but I was just doing it as a test):

a = np.random.random((60,60))
b = np.random.random((60,60))

start = time.time()
a = np.tensordot(a,b,axes=0)
a = np.reshape(a,(3600,3600))
end = time.time()
print(end - start)

Output: 0.052041053771972656

My question is: how can I compute the Kronecker product without incurring the extra copy caused by the transpose?

I am just looking for a speedup, so the solution does not have to use tensordot.

EDIT

I just found, in this Stack Overflow post (speeding up numpy kronecker products), that there is another way to do it:

a = np.random.random((60,60))
b = np.random.random((60,60))

c = a

start = time.time()
a = a[:,np.newaxis,:,np.newaxis]*b[np.newaxis,:,np.newaxis,:]
a.shape = (3600,3600)  # in-place reshape; raises if a copy would be required
end = time.time()
end = time.time()

test = np.kron(c,b)
print(np.array_equal(a,test))
print(end-start)


Output: True
0.05503702163696289

I am still interested in the question of whether this computation can be sped up further.

user1058860

3 Answers

einsum seems to work:

>>> a = np.random.random((60,60))
>>> b = np.random.random((60,60))
>>> ab = np.kron(a,b)
>>> abe = np.einsum('ik,jl', a, b).reshape(3600,3600)
>>> (abe==ab).all()
True
>>> timeit(lambda: np.kron(a, b), number=10)
1.0697475590277463
>>> timeit(lambda: np.einsum('ik,jl', a, b).reshape(3600,3600), number=10)
0.42500176999601535
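
Note that in implicit mode einsum orders the output axes alphabetically, so 'ik,jl' here is just shorthand for the explicit spec 'ik,jl->ijkl':

>>> abe2 = np.einsum('ik,jl->ijkl', a, b).reshape(3600,3600)  # same array as abe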

Simple broadcasting is even faster:

>>> abb = (a[:, None, :, None]*b[None, :, None, :]).reshape(3600,3600)
>>> (abb==ab).all()
True
>>> timeit(lambda:  (a[:, None, :, None]*b[None, :, None, :]).reshape(3600,3600), number=10)
0.28011218502069823

UPDATE: Using BLAS and Cython we can get another modest (30%) speedup. Decide for yourself whether it is worth the trouble.

[setup.py]

from distutils.core import setup
from Cython.Build import cythonize

setup(name='kronecker',
      ext_modules=cythonize("cythkrn.pyx"))

[cythkrn.pyx]

import cython
cimport scipy.linalg.cython_blas as blas
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def kron(double[:, ::1] a, double[:, ::1] b):
    cdef int i = a.shape[0]
    cdef int j = a.shape[1]
    cdef int k = b.shape[0]
    cdef int l = b.shape[1]
    cdef int onei = 1
    cdef double oned = 1
    cdef int m, n
    result = np.zeros((i*k, j*l), float)
    cdef double[:, ::1] result_v = result
    # Row n*k+m of kron(a, b) is the flattened outer product of a[n, :]
    # and b[m, :]; dger performs exactly this rank-1 update, writing
    # directly into that row of the result.
    for n in range(i):
        for m in range(k):
            blas.dger(&l, &j, &oned, &b[m, 0], &onei, &a[n, 0], &onei, &result_v[m+k*n, 0], &l)
    return result

To build, run cython cythkrn.pyx and then python3 setup.py build (Cython and SciPy must be installed).

>>> from timeit import timeit
>>> import cythkrn
>>> import numpy as np
>>> 
>>> a = np.random.random((60,60))
>>> b = np.random.random((60,60))
>>>
>>> np.all(cythkrn.kron(a, b)==np.kron(a, b))
True
>>> 
>>> timeit(lambda: cythkrn.kron(a, b), number=10)
0.18925874299020506
Paul Panzer
  • Thank you for the solution. I am also seeing the same issue with transpose when contracting indices with tensordot for higher-dimensional tensors, but I see that einsum and broadcasting do not show an improvement over tensordot. Should that be expected, or should they still have an advantage? – user1058860 May 09 '19 at 23:59
  • @user1058860 I don't know the inner workings of tensordot, so I can't answer that. But I got curious as to how far one can push it; see the updated post. I'd like to think that the BLAS/Cython approach is as fast as it gets without investing an insane amount of development time. – Paul Panzer May 10 '19 at 02:50
  • On which platform did you get these timings? On Windows (Core i5 8th gen), using your Cython implementation, I get 8 ms single-threaded / 8 ms multithreaded with pre-allocated memory, and about 40 ms single-threaded / 20 ms multithreaded without pre-allocated memory (almost exactly the same behaviour as my Numba implementation). – max9111 May 10 '19 at 15:22
  • @max9111 It's a fairly old x86_64 Linux machine; I don't think it has a superfast BLAS. Your observations are puzzling; for example, I do not get any speedup to write home about from preallocating. That said, this may be a flaw in my timings, since I have no control over how the OS recycles just-freed memory. – Paul Panzer May 10 '19 at 17:28

Speeding up memory-bound calculations

  • Avoid it completely, where possible (e.g. the kron_and_sum example below)
  • Blocked execution, when combined with other calculations (see the sketch after this list)
  • Maybe float32 instead of float64 is also enough
  • If this calculation is in a loop, allocate the memory only once
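
As a rough illustration of the blocked-execution point (a hypothetical sketch; the helper name is mine, not from the answer): produce the Kronecker product one band of rows at a time and consume each band immediately, so the full 3600x3600 array never has to be materialized.

def kron_and_sum_blocked(A, B, block=8):
    # Sum over kron(A, B) computed band by band; only a (block, p, n, q)
    # tile exists at any time instead of the full product.
    total = 0.0
    for i0 in range(0, A.shape[0], block):
        band = A[i0:i0 + block, None, :, None] * B[None, :, None, :]
        total += band.sum()
    return total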

I got roughly the same timings with this code as with @Paul Panzer's implementation, but with both implementations I see the same odd behaviour: with pre-allocated memory there is absolutely no speedup if the calculation is parallelized (this is expected), but without pre-allocated memory there is quite a significant speedup.

Code

import numba as nb
import numpy as np


@nb.njit(fastmath=True,parallel=True)
def kron(A,B):
    out=np.empty((A.shape[0],B.shape[0],A.shape[1],B.shape[1]),dtype=A.dtype)
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            for k in range(A.shape[1]):
                for l in range(B.shape[1]):
                    out[i,j,k,l]=A[i,k]*B[j,l]
    return out

@nb.njit(fastmath=True,parallel=False)
def kron_preallocated(A,B,out):
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            for k in range(A.shape[1]):
                for l in range(B.shape[1]):
                    out[i,j,k,l]=A[i,k]*B[j,l]
    return out

@nb.njit(fastmath=True,parallel=True)
def kron_and_sum(A,B):
    out=0.
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            for k in range(A.shape[1]):
                for l in range(B.shape[1]):
                    out+=A[i,k]*B[j,l]
    return out

Timings

#Create some data
a_float64 = np.random.random((60,60))
b_float64 = np.random.random((60,60))
a_float32 = a_float64.astype(np.float32)
b_float32 = b_float64.astype(np.float32)
out_float64=np.empty((a_float64.shape[0],b_float64.shape[0],a_float64.shape[1],b_float64.shape[1]),dtype=np.float64)
out_float32=np.empty((a_float32.shape[0],b_float32.shape[0],a_float32.shape[1],b_float32.shape[1]),dtype=np.float32)


#Reference
%timeit np.kron(a_float64,b_float64)
147 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#If you have to allocate memory for every calculation (float64)
%timeit B=kron(a_float64,b_float64).reshape(3600,3600)
17.6 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

#If you don't have to allocate memory for every calculation (float64)
%timeit B=kron_preallocated(a_float64,b_float64,out_float64).reshape(3600,3600)
8.08 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

#If you have to allocate memory for every calculation (float32)
%timeit B=kron(a_float32,b_float32).reshape(3600,3600)
9.27 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

#If you don't have to allocate memory for every calculation (float32)
%timeit B=kron_preallocated(a_float32,b_float32,out_float32).reshape(3600,3600)
3.95 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Example for a joined operation (sum of kroncker product)
#which isn't memory bottlenecked
%timeit B=kron_and_sum(a_float64,b_float64)
881 µs ± 104 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
max9111

I've improved np.kron a bit (57%) here: https://github.com/numpy/numpy/pull/21232

The idea was to get rid of the concatenate call, which was coincidentally causing a ValueError.

For what it's worth, since I stumbled upon this: @Paul Panzer seems to have a better solution using broadcasting in his answer, which I'm now trying to add to NumPy. You can keep an eye on https://github.com/numpy/numpy/issues/21257 for progress. Thanks @Paul for the idea.
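
For reference, the broadcasting trick generalizes directly to arbitrary (non-square) 2-D shapes; a minimal sketch (the name kron2d is mine, not from the PR):

def kron2d(a, b):
    # (m,1,n,1) * (1,p,1,q) broadcasts to (m,p,n,q); collapsing the
    # paired axes gives the (m*p, n*q) Kronecker product.
    m, n = a.shape
    p, q = b.shape
    return (a[:, None, :, None] * b[None, :, None, :]).reshape(m * p, n * q)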

Ganesh Kathiresan