
I am trying to use NumbaPro's CUDA extension to multiply large matrices. What I want in the end is to multiply a matrix of size NxN by a diagonal matrix that is fed in as a 1D array (thus, `a.dot(numpy.diagflat(b))`, which I have found to be equivalent to `a * b`). However, I am getting an assertion error that provides no information.
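For reference, that equivalence can be checked with plain NumPy (a small standalone snippet, separate from the failing code below):

import numpy as np

a = np.random.sample((4, 4))
b = np.random.sample(4) + 10

# Multiplying by diagflat(b) scales column j of a by b[j],
# which is exactly what broadcasting a * b does.
print(np.allclose(a.dot(np.diagflat(b)), a * b))   # True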

I can only avoid this assertion error if I multiply two 1D arrays, but that is not what I want to do.

from numbapro import vectorize, cuda
from numba import f4,f8
import numpy as np

def generate_input(n):
    import numpy as np
    A = np.array(np.random.sample((n,n)))
    B = np.array(np.random.sample(n) + 10)
    return A, B

def product(a, b):
    return a * b

def main():
    cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)

    N = 1000

    A, B = generate_input(N)
    D = np.empty(A.shape)

    stream = cuda.stream()

    with stream.auto_synchronize():
        dA = cuda.to_device(A, stream)
        dB = cuda.to_device(B, stream)
        dD = cuda.to_device(D, stream, copy=False)
        cu_product(dA, dB, out=dD, stream=stream)
        dD.to_host(stream)

if __name__ == '__main__':
    main()

This is what my terminal spits out:

Traceback (most recent call last):
  File "cuda_vectorize.py", line 32, in <module>
    main()
  File "cuda_vectorize.py", line 28, in main
    cu_product(dA, dB, out=dD, stream=stream)
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 109, in __call__
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 191, in _arguments_requirement
AssertionError
brebs
    Can't you check `_cudadispatch.py` at line 191 to see what the assertion is exactly? – BenC Jun 18 '13 at 02:30
  • Unfortunately it only exists as a compiled version. When I decompiled it, the line numbers were different, but from what I can tell, harrism is correct that the error is being thrown because the input is not a scalar. I also want to note that the only decompiler that worked was Decompyle++, found in response to this question[1]. Additionally, the file location was "~/anaconda/lib/..." and not "/opt/anaconda1anaconda2anaconda3/..." [1]: http://stackoverflow.com/questions/8189352/decompile-python-2-7-pyc – brebs Jun 20 '13 at 17:38
  • Ah, the joy of closed source libraries... Good luck then! – BenC Jun 21 '13 at 02:46

2 Answers


The problem is you are using vectorize on a function that takes non-scalar arguments. The idea with NumbaPro's vectorize is that it takes a scalar function as input, and generates a function that applies the scalar operation in parallel to all the elements of a vector. See the NumbaPro documentation.
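For contrast, here is a minimal sketch of the kind of call vectorize is designed for, reusing the question's product function: the scalar operation is applied element-wise to two arrays of the same shape, which is why the 1D-by-1D case in the question works (host arrays are passed directly here; the explicit device-array style from the question should work as well):

from numbapro import vectorize
from numba import f4, f8
import numpy as np

def product(a, b):
    return a * b

# vectorize turns the scalar function into a ufunc; the generated ufunc
# applies it element-wise, so the inputs must broadcast to a common shape.
cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)

x = np.random.sample(1000)
y = np.random.sample(1000)
z = cu_product(x, y)   # element-wise product of two same-shaped 1D arrays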

Your function takes a matrix and a vector, which are definitely not scalars. [Edit] You can do what you want on the GPU using either NumbaPro's wrapper for cuBLAS, or by writing your own simple kernel function. Here's an example that demonstrates both. Note that you will need NumbaPro 0.12.2 or later (just released as of this edit).

from numbapro import jit, cuda
from numba import float32
import numbapro.cudalib.cublas as cublas
import numpy as np
from timeit import default_timer as timer

def generate_input(n):
    A = np.array(np.random.sample((n,n)), dtype=np.float32)
    B = np.array(np.random.sample(n), dtype=A.dtype)
    return A, B

@cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
def diagproduct(c, a, b):
    # Computes c = a * diag(b): each element of a is scaled by the diagonal
    # entry for its column, so the diagonal never needs to be expanded.
    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y
    height, width = c.shape

    # Grid-stride loops: any launch configuration covers the whole matrix.
    for y in range(startY, height, gridY):
        for x in range(startX, width, gridX):
            c[y, x] = a[y, x] * b[x]

def main():

    N = 1000

    A, B = generate_input(N)
    D = np.empty(A.shape, dtype=A.dtype)
    E = np.zeros(A.shape, dtype=A.dtype)
    F = np.empty(A.shape, dtype=A.dtype)

    start = timer()
    E = np.dot(A, np.diag(B))
    numpy_time = timer() - start

    # cuBLAS is column-major, so the row-major operands are passed in swapped
    # order (diag(B) first, then A); read back row-major, D equals A.dot(np.diag(B)).
    blas = cublas.api.Blas()

    start = timer()
    blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
    cublas_time = timer() - start

    diff = np.abs(D-E)
    print("Maximum CUBLAS error %f" % np.max(diff))

    blockdim = (32, 8)
    griddim  = (16, 16)

    start = timer()
    dA = cuda.to_device(A)
    dB = cuda.to_device(B)
    dF = cuda.to_device(F, copy=False)  # device allocation only; the kernel fills it
    diagproduct[griddim, blockdim](dF, dA, dB)
    dF.to_host()
    cuda_time = timer() - start   

    diff = np.abs(F-E)
    print("Maximum CUDA error %f" % np.max(diff))

    print("Numpy took    %f seconds" % numpy_time)
    print("CUBLAS took   %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time)) 
    print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))

if __name__ == '__main__':
    main()

The custom kernel is significantly faster because SGEMM does a full matrix-matrix multiply (O(n^3)) and requires expanding the diagonal into a full matrix. diagproduct is smarter: it does a single multiply for each matrix element (O(n^2)) and never materializes the diagonal. Here are the results on my NVIDIA Tesla K20c GPU for N=1000:

Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took    0.024535 seconds
CUBLAS took   0.010345 seconds, 2.37x speedup
CUDA JIT took 0.004857 seconds, 5.05x speedup

The timing includes all of the copies to and from the GPU, which is a significant bottleneck for small matrices. If we set N to 10,000 and run again, we get a much bigger speedup:

Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took    7.245677 seconds
CUBLAS took   1.371524 seconds, 5.28x speedup
CUDA JIT took 0.264598 seconds, 27.38x speedup

For very small matrices, however, CUBLAS SGEMM has an optimized path, so it is closer to the performance of the CUDA kernel. Here, N=100:

Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took    0.006876 seconds
CUBLAS took   0.001425 seconds, 4.83x speedup
CUDA JIT took 0.001313 seconds, 5.24x speedup
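
Since the timings above include the host-to-device and device-to-host copies, a rough way to isolate the kernel time is to move the copies outside the timed region. A minimal sketch, assuming cuda.synchronize() is available in this NumbaPro version and reusing diagproduct, griddim and blockdim from the code above:

# Copies excluded from the timed region.
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dF = cuda.to_device(F, copy=False)

start = timer()
diagproduct[griddim, blockdim](dF, dA, dB)
cuda.synchronize()   # wait for the kernel to finish before stopping the clock
kernel_time = timer() - start

dF.to_host()         # copy the result back, also outside the timed region
print("Kernel alone took %f seconds" % kernel_time)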
harrism
  • Thanks for pointing that out. Are you referring to the functions on this page under cuBLAS? I'm not seeing any functions that are used for simple matrix multiplication. Am I missing something? http://docs.continuum.io/numbapro/cudalib.html – brebs Jun 24 '13 at 21:08
  • One approach is CUBLAS SGEMM, but matrix-matrix multiply is overkill when you know one matrix is diagonal. Another is to write a "CUDA Python" kernel -- this should be faster because you can specialize it for your case. I ran into bugs writing an example, though, so I'm waiting for help from Continuum before I update my answer. Thanks for accepting. – harrism Jun 28 '13 at 07:06
  • OK, I updated my answer with a working example. Thanks to Continuum Analytics for quickly fixing the bugs I discovered in the process! – harrism Jul 09 '13 at 02:54
  • Thank you so much! This is incredibly useful in not only making it faster but also for learning how to apply cuda to my functions. Unfortunately, I got a CUDA_ERROR_OUT_OF_MEMORY after N=8000. My goal is to run this on matrices with N=15,000. Do you know how much memory this would take/how much do you have to run N=10,000? – brebs Jul 09 '13 at 22:10
  • No worries. What GPU do you have? 10000*10000 floats is 400MB. Each of the two methods (CUBLAS and custom kernel) would generate 3 arrays of that size. If you just use one method, you would need 1.2GB. If you use both methods (the code above, unchanged), and if NumbaPro doesn't reclaim the memory between them, it would be double that. – harrism Jul 09 '13 at 22:45
  • I have a GeForce GT 650M 1024 MB which looks like it won't be enough. Similarly my other computer is a GeForce GTX 460 v2 (1024 MB). Are there any tricks that you know for increasing the GPU memory? If not I may just throw in a try/catch that will use a similar approach with just numba's cpu parallel processing for bigger matrices. – brebs Jul 10 '13 at 15:11
  • The only trick to increase GPU memory is to buy a new GPU. :) Since this is a pretty trivial computation, you could easily partition it into chunks and process the chunks iteratively. – harrism Jul 26 '13 at 05:09
  • Actually, my calculation was wrong for the diagproduct kernel: it does not promote the diagonal vector to a 2D matrix, so you would only have 2*400MB + 40KB of data, which should fit in your 1GB GPU. – harrism Jul 26 '13 at 05:10

Just to follow up on all these considerations: I also wanted to implement some matrix computations on CUDA, but then heard about the numpy.einsum function. It turns out that einsum is incredibly fast. For a case like this, the code is below, but it can be applied to many other types of computation.

G = np.einsum('ij,j -> ij',A, B)
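
As a quick sanity check (not part of the original answer), the einsum result can be compared against the broadcasting form from the question:

# 'ij,j->ij' scales column j of A by B[j], the same result as A * B
# (and as A.dot(np.diagflat(B))).
assert np.allclose(G, A * B)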

In terms of speed, here are the results for N = 10000:

Numpy took    8.387756 seconds
CUDA JIT took 0.218394 seconds, 38.41x speedup
EINSUM took 0.131751 seconds, 63.66x speedup
Remi