
I have been playing around with numba and trying to implement a simple element-wise matrix multiplication. When using 'vectorize' I get the same result as the numpy multiplication, but when I'm using 'cuda.jit' they are not the same. Many of the elements are zeros. I'm providing a minimal working example below. Any help with the problem will be appreciated. I'm using numba 0.35.0 and Python 2.7.

from __future__ import division
from __future__ import print_function

import numpy as np

from numba import vectorize, cuda, jit

M = 80
N = 40
P = 40

# Set the number of threads in a block
threadsperblock = 32

# Calculate the number of thread blocks in the grid
blockspergrid = (M*N*P + (threadsperblock - 1)) // threadsperblock

@vectorize(['float32(float32,float32)'], target='cuda')
def VectorMult3d(a, b):
  return a*b

@cuda.jit('void(float32[:, :, :], float32[:, :, :], float32[:, :, :])')
def mult_gpu_3d(a, b, c):
  [x, y, z] = cuda.grid(3)
  if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]:
    c[x, y, z] = a[x, y, z] * b[x, y, z]

if __name__ == '__main__':
  A = np.random.normal(size=(M, N, P)).astype(np.float32)
  B = np.random.normal(size=(M, N, P)).astype(np.float32)

  numpy_C = A*B

  A_gpu = cuda.to_device(A)
  B_gpu = cuda.to_device(B)
  C_gpu = cuda.device_array((M,N,P), dtype=np.float32) # cuda.device_array_like(A_gpu)

  mult_gpu_3d[blockspergrid,threadsperblock](A_gpu,B_gpu,C_gpu)

  cudajit_C = C_gpu.copy_to_host()

  print('------- using cuda.jit -------')
  print('Is close?: {}'.format(np.allclose(numpy_C,cudajit_C)))
  print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,cudajit_C)), M*N*P))
  print('------- using cuda.jit -------\n')


  vectorize_C_gpu = VectorMult3d(A_gpu, B_gpu)
  vectorize_C = vectorize_C_gpu.copy_to_host()

  print('------- using vectorize -------')
  print('Is close?: {}'.format(np.allclose(numpy_C,vectorize_C)))
  print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,vectorize_C)), M*N*P))
  print('------- using vectorize -------\n')

  import numba; print("numba version: "+numba.__version__)

1 Answer


Here is how you could debug this.

Consider a smaller and simplified example with:

  • reduced array sizes, e.g. (2, 3, 1) (so you could actually print the values and be able to read them)
  • simple and deterministic contents, e.g. "all ones" (to compare across runs)
  • additional kernel arguments for debugging

from __future__ import (division, print_function)

import numpy as np
from numba import cuda

M = 2
N = 3
P = 1

threadsperblock = 1
blockspergrid = (M * N * P + (threadsperblock - 1)) // threadsperblock


@cuda.jit
def mult_gpu_3d(a, b, c, grid_ran, grid_multed):
    grid = cuda.grid(3)
    x, y, z = grid

    grid_ran[x] = 1

    if (x < c.shape[0]) and (y < c.shape[1]) and (z < c.shape[2]):
        grid_multed[x] = 1
        c[grid] = a[grid] * b[grid]


if __name__ == '__main__':
    A = np.ones((M, N, P), np.int32)
    B = np.ones((M, N, P), np.int32)

    A_gpu = cuda.to_device(A)
    B_gpu = cuda.to_device(B)
    C_gpu = cuda.to_device(np.zeros_like(A))

    # Tells whether the thread at grid index x has run
    grid_ran = cuda.to_device(np.zeros([blockspergrid], np.int32))

    # Tells whether the thread at grid index x has performed the multiplication
    grid_multed = cuda.to_device(np.zeros(blockspergrid, np.int32))

    mult_gpu_3d[blockspergrid, threadsperblock](
        A_gpu, B_gpu, C_gpu, grid_ran, grid_multed)

    print("grid_ran.shape    : ", grid_ran.shape)
    print("grid_multed.shape : ", grid_multed.shape)
    print("C_gpu.shape       : ", C_gpu.shape)

    print("grid_ran          : ", grid_ran.copy_to_host())
    print("grid_multed       : ", grid_multed.copy_to_host())

    C = C_gpu.copy_to_host()
    print("C transpose flat  : ", C.T.flatten())
    print("C                 : \n", C)

Output:

grid_ran.shape    :  (6,)
grid_multed.shape :  (6,)
C_gpu.shape       :  (2, 3, 1)
grid_ran          :  [1 1 1 1 1 1]
grid_multed       :  [1 1 0 0 0 0]
C transpose flat  :  [1 1 0 0 0 0]
C                 : 
 [[[1]
  [0]
  [0]]

 [[1]
  [0]
  [0]]]

You can see that the device grid shape does not correspond to the shape of the arrays: the grid is flat (M*N*P), while the arrays are all 3-dimensional (M, N, P). That is, the first dimension of the grid has indices in the range 0..M*N*P-1 (0..5, totaling 6 values in my example), while the first dimension of the arrays only spans 0..M-1 (0..1, totaling 2 values in my example). This mistake typically leads to out-of-bounds access, but you have protected your kernel with a conditional that filters out the offending threads:

if (x < c.shape[0])

This condition does not allow threads with x above M-1 (1 in my example) to perform the multiplication (well, sort of [1]); that is why most of the output elements are never written and you get many zeros in the resulting array.
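
To see the scale of the problem at the original sizes, here is a back-of-envelope sketch, assuming the 1D launch from the question: with a scalar launch configuration (as in the debug example above), cuda.grid(3) returns y == 0 and z == 0 for every thread, so only the M elements c[x, 0, 0] with x < M are ever written.

# Counts how many output elements the original 1D launch actually writes.
M, N, P = 80, 40, 40
threadsperblock = 32
blockspergrid = (M * N * P + threadsperblock - 1) // threadsperblock
threads_launched = blockspergrid * threadsperblock  # 128000 threads, x = 0..127999
elements_written = M                                # only c[x, 0, 0] for x = 0..M-1
print(elements_written, "of", M * N * P)            # 80 of 128000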

Possible solutions:

  • In general, you could use a multidimensional kernel grid configuration, i.e. 3-tuples for blockspergrid and threadsperblock instead of scalars [2].
  • In particular, since element-wise multiplication is a map operation and does not depend on the array shapes, you could flatten all 3 arrays to 1D, run your kernel as is on a 1D grid, then reshape the result back [3], [4]. Sketches of both approaches follow below.
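
A minimal sketch of the first option, assuming the same kernel as in the question; the block shape (8, 8, 4) is an arbitrary choice, not something prescribed by numba:

import numpy as np
from numba import cuda

@cuda.jit
def mult_gpu_3d(a, b, c):
    x, y, z = cuda.grid(3)
    if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]:
        c[x, y, z] = a[x, y, z] * b[x, y, z]

M, N, P = 80, 40, 40
A = np.random.normal(size=(M, N, P)).astype(np.float32)
B = np.random.normal(size=(M, N, P)).astype(np.float32)
C_gpu = cuda.device_array((M, N, P), dtype=np.float32)

# 3D launch configuration: enough threads in each dimension to cover the array
threadsperblock = (8, 8, 4)
blockspergrid = tuple((dim + tpb - 1) // tpb
                      for dim, tpb in zip((M, N, P), threadsperblock))

mult_gpu_3d[blockspergrid, threadsperblock](cuda.to_device(A), cuda.to_device(B), C_gpu)
print(np.allclose(A * B, C_gpu.copy_to_host()))  # expected: True

And a sketch of the second option (flattening to 1D), using a hypothetical 1D kernel mult_gpu_1d:

import numpy as np
from numba import cuda

@cuda.jit
def mult_gpu_1d(a, b, c):
    i = cuda.grid(1)
    if i < c.shape[0]:
        c[i] = a[i] * b[i]

M, N, P = 80, 40, 40
A = np.random.normal(size=(M, N, P)).astype(np.float32)
B = np.random.normal(size=(M, N, P)).astype(np.float32)

threadsperblock = 32
blockspergrid = (M * N * P + threadsperblock - 1) // threadsperblock

C_flat = cuda.device_array(M * N * P, dtype=np.float32)
mult_gpu_1d[blockspergrid, threadsperblock](
    cuda.to_device(A.ravel()), cuda.to_device(B.ravel()), C_flat)
C = C_flat.copy_to_host().reshape(M, N, P)  # reshape the flat result back to 3D
print(np.allclose(A * B, C))  # expected: True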

References:

Ivan Aksamentov - Drop
  • Thanks a lot. Your explanations are quite clear. I took the advice to use multidimensional kernel grid configuration. Something like below. `threadsperblock = (4, 4, 4); blockspergrid_x = np.int(np.ceil(M / threadsperblock[0]))` Similarly setting blockspergrid_y and blockspergrid_z and then `blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)`. Finally calling `mult_gpu_3d` with `blockspergrid` and `threadsperblock`. The references provided by you are also great!! Thanks again. – A Das Oct 11 '17 at 22:45