Hi, I'm trying to understand each step of a CUDA kernel. It would be nice to get all the grid indexes that are occupied by data. My code adds two vectors and is written in Python with Numba.

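import numpy as np
from numba import cuda
n = 10
x = np.arange(n).astype(np.float32)
y = x + 1
out = np.zeros_like(x)  # output buffer, same shape and dtype as x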

Set up the number of threads and blocks in the grid:

threads_per_block = 8
blocks_per_grid = 2

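Kernel:

@cuda.jit
def kernel_manual_add(x, y, out):
    threads_number = cuda.blockDim.x   # threads per block
    block_number = cuda.gridDim.x      # blocks in the grid

    thread_index = cuda.threadIdx.x    # index of this thread within its block
    block_index = cuda.blockIdx.x      # index of this block within the grid

    grid_index = thread_index + block_index * threads_number  # global thread index
    threads_range = threads_number * block_number             # total threads launched

    # Grid-stride loop: threads with grid_index >= x.shape[0] get an empty range.
    for i in range(grid_index, x.shape[0], threads_range):
        out[i] = x[i] + y[i]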

Launch the kernel:

kernel_manual_add[blocks_per_grid, threads_per_block](x, y, out)

When I print grid_index, I get all 2 * 8 = 16 thread indexes.

How can I get only the grid indexes (10 of them) that are actually used to compute data?

talonmies
  • To repeat my last comment on your previous question, there is no automatic relationship between input dimensions and block and thread dimensions or the number of threads. There are no "grid indexes (10 of them) that are used to compute data". You are launching 16 threads. All of them run. – talonmies May 13 '18 at 19:33
  • But which of them compute data, and which ones run empty? –  May 13 '18 at 20:16
  • In the kernel in your question, all of them compute. Which will lead to a runtime error due to out of bounds memory access because you have more threads than data – talonmies May 13 '18 at 20:30
  • I don't get any error, or the error is hidden by the Numba implementation. –  May 13 '18 at 20:57
  • I missed that you have bounds checking in your kernel. It is just about the only thing which you appear to have gotten correct. – talonmies May 14 '18 at 10:20

1 Answer

The canonical way to write your kernel would be something like this:

@cuda.jit
def kernel_manual_add(x, y, out):

    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = x[i] + y[i]
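
To see which threads pass the bounds check, the index arithmetic can be reproduced on the host. This is only a rough sketch in plain Python, not something Numba provides: the helper `grid_indexes` is a hypothetical name, and it assumes a 1D launch where the global index is `threadIdx.x + blockIdx.x * blockDim.x`, which is what `cuda.grid(1)` returns. The launch parameters are taken from the question.

```python
def grid_indexes(blocks_per_grid, threads_per_block):
    """Yield (block_index, thread_index, grid_index) for a 1D launch.

    Hypothetical host-side helper: it mimics the index arithmetic of
    cuda.grid(1) without running anything on the GPU.
    """
    for block_index in range(blocks_per_grid):
        for thread_index in range(threads_per_block):
            grid_index = thread_index + block_index * threads_per_block
            yield block_index, thread_index, grid_index


n = 10                 # number of elements, as in the question
threads_per_block = 8
blocks_per_grid = 2

active = []  # grid indexes that pass `i < n` and write one element
idle = []    # grid indexes that fail the check and do nothing

for block_index, thread_index, i in grid_indexes(blocks_per_grid, threads_per_block):
    if i < n:
        active.append(i)
    else:
        idle.append(i)

print("active:", active)
print("idle:", idle)
```

All 16 threads still run on the device; the ones collected in `idle` simply fail the `i < x.shape[0]` test and return without touching memory.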

With this kernel, you must launch at least as many threads as there are elements in the input arrays. There is no magic here: you need to calculate the grid and block dimensions manually before calling the kernel. See here and here for suggestions.
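
One common way to do that calculation is a ceiling division of the element count by the block size. The following is a minimal sketch under that assumption; `blocks_needed` is a hypothetical helper name, and the block size of 8 is just the value from the question.

```python
def blocks_needed(n, threads_per_block):
    """Smallest block count such that blocks * threads_per_block >= n.

    Hypothetical helper: integer ceiling division, no GPU required.
    """
    return (n + threads_per_block - 1) // threads_per_block


n = 10
threads_per_block = 8
blocks_per_grid = blocks_needed(n, threads_per_block)

print(blocks_per_grid, "blocks of", threads_per_block, "threads cover", n, "elements")

# The kernel would then be launched as in the question:
# kernel_manual_add[blocks_per_grid, threads_per_block](x, y, out)
```

Because the last block is usually only partly filled, the `if i < x.shape[0]` check in the kernel is still needed.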

talonmies