A simple question on CUDA threads in numba

Question

It's a very beginner oriented question. I have been working on regular python threads and C threads and learnt that I can create threads that run a specific function and they use semaphores and other sync primitives.

But, I am currently trying to learn Cuda using the numba's python based compiler. I have written the following code.

from numba import cuda
import numpy as np

@cuda.jit
def image_saturate(data):
    pos_x, pos_y = cuda.grid(2)

    if (pos_x, pos_y) <= data.shape:
        data[pos_x, pos_y] = 1

if __name__ == "__main__":
    image_quality = (128, 72)
    image = np.zeros(image_quality)

    thread_size = 32
    block_size = image_quality

    image_saturate[block_size, thread_size](image)

    print(image)

But, the thing that I feel weird is, I can change thread_size as I want and the result is the same - meaning the output is all ones as expected. But, the moment I change the the block_size weird things start happening and only that size of the original matrix gets filled with ones - so it's only a partial filling.

Form this I understand that the cuda.grid(2) returns the block coordinates. But, shouldn't I be able to get the actual thread coordinates and the block coordinates as well?

I am terribly new and I can't find any resources to learn online. It would be great if anyone can answer my question and also provide and resources for learning Cuda using Numba.

I think wouldn't matter in this case in my knowledge as integer tuples are element-wise compared in python iirc, meaning what I intended will still work as I just want both the coordinates to be less than equal to the max. Am not sure what you are trying to point out here though. — Rohan Asokan, Feb 20 '21 at 17:26
why would it be acceptable for the coordinates to be equal to the max? — Robert Crovella, Feb 20 '21 at 19:00

score 6 · Accepted Answer · answered Feb 20 '21 at 21:09

Form this I understand that the cuda.grid(2) returns the block coordinates.

That's not the case. That statement returns a fully-qualified 2D thread index. The range of the returned values will extend to the product of the block coordinates limit and the thread coordinates limit.

In CUDA, the grid dimension parameter for a kernel launch (the one you are calling block_size) specifies the grid dimension in terms of number of blocks (in each direction). The block dimension parameter for a kernel launch (the one you are calling thread_size) specifies the size of each block in the grid, in terms of the number of threads per block.

Therefore the total number of threads launched is equal to the product of the grid dimension(s) and the the block dimension(s). The total would be the product of all those things, in all dimensions. The total per dimension would be the product of the grid dimension in that direction and the block dimension in that direction.

So you have a questionable design choice, in that you have an image size and you are setting the grid dimension equal to the image size. This could only be sensible if you had only 1 thread per block. As you will discover by looking at any proper numba CUDA code (such as the one here) a typical approach is to divide the total desired dimension (in this case, the image size or dimension(s)), by the number of threads per block, to get the grid dimension.

When we do so, the cuda.grid() statement in your kernel code will return a tuple that has sensible ranges. In your case, it would return tuples to threads that correctly go from 0..127 in x, and 0..71 in y. The problem you have at the moment is that the cuda.grid() statement can return tuples that range from 0..((128*32)-1) in x, and that is unnecessary.

Of course, the goal of your if statement is to prevent out-of-bounds indexing, however the test of <= does not look right to me. This is the classical computer science off-by-1 error. Threads whose indices happen to match the limit returned by shape should be excluded.

But, the moment I change the the block_size weird things start happening and only that size of the original matrix gets filled with ones - so it's only a partial filling.

It's really not clear what your expectations are here. Your kernel design is such that each thread populates (at most) one output point. Therefore sensible grid sizing is to match the grid (the total threads in x, and the total threads in y) to the image dimensions. If you follow the above recommendations for grid sizing calculations, and then set your grid size to something less than your image size, I would expect that portions of your output image would not be populated. Don't do that. Or if you must do that, employ a grid-stride loop kernel design.

Having said all that, the following is how I would rewrite your code:

from numba import cuda
import numpy as np

@cuda.jit
def image_saturate(data):
    pos_x, pos_y = cuda.grid(2)

    if pos_x  < data.shape[0] and pos_y < data.shape[1]:
        data[pos_x, pos_y] = 1

if __name__ == "__main__":
    image_x = 128
    image_y = 72
    image_quality = (image_x, image_y)
    image = np.zeros(image_quality)
    thread_x = 32
    thread_y = 1
    thread_size = (thread_x, thread_y)
    block_size = ((image_x//thread_x) + 1, (image_y//thread_y) + 1) # "lazy" round-up

    image_saturate[block_size, thread_size](image)

    print(image)

it appears to run correctly for me. If you now suggest that what you want to do is to arbitrary modify the block_size variable e.g.:

block_size = (5,5)

and make no other changes, and expect the output image to be fully populated, I would say that is not a sensible expectation. I have no idea how that could be sensible, so I will just say that CUDA doesn't work that way. If you wish to "decouple" the data size from the grid size, the canonical way to do it is the grid stride loop as already discussed.

I've also removed the tuple comparison. I don't think it is really germane here. If you still want to use the tuple comparison, that should work exactly as you would expect based on python. There isn't anything CUDA specific about it.

Thank for your very clear answer. After you've said all of this I think the error in my initial code was that I divided even the other co-ordinate by the block_size. I had only one dimension set for block_size, which meant both the dimensions of grid_size were getting scaled and hence the partial filling. Is there anyplace you had learnt CUDA from? — Rohan Asokan, Feb 21 '21 at 05:34
Can you maybe share some resources on where you learnt it from? I want to know about commonly used architectures and optimizations. — Rohan Asokan, Feb 21 '21 at 05:35
One resource is the [cuda programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html). Also [here](https://www.olcf.ornl.gov/cuda-training-series/) is a CUDA training series. — Robert Crovella, Feb 21 '21 at 06:37

RPeters · Answer 2 · 2021-03-14T08:55:30.000

I'm late to the game, but I thought an explanation with more visual components might help in lending some clarity to the existing answer. It is surprisingly hard to find illustrative answers for how thread indexing works in cuda. Though the concepts are very easy to hold in your head once you come to understand them, there can be many points for confusion along the way to getting that understanding, hopefully this helps.

Pardon the lack of discussion outside of the comments of the script, but this seems like a case where the context of the code will help avoid miscommunication and help to demonstrate the indexing concepts discussed by others. So I'll leave you with the script, the comments therein, and the compressed output.

To produce uncompressed output, see the first few comments in the thread-main block.

Example script:

from numba import cuda
import numpy as np

@cuda.jit
def image_saturate(data):
    grid_pos_x, grid_pos_y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    if (grid_pos_x, grid_pos_y) < data.shape:
        # note that the cuda device array retains the stride order of the original numpy array,
        # so the data is in row-major order (numpy default) and the first index (where we are
        # using grid_pos_x) doesn't actually map to the horizontal axis (aka the typical x-axis),
        # but rather maps to vertical axis (the typical y-axis).
        #
        # And, as you would then expecet, the second axis (where we are using grid_pos_y) maps
        # to the horizontal axis of the array.
        #
        # What you should take away from this observation is that the x,y labels of cuda elements
        # have no explicit connection to the array's memory layout.
        #
        # Therefore, it is up to you, the programmer to understand the memory layout for your
        # array (whether it's the C-like row-major, or the Fortran-like column-major), and how
        # you should map the (x,y) thread IDs onto the coordinates of your array.
        data[grid_pos_x, grid_pos_y,0,0] = tx
        data[grid_pos_x, grid_pos_y,0,1] = ty
        data[grid_pos_x, grid_pos_y,0,2] = tx + ty
        data[grid_pos_x, grid_pos_y,1,0] = bx
        data[grid_pos_x, grid_pos_y,1,1] = by
        data[grid_pos_x, grid_pos_y,1,2] = bx + by
        data[grid_pos_x, grid_pos_y,2,0] = grid_pos_x
        data[grid_pos_x, grid_pos_y,2,1] = grid_pos_y
        data[grid_pos_x, grid_pos_y,2,2] = grid_pos_x + grid_pos_y

if __name__ == "__main__":
    # uncomment the following line and remove the line after it
    # if you run this code to get more readable results
    # np.set_printoptions(linewidth=500)
    np.set_printoptions(linewidth=500,threshold=3) # compressed output for use on stack overflow
    # image_quality = (128, 72)
    # we are shrinking image_quality to be 23x21 to make the printout easier to read,
    # and intentionally step away from the alignment of threads per block being a
    # multiplicative factor to the image shape. THIS IS BAD for practical applications,
    # it's just helpful for this illustration.
    image_quality = (23,21)
    image = np.zeros(image_quality+(3,3),int)-1

    ## thread_size = 32 # commented to show where the original variable was used
    # below, we rename the variable to be more semantically clear
    #       Note: to define the desired thread-count in multiple axis, threads_per_block
    #             would have to become a tuple, E.G.:
    #               # defines 32 threads in thread-block's x axis, and 16 in the y
    #               threads_per_block = 32,16
    threads_per_block = 32
    #threads_per_block = 32  results in an implicit 1 for the y-axis, and implicit 1 in the z.
    ### Thread blocks are always 3d
    # this is also true for the thread grid the device will create for your kernel

    ## block_size = image_quality
    # renaming block_size to semantically more accurate variable name
    #   Note: As with the threads_per_block, any axis we don't explicitly specify a size
    #         for will be given the default value of 1. So, because image_quality gives 2 values,
    #         for x/y respectively, the z axis will implicitly be given size of 1.
    block_count = image_quality
    # REMEMBER: The thread/block/grid dimensions we are passing to the function compiler
    #           are NOT used to infer details about the arguments being passed to the
    #           compiled function (our image array)
    #           It is up to us to write code that appropriately utilizes the arrangement
    #           of the thread blocks the device will build for us once inside the kernel.
    #           SEE THE COMMENT INSIDE THE image_saturate function above.
    image_saturate[block_count, threads_per_block](image)
    print(f"{block_count=}; {threads_per_block=}; {image.shape=}")
    print("thread id within block; x")
    print(image[:,:,0,0])
    print("\nthread id within block; y"
          "\n-- NOTE 1 regarding all zeros: see comment at the end of printout")
    print(image[:,:,0,1])
    print("\nsum of x,y thread id within block")
    print(image[:,:,0,2])
    print("\nblock id within grid; x"
          "\n-- NOTE 2 also regarding all zeros: see second comment at the eod of printout")
    print(image[:,:,1,0])
    print("\nblock id within grid; y")
    print(image[:,:,1,1])
    print("\nsum of x,y block id within grid")
    print(image[:,:,1,2])
    print("\nthread unique global x id within full grid; x")
    print(image[:,:,2,0])
    print("\nthread unique global y id within full grid; y")
    print(image[:,:,2,1])
    print("\nsum of thread's unique global x,y ids")
    print(image[:,:,2,2])
    print(f"{'End of 32 threads_per_block output':-<70}")
    threads_per_block = 16
    # reset the values of image so we can be sure to see if any elements 
    # of the image go unassigned
    image *= 0
    image -= 1
    # block_count = image_quality # if you wanted to try
    print(f"\n\n{block_count=}; {threads_per_block=}; {image.shape=}")
    image_saturate[block_count, threads_per_block](image)
    print("thread id within block; x")
    print(image[:,:,0,0])
    print("\nthread id within block; y "
          "\n-- again, see NOTE 1")
    print(image[:,:,0,1])
    print("\nsum of x,y thread id within block")
    print(image[:,:,0,2])
    print("\nblock id within grid; x "
          "\n-- notice that unlike when we had 32 thread_per_block, not all 0")
    print(image[:,:,1,0])
    print("\nblock id within grid; y")
    print(image[:,:,1,1])
    print("\nsum of x,y block id within grid")
    print(image[:,:,1,2])
    print("\nthread unique global x id within full grid; x")
    print(image[:,:,2,0])
    print("\nthread unique global y id within full grid; y")
    print(image[:,:,2,1])
    print("\nsum of thread's unique global x,y ids")
    print(image[:,:,2,2])
    from textwrap import dedent
    print(dedent("""
    NOTE 1:
    The thread IDs recorded for 'thread id within block; y'
    are all zero for both versions of `threads_per_block` because we never
    specify the number of threads per block that should be created for
    the 'y' axis.

    So, the compiler defaults to creating only a single thread along those
    undefined axis of each block. For that reason, we see that the only
    threadID.y value stored is 0 for all i,j elements of the array.

    NOTE 2:
    **Note 2 mostly pertains to the case where threads_per_block == 32**
    The block IDs recorded for 'block id within grid; x' are all zero for
    both versions of `threads_per_block` results from similar reasons
    mentioned in NOTE 1.

    The size of a block, in any axis, is determined by the specified number
    of threads for that axis. In this example script, we define threads_per_block
    to have an explicit 32 threads in the x axis, leaving the compiler to give an
    implicit 1 for both the y and z axis. We then tell the compiler to create 23 blocks
    in the x-axis, and 21 blocks in the y; resulting in:
    \t* A kernel where the device creates a grid of blocks, 23:21:1 for 483 blocks
    \t\t* (x:y:z -> 23:21:1)
    \t* Where each block has 32 threads
    \t\t* (x:y:z -> 32:1:1)
    \t* And our image has height:width of 23:21 for 483 'pixels' in each
    \t  contrived layer of the image.

    As it is hopefully being made clear now, you should see that because each
    block has 32 threads on its x-axis, and we have only 23 elements on the corresponding
    axis in the image, only 1 of the 23 blocks the device created along the grid's x-axis
    will be used. Do note that the overhead of creating those unused blocks is a gross waste
    of GPU processor time and could potentially reduce the available resources to the block
    that does get used."""))

The output:

block_count=(23, 21); threads_per_block=32; image.shape=(23, 21, 3, 3)
thread id within block; x
[[ 0  0  0 ...  0  0  0]
 [ 1  1  1 ...  1  1  1]
 [ 2  2  2 ...  2  2  2]
 ...
 [20 20 20 ... 20 20 20]
 [21 21 21 ... 21 21 21]
 [22 22 22 ... 22 22 22]]

thread id within block; y
-- NOTE 1 regarding all zeros: see comment at the end of printout
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

sum of x,y thread id within block
[[ 0  0  0 ...  0  0  0]
 [ 1  1  1 ...  1  1  1]
 [ 2  2  2 ...  2  2  2]
 ...
 [20 20 20 ... 20 20 20]
 [21 21 21 ... 21 21 21]
 [22 22 22 ... 22 22 22]]

block id within grid; x
-- NOTE 2 also regarding all zeros: see second comment at the eod of printout
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

block id within grid; y
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

sum of x,y block id within grid
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

thread unique global x id within full grid; x
[[ 0  0  0 ...  0  0  0]
 [ 1  1  1 ...  1  1  1]
 [ 2  2  2 ...  2  2  2]
 ...
 [20 20 20 ... 20 20 20]
 [21 21 21 ... 21 21 21]
 [22 22 22 ... 22 22 22]]

thread unique global y id within full grid; y
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

sum of thread's unique global x,y ids
[[ 0  1  2 ... 18 19 20]
 [ 1  2  3 ... 19 20 21]
 [ 2  3  4 ... 20 21 22]
 ...
 [20 21 22 ... 38 39 40]
 [21 22 23 ... 39 40 41]
 [22 23 24 ... 40 41 42]]
End of 32 threads_per_block output------------------------------------


block_count=(23, 21); threads_per_block=16; image.shape=(23, 21, 3, 3)
thread id within block; x
[[0 0 0 ... 0 0 0]
 [1 1 1 ... 1 1 1]
 [2 2 2 ... 2 2 2]
 ...
 [4 4 4 ... 4 4 4]
 [5 5 5 ... 5 5 5]
 [6 6 6 ... 6 6 6]]

thread id within block; y 
-- again, see NOTE 1
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

sum of x,y thread id within block
[[0 0 0 ... 0 0 0]
 [1 1 1 ... 1 1 1]
 [2 2 2 ... 2 2 2]
 ...
 [4 4 4 ... 4 4 4]
 [5 5 5 ... 5 5 5]
 [6 6 6 ... 6 6 6]]

block id within grid; x 
-- notice that unlike when we had 32 thread_per_block, not all 0
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]]

block id within grid; y
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

sum of x,y block id within grid
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 1  2  3 ... 19 20 21]
 [ 1  2  3 ... 19 20 21]
 [ 1  2  3 ... 19 20 21]]

thread unique global x id within full grid; x
[[ 0  0  0 ...  0  0  0]
 [ 1  1  1 ...  1  1  1]
 [ 2  2  2 ...  2  2  2]
 ...
 [20 20 20 ... 20 20 20]
 [21 21 21 ... 21 21 21]
 [22 22 22 ... 22 22 22]]

thread unique global y id within full grid; y
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

sum of thread's unique global x,y ids
[[ 0  1  2 ... 18 19 20]
 [ 1  2  3 ... 19 20 21]
 [ 2  3  4 ... 20 21 22]
 ...
 [20 21 22 ... 38 39 40]
 [21 22 23 ... 39 40 41]
 [22 23 24 ... 40 41 42]]

NOTE 1:
The thread IDs recorded for 'thread id within block; y'
are all zero for both versions of `threads_per_block` because we never
specify the number of threads per block that should be created for
the 'y' axis.

So, the compiler defaults to creating only a single thread along those
undefined axis of each block. For that reason, we see that the only
threadID.y value stored is 0 for all i,j elements of the array.

NOTE 2:
**Note 2 mostly pertains to the case where threads_per_block == 32 
  is greater than the number of elements in the corresponding axis of the image**
The block.x IDs recorded for 'block id within grid; x' are all zero for
the `32` version of `threads_per_block` the relative difference in size between
the specified number of threads per block and the number of elements in the
image along the corresponding axis.

The size of a block, in any axis, is determined by the specified number
of threads for that axis. In this example script, we define threads_per_block
to have an explicit 32 threads in the x axis, leaving the compiler to give an
implicit 1 for both the y and z axis. We then tell the compiler to create 23 blocks
in the x-axis, and 21 blocks in the y; resulting in:
    * A kernel where the device creates a grid of blocks, 23:21:1 for 483 blocks
        * (x:y:z -> 23:21:1)
    * Where each block has 32 threads
        * (x:y:z -> 32:1:1)
    * And our image has height:width of 23:21 for 483 'pixels' in each
      contrived layer of the image.

As it is hopefully being made clear now, you should see that because each
block has 32 threads on its x-axis, and we have only 23 elements on the corresponding
axis in the image, only 1 of the 23 blocks the device created along the grid's x-axis
will be used. Do note that the overhead of creating those unused blocks is a gross waste
of GPU processor time and could potentially reduce the available resources to the block
that does get used.

A simple question on CUDA threads in numba

2 Answers2

Example script:

The output: