I'm late to the game, but I thought an explanation with a more visual component might add some clarity to the existing answers. It is surprisingly hard to find illustrative explanations of how thread indexing works in CUDA. The concepts are easy to hold in your head once you understand them, but there are many potential points of confusion on the way to that understanding; hopefully this helps.
Pardon the lack of discussion outside of the script's comments, but this seems like a case where the context of the code helps avoid miscommunication and demonstrates the indexing concepts discussed by others. So I'll leave you with the script, the comments within it, and the compressed output.
To produce uncompressed output, see the first few comments in the `if __name__ == "__main__":` block.
Example script:
from numba import cuda
import numpy as np


@cuda.jit
def image_saturate(data):
    grid_pos_x, grid_pos_y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    if grid_pos_x < data.shape[0] and grid_pos_y < data.shape[1]:
        # note that the CUDA device array retains the stride order of the original numpy array,
        # so the data is in row-major order (numpy default) and the first index (where we are
        # using grid_pos_x) doesn't actually map to the horizontal axis (aka the typical x-axis),
        # but rather maps to the vertical axis (the typical y-axis).
        #
        # And, as you would then expect, the second axis (where we are using grid_pos_y) maps
        # to the horizontal axis of the array.
        #
        # What you should take away from this observation is that the x,y labels of CUDA elements
        # have no explicit connection to the array's memory layout.
        #
        # Therefore, it is up to you, the programmer, to understand the memory layout of your
        # array (whether it's the C-like row-major, or the Fortran-like column-major), and how
        # you should map the (x,y) thread IDs onto the coordinates of your array.
        data[grid_pos_x, grid_pos_y, 0, 0] = tx
        data[grid_pos_x, grid_pos_y, 0, 1] = ty
        data[grid_pos_x, grid_pos_y, 0, 2] = tx + ty
        data[grid_pos_x, grid_pos_y, 1, 0] = bx
        data[grid_pos_x, grid_pos_y, 1, 1] = by
        data[grid_pos_x, grid_pos_y, 1, 2] = bx + by
        data[grid_pos_x, grid_pos_y, 2, 0] = grid_pos_x
        data[grid_pos_x, grid_pos_y, 2, 1] = grid_pos_y
        data[grid_pos_x, grid_pos_y, 2, 2] = grid_pos_x + grid_pos_y
if __name__ == "__main__":
    # uncomment the following line and remove the line after it
    # if you run this code, to get more readable results
    # np.set_printoptions(linewidth=500)
    np.set_printoptions(linewidth=500, threshold=3)  # compressed output for use on stack overflow

    # image_quality = (128, 72)
    # we are shrinking image_quality to 23x21 to make the printout easier to read,
    # and we intentionally step away from having the threads-per-block count divide
    # the image shape evenly. THIS IS BAD for practical applications;
    # it's just helpful for this illustration.
    image_quality = (23, 21)
    image = np.zeros(image_quality + (3, 3), int) - 1

    ## thread_size = 32  # commented to show where the original variable was used
    # below, we rename the variable to be more semantically clear
    # Note: to define the desired thread count in multiple axes, threads_per_block
    #       would have to become a tuple, e.g.:
    #           # defines 32 threads in the thread-block's x axis, and 16 in the y
    #           threads_per_block = 32, 16
    threads_per_block = 32
    # threads_per_block = 32 results in an implicit 1 for the y axis, and an implicit 1 for the z.
    ### Thread blocks are always 3d
    # this is also true for the thread grid the device will create for your kernel

    ## block_size = image_quality
    # renaming block_size to a semantically more accurate variable name
    # Note: As with threads_per_block, any axis we don't explicitly specify a size for
    #       will be given the default value of 1. So, because image_quality gives 2 values,
    #       for x/y respectively, the z axis will implicitly be given a size of 1.
    block_count = image_quality

    # REMEMBER: The thread/block/grid dimensions we are passing to the kernel launch
    #           are NOT used to infer details about the arguments being passed to the
    #           compiled function (our image array).
    #           It is up to us to write code that appropriately utilizes the arrangement
    #           of the thread blocks the device will build for us once inside the kernel.
    #           SEE THE COMMENT INSIDE THE image_saturate function above.
    image_saturate[block_count, threads_per_block](image)
print(f"{block_count=}; {threads_per_block=}; {image.shape=}")
print("thread id within block; x")
print(image[:,:,0,0])
print("\nthread id within block; y"
"\n-- NOTE 1 regarding all zeros: see comment at the end of printout")
print(image[:,:,0,1])
print("\nsum of x,y thread id within block")
print(image[:,:,0,2])
print("\nblock id within grid; x"
"\n-- NOTE 2 also regarding all zeros: see second comment at the eod of printout")
print(image[:,:,1,0])
print("\nblock id within grid; y")
print(image[:,:,1,1])
print("\nsum of x,y block id within grid")
print(image[:,:,1,2])
print("\nthread unique global x id within full grid; x")
print(image[:,:,2,0])
print("\nthread unique global y id within full grid; y")
print(image[:,:,2,1])
print("\nsum of thread's unique global x,y ids")
print(image[:,:,2,2])
print(f"{'End of 32 threads_per_block output':-<70}")
threads_per_block = 16
# reset the values of image so we can be sure to see if any elements
# of the image go unassigned
image *= 0
image -= 1
# block_count = image_quality # if you wanted to try
print(f"\n\n{block_count=}; {threads_per_block=}; {image.shape=}")
image_saturate[block_count, threads_per_block](image)
print("thread id within block; x")
print(image[:,:,0,0])
print("\nthread id within block; y "
"\n-- again, see NOTE 1")
print(image[:,:,0,1])
print("\nsum of x,y thread id within block")
print(image[:,:,0,2])
print("\nblock id within grid; x "
"\n-- notice that unlike when we had 32 thread_per_block, not all 0")
print(image[:,:,1,0])
print("\nblock id within grid; y")
print(image[:,:,1,1])
print("\nsum of x,y block id within grid")
print(image[:,:,1,2])
print("\nthread unique global x id within full grid; x")
print(image[:,:,2,0])
print("\nthread unique global y id within full grid; y")
print(image[:,:,2,1])
print("\nsum of thread's unique global x,y ids")
print(image[:,:,2,2])
    from textwrap import dedent
    print(dedent("""
        NOTE 1:
        The thread IDs recorded for 'thread id within block; y'
        are all zero for both versions of `threads_per_block` because we never
        specify the number of threads per block that should be created for
        the y axis.
        So, the compiler defaults to creating a single thread along each
        unspecified axis of every block. For that reason, the only
        threadIdx.y value stored is 0 for all i,j elements of the array.

        NOTE 2:
        **Note 2 mostly pertains to the case where threads_per_block (32)
        is greater than the number of elements in the corresponding axis of the image**
        The block IDs recorded for 'block id within grid; x' are all zero for
        the 32-thread version of `threads_per_block` because of the relative sizes of
        the specified number of threads per block and the number of elements the
        image has along the corresponding axis.
        The size of a block, in any axis, is determined by the specified number
        of threads for that axis. In this example script, we define threads_per_block
        to have an explicit 32 threads in the x axis, leaving the compiler to give an
        implicit 1 for both the y and z axes. We then tell the compiler to create 23 blocks
        in the x axis, and 21 blocks in the y; resulting in:
        \t* A kernel where the device creates a grid of blocks, 23:21:1 for 483 blocks
        \t\t* (x:y:z -> 23:21:1)
        \t* Where each block has 32 threads
        \t\t* (x:y:z -> 32:1:1)
        \t* And our image has height:width of 23:21 for 483 'pixels' in each
        \t  contrived layer of the image.
        As is hopefully clear by now, because each block has 32 threads on its x axis,
        and we have only 23 elements along the corresponding axis of the image, only 1
        of the 23 blocks the device created along the grid's x axis will ever write to
        the image. Do note that the overhead of creating those unused blocks is a waste
        of GPU processor time and could potentially reduce the resources available to
        the block that does get used."""))
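One more illustration before the output. The comment block inside the kernel points out that the x/y thread labels have no inherent connection to the array's memory layout. If you would rather use the common convention where the CUDA x index walks the columns (the fastest-varying axis of a row-major array) and the y index walks the rows, a minimal sketch of that convention could look like the following. Everything here other than the Numba API itself (`cuda.grid`, `cuda.jit`, and the `kernel[blocks, threads](args)` launch syntax) is made up purely for the illustration:

from numba import cuda
import numpy as np


@cuda.jit
def fill_rows(arr):
    # cuda.grid(2) returns (x, y); here we choose x -> column and y -> row, so that
    # neighboring threads in x touch contiguous memory of the row-major array
    col, row = cuda.grid(2)
    if row < arr.shape[0] and col < arr.shape[1]:  # bounds-check each axis separately
        arr[row, col] = row


if __name__ == "__main__":
    arr = np.zeros((23, 21), dtype=np.int64)
    threads_per_block = (16, 16)          # 16 threads in x and 16 in y per block
    blocks_x = (arr.shape[1] + 15) // 16  # x covers the 21 columns -> 2 blocks
    blocks_y = (arr.shape[0] + 15) // 16  # y covers the 23 rows    -> 2 blocks
    fill_rows[(blocks_x, blocks_y), threads_per_block](arr)
    print(arr)  # every element of row i holds the value i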
The output of the example script above:
block_count=(23, 21); threads_per_block=32; image.shape=(23, 21, 3, 3)
thread id within block; x
[[ 0 0 0 ... 0 0 0]
[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
...
[20 20 20 ... 20 20 20]
[21 21 21 ... 21 21 21]
[22 22 22 ... 22 22 22]]
thread id within block; y
-- NOTE 1 regarding all zeros: see comment at the end of printout
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
sum of x,y thread id within block
[[ 0 0 0 ... 0 0 0]
[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
...
[20 20 20 ... 20 20 20]
[21 21 21 ... 21 21 21]
[22 22 22 ... 22 22 22]]
block id within grid; x
-- NOTE 2 also regarding all zeros: see second comment at the end of printout
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
block id within grid; y
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
sum of x,y block id within grid
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
thread unique global x id within full grid; x
[[ 0 0 0 ... 0 0 0]
[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
...
[20 20 20 ... 20 20 20]
[21 21 21 ... 21 21 21]
[22 22 22 ... 22 22 22]]
thread unique global y id within full grid; y
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
sum of thread's unique global x,y ids
[[ 0 1 2 ... 18 19 20]
[ 1 2 3 ... 19 20 21]
[ 2 3 4 ... 20 21 22]
...
[20 21 22 ... 38 39 40]
[21 22 23 ... 39 40 41]
[22 23 24 ... 40 41 42]]
End of 32 threads_per_block output------------------------------------
block_count=(23, 21); threads_per_block=16; image.shape=(23, 21, 3, 3)
thread id within block; x
[[0 0 0 ... 0 0 0]
[1 1 1 ... 1 1 1]
[2 2 2 ... 2 2 2]
...
[4 4 4 ... 4 4 4]
[5 5 5 ... 5 5 5]
[6 6 6 ... 6 6 6]]
thread id within block; y
-- again, see NOTE 1
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
sum of x,y thread id within block
[[0 0 0 ... 0 0 0]
[1 1 1 ... 1 1 1]
[2 2 2 ... 2 2 2]
...
[4 4 4 ... 4 4 4]
[5 5 5 ... 5 5 5]
[6 6 6 ... 6 6 6]]
block id within grid; x
-- notice that unlike when we had 32 threads_per_block, not all 0
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[1 1 1 ... 1 1 1]
[1 1 1 ... 1 1 1]
[1 1 1 ... 1 1 1]]
block id within grid; y
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
sum of x,y block id within grid
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 1 2 3 ... 19 20 21]
[ 1 2 3 ... 19 20 21]
[ 1 2 3 ... 19 20 21]]
thread unique global x id within full grid; x
[[ 0 0 0 ... 0 0 0]
[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
...
[20 20 20 ... 20 20 20]
[21 21 21 ... 21 21 21]
[22 22 22 ... 22 22 22]]
thread unique global y id within full grid; y
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
sum of thread's unique global x,y ids
[[ 0 1 2 ... 18 19 20]
[ 1 2 3 ... 19 20 21]
[ 2 3 4 ... 20 21 22]
...
[20 21 22 ... 38 39 40]
[21 22 23 ... 39 40 41]
[22 23 24 ... 40 41 42]]
NOTE 1:
The thread IDs recorded for 'thread id within block; y'
are all zero for both versions of `threads_per_block` because we never
specify the number of threads per block that should be created for
the y axis.
So, the compiler defaults to creating a single thread along each
unspecified axis of every block. For that reason, the only
threadIdx.y value stored is 0 for all i,j elements of the array.

NOTE 2:
**Note 2 mostly pertains to the case where threads_per_block (32)
is greater than the number of elements in the corresponding axis of the image**
The block IDs recorded for 'block id within grid; x' are all zero for
the 32-thread version of `threads_per_block` because of the relative sizes of
the specified number of threads per block and the number of elements the
image has along the corresponding axis.
The size of a block, in any axis, is determined by the specified number
of threads for that axis. In this example script, we define threads_per_block
to have an explicit 32 threads in the x axis, leaving the compiler to give an
implicit 1 for both the y and z axes. We then tell the compiler to create 23 blocks
in the x axis, and 21 blocks in the y; resulting in:
    * A kernel where the device creates a grid of blocks, 23:21:1 for 483 blocks
        * (x:y:z -> 23:21:1)
    * Where each block has 32 threads
        * (x:y:z -> 32:1:1)
    * And our image has height:width of 23:21 for 483 'pixels' in each
      contrived layer of the image.
As is hopefully clear by now, because each block has 32 threads on its x axis,
and we have only 23 elements along the corresponding axis of the image, only 1
of the 23 blocks the device created along the grid's x axis will ever write to
the image. Do note that the overhead of creating those unused blocks is a waste
of GPU processor time and could potentially reduce the resources available to
the block that does get used.
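Lastly, regarding the waste called out in NOTE 2: the usual remedy is to derive block_count from the data shape and the chosen threads_per_block with a ceiling division, rather than hard-coding block_count = image_quality. Below is a sketch of how the first launch in the script could be sized that way, keeping the script's mapping of x to the image's first axis; it assumes the image_saturate kernel defined above is already in scope, and the specific numbers are just the ones used in this answer:

import math
import numpy as np
# assumes the image_saturate kernel from the example script has already been defined

image_quality = (23, 21)
image = np.zeros(image_quality + (3, 3), int) - 1

threads_per_block = (32, 1)  # 32 threads in x, 1 in y (z defaults to 1)

# one block per 32 rows and one block per column, rounded up so every element is covered
block_count = (math.ceil(image_quality[0] / threads_per_block[0]),  # ceil(23 / 32) = 1
               math.ceil(image_quality[1] / threads_per_block[1]))  # ceil(21 / 1)  = 21

# launches 1 * 21 = 21 blocks instead of the 23 * 21 = 483 used above; the kernel's
# bounds check still guards the threads of the single x block that run past row 22
image_saturate[block_count, threads_per_block](image)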