How are CUDA threads executed inside a single block?

Question

I have several questions regarding CUDA. The following figure, taken from a book on parallel programming, shows how threads are allocated on the device for a multiplication of two vectors, each of length 8192.


1) In thread block 0 there are 15 SIMD threads. Are these 15 threads executed in parallel, or does only one thread run at any given time?

2) Each block contains 512 elements in this example. Does this number depend on the hardware, or is it a decision of the programmer?

score 2 · Answer 1 · edited May 23 '17 at 11:50


1) In this particular example, each thread seems to be assigned 32 elements of the vector. Code that is executed by a single thread is executed sequentially, so a thread would process its 32 elements one after another.
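To make that reading concrete, here is a minimal sketch, assuming 16 threads per block and 32 contiguous elements per thread (16 x 32 = 512 elements per block, 16 blocks for 8192 elements). The names `multiplyChunked` and `runChunkedDemo` are made up for illustration and are not from the book or the figure.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 8192;                // vector length from the question
constexpr int ELEMS_PER_THREAD = 32;   // elements assigned to one thread (assumed reading)
constexpr int THREADS_PER_BLOCK = 16;  // threads per block under this reading

// Each thread walks through its own 32 elements one after another.
__global__ void multiplyChunked(const float* a, const float* b, float* c, int n)
{
    int thread = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int first = thread * ELEMS_PER_THREAD;               // start of this thread's chunk
    for (int i = first; i < first + ELEMS_PER_THREAD && i < n; ++i)
        c[i] = a[i] * b[i];                              // sequential within the thread
}

// Hypothetical host helper: runs the kernel and returns the number of wrong results.
int runChunkedDemo()
{
    assert(N % (ELEMS_PER_THREAD * THREADS_PER_BLOCK) == 0);

    std::vector<float> ha(N), hb(N), hc(N, -1.0f);
    for (int i = 0; i < N; ++i) { ha[i] = float(i); hb[i] = 2.0f; }

    size_t bytes = N * sizeof(float);
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = N / (ELEMS_PER_THREAD * THREADS_PER_BLOCK);  // 8192 / 512 = 16
    multiplyChunked<<<blocks, THREADS_PER_BLOCK>>>(da, db, dc, N);
    cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);

    int errors = 0;
    for (int i = 0; i < N; ++i)
        if (hc[i] != ha[i] * hb[i]) ++errors;
    return errors;
}
```

Calling `runChunkedDemo()` from `main` should return 0 on a machine with a working CUDA device. The loop inside the kernel is the sequential part: different threads may run at the same time, but each one handles its own chunk element by element.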

2) The size of the thread blocks is up to the programmer. However, there are restrictions on the number and size of the thread blocks given the hardware the code is executed on. For more information on this, see this elaborate answer: Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)


answered Oct 07 '14 at 09:46

Jesse


I did not understand the answer to the 1st question – DesirePRG Oct 07 '14 at 09:52
I can give you a code example later today, writing code on a phone is tedious. :) – Jesse Oct 07 '14 at 10:20
Check Cicada's answer, it is better than this one. – Jesse Oct 07 '14 at 20:48

score 0 · Accepted Answer · answered Oct 07 '14 at 11:23

From your illustration, it seems that:

The grid is composed of 16 thread blocks, numbered from 0 to 15.
Each block is composed of 16 "SIMD threads", numbered from 0 to 15.
Each "SIMD thread" computes the product of 32 vector elements.

It is not necessarily obvious from the illustration whether "SIMD thread" means, in the CUDA (OpenCL) parlance:

A warp (wavefront) of 32 threads (work-items)

or:

A thread (work-item) working on 32 elements

I will assume the former ("SIMD thread" = warp/wavefront), since it is the more reasonable assumption performance-wise. The latter isn't technically incorrect; it is simply a suboptimal design (on current hardware, at least).

1) In thread block 0 there are 15 SIMD threads. Are these 15 threads executed in parallel, or does only one thread run at any given time?

As stated above, there are 16 warps in thread block 0 (numbered from 0 to 15, which makes 16), each made of 32 threads. The threads within a warp execute in lockstep, simultaneously, in parallel. The warps execute independently of one another, sequentially or in parallel, depending on the capabilities of the underlying hardware. For example, the hardware may be capable of scheduling several warps for simultaneous execution.
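Under the warp interpretation, the launch could look roughly like the sketch below: 16 blocks of 512 threads, one element per thread, with each thread also recording which warp of its block it belongs to. The names `multiplyPerElement` and `runWarpDemo` are hypothetical, and the host-side check assumes a warp size of 32, which holds on NVIDIA GPUs to date.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 8192;                 // vector length from the question
constexpr int THREADS_PER_BLOCK = 512;  // 16 warps of 32 threads each

// One element per thread; warpOf[i] stores the warp (0..15) that handled element i.
__global__ void multiplyPerElement(const float* a, const float* b, float* c,
                                   int* warpOf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) {
        c[i] = a[i] * b[i];
        warpOf[i] = threadIdx.x / warpSize;         // warp index within the block
    }
}

// Hypothetical host helper: returns the number of wrong products or warp indices.
int runWarpDemo()
{
    assert(N % THREADS_PER_BLOCK == 0);

    std::vector<float> ha(N), hb(N), hc(N, -1.0f);
    std::vector<int> hw(N, -1);
    for (int i = 0; i < N; ++i) { ha[i] = float(i); hb[i] = 2.0f; }

    size_t bytes = N * sizeof(float);
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    int* dw = nullptr;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMalloc(&dw, N * sizeof(int));
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = N / THREADS_PER_BLOCK;  // 8192 / 512 = 16 blocks
    multiplyPerElement<<<blocks, THREADS_PER_BLOCK>>>(da, db, dc, dw, N);
    cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(hw.data(), dw, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    cudaFree(dw);

    int errors = 0;
    for (int i = 0; i < N; ++i) {
        if (hc[i] != ha[i] * hb[i]) ++errors;
        if (hw[i] != (i % THREADS_PER_BLOCK) / 32) ++errors;  // assumes warpSize == 32
    }
    return errors;
}
```

Nothing in the kernel says how the 16 warps of a block are scheduled. The code only fixes which thread handles which element; whether the warps run one after another or several at once is left to the hardware.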

2) Each block contains 512 elements in this example. Does this number depend on the hardware, or is it a decision of the programmer?

In this case, it is simply a decision of the programmer, but in some cases there are also hardware limitations that could force the programmer into changing the design. For example, there is a maximum number of threads a block can handle, and there is a maximum number of blocks a grid can handle.
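As a rough illustration of those limits, the sketch below asks the CUDA runtime for the properties of device 0 and checks a one-dimensional launch configuration against them. The helper name `launchConfigFits` is made up for this example, and it only looks at the thread count per block and the grid size in x; real kernels may also be limited by registers and shared memory.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper: true if a 1D launch of `blocks` blocks with
// `threadsPerBlock` threads each stays within the limits reported for device 0.
bool launchConfigFits(int threadsPerBlock, int blocks)
{
    assert(threadsPerBlock > 0 && blocks > 0);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;  // no usable device, so nothing fits

    return threadsPerBlock <= prop.maxThreadsPerBlock  // per-block thread limit
        && blocks <= prop.maxGridSize[0];              // grid limit in x
}
```

For the configuration in the question, `launchConfigFits(512, 16)` should hold on any CUDA device, while something like `launchConfigFits(1 << 20, 16)` should be rejected because no device allows that many threads in one block.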
