From your illustration, it seems that:
- The grid is composed of 16 thread blocks, numbered from 0 to 15.
- Each block is composed of 16 "SIMD threads", numbered from 0 to 15
- Each "SIMD thread" computes the product of 32 vector elements.
It is not necessarily obvious from the illustration whether "SIMD thread" means, in the CUDA (OpenCL) parlance:
- A warp (wavefront) of 32 threads (work-items)
or:
- A thread (work-item) working on 32 elements
I will assume the former ("SIMD thread" = warp/wavefront), since it is a more reasonable assumption performance-wise, but the latter isn't technically incorrect, it's simply suboptimal design (on current hardware, at least).
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
As stated above, there are 16 warps (numbered from 0 to 15, that makes 16) in thread block 0, each of them made of 32 threads. These threads execute in lockstep, simultaneously, in parallel. The warps are executed independently from each another, sequentially or in parallel, depending on the capabilities of the underlying hardware. For example, the hardware may be capable of scheduling a number of warps for simultaneous execution.
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
In this case, it is simply a decision of the programmer, but in some cases there are also hardware limitations that could force the programmer into changing the design. For example, there is a maximum number of threads a block can handle, and there is a maximum number of blocks a grid can handle.