
I recently read this CUDA tutorial: https://developer.nvidia.com/blog/even-easier-introduction-cuda/ and one thing was unclear. When we sum two vectors, we divide the task into several blocks and threads so it runs in parallel. My question is: why doesn't the number of blocks (and perhaps threads) depend on the physical properties of the GPU, such as the number of physical SMs (streaming multiprocessors) and the number of threads each can run?

For example, suppose a GPU has 16 SMs and each of them can run 128 threads. Would it be faster to split the problem into 16 blocks of 128 threads, or, as in the article, into roughly 4000 blocks of 256 threads each?
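For concreteness, here is a minimal sketch of the two launch shapes being compared, using the grid-stride vector-add kernel from the linked tutorial. The 16×128 shape reflects the hypothetical GPU in the question, not a real device limit:

```cuda
#include <cstdio>

// Vector-add kernel from the tutorial, written with a grid-stride loop so
// that any launch shape covers all N elements.
__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main() {
    int N = 1 << 20;  // ~1M elements, as in the tutorial
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Shape A: one block per hypothetical SM (16 blocks x 128 threads).
    add<<<16, 128>>>(N, x, y);

    // Shape B: enough blocks for one thread per element
    // (4096 blocks x 256 threads), as the article does.
    add<<<(N + 255) / 256, 256>>>(N, x, y);

    // Both launches run here only to show the two shapes; each one adds
    // x into y once, so y ends up at 4.0f.
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```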

Denisof
  • https://stackoverflow.com/a/9986748/681865 – talonmies Jan 13 '21 at 11:48
  • "For example, suppose a GPU has 16 SMs and each of them can run 128 threads" There is no such animal. GPUs for the last 10 years or so can run a minimum of 1024 threads per SM. People commonly assume that the number of cores per SM is an important consideration here. **It is not.** The GPU is a latency-hiding machine. A proper explanation of this topic requires about an hour with a full slide deck. You can review that training [here](https://www.olcf.ornl.gov/cuda-training-series/), session 3, if you wish. – Robert Crovella Jan 13 '21 at 17:37
  • For those wondering about tagging: CUDA isn't multithreading in the conventional sense of the word; it's CUDA, and that automatically implies a massively threaded execution model, so the tag is superfluous. There are questions tagged with both the multithreading and CUDA tags, but those are specifically about multithreaded host applications that use CUDA and about thread safety of the CUDA APIs. This is not one of those questions, which is why I removed the tag. – talonmies Jan 14 '21 at 01:00

1 Answer


It does not depend on the physical GPU because the total number of threads is determined mainly by your problem size, while the block size is bounded by your GPU architecture. For example, if your GPU has 3000 cores and supports blocks of at most 512 threads, and your code processes a matrix with 2 billion elements, you specify a launch where numBlocks × numThreadsPerBlock (with numThreadsPerBlock no greater than 512) is equal to or greater than 2 billion. The GPU's scheduler then distributes those blocks across the SMs until every thread implied by numBlocks × numThreadsPerBlock has been executed.
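As a minimal sketch of that rule (the kernel name `process` and the element count are illustrative, not from the question):

```cuda
#include <cstdio>

// Illustrative kernel: one thread per element. Threads whose global index
// falls past the end of the data bounds-check and do nothing.
__global__ void process(int n, float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main() {
    int N = 10000000;  // problem size drives the total thread count
    float *data;
    cudaMallocManaged(&data, N * sizeof(float));
    for (int i = 0; i < N; i++) data[i] = 1.0f;

    int threadsPerBlock = 512;  // capped by the architecture, not by N
    // Ceiling division so that numBlocks * threadsPerBlock >= N.
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    process<<<numBlocks, threadsPerBlock>>>(N, data);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // expect 2.0
    cudaFree(data);
    return 0;
}
```

For sizes near 2 billion you would switch to 64-bit indexing inside the kernel, but the launch rule is the same: round the block count up so the grid covers the whole problem.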

Tiago Orl