
Blocks are scheduled to run on a specific streaming multiprocessor (SM), they run to completion, and there is a maximum number of blocks that can be resident on a single SM at any given time, determined by register constraints, shared memory constraints, or the compute capability level.

Since CUDA picks a certain number of blocks to execute at the same time, I gather that internally there must be some API or formula that lets CUDA determine this at runtime. However, is this API publicly available, and is it documented somewhere?

The reason for asking is that we need a buffer of size blocks_per_sm * num_sms, and due to memory constraints we would like to keep this buffer as small as possible. In particular, if register constraints mean we can run far fewer blocks than the maximum specified by the compute capability, we would like to save that space.
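For concreteness, a minimal sketch of the worst-case allocation described above, with the per-SM block limit hard-coded; the value 32, the element type, and the device index are placeholders, not taken from our actual code:

```cpp
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Pessimistic sizing: assume the architectural maximum of resident
    // blocks per SM (32 is a placeholder; the real limit depends on the
    // compute capability and, in practice, on the kernel's resource use).
    int maxBlocksPerSM = 32;
    size_t elems = (size_t)maxBlocksPerSM * prop.multiProcessorCount;

    float *buffer = nullptr;
    cudaMalloc(&buffer, elems * sizeof(float));
    // ... launch a kernel that indexes buffer by (SM, resident block) ...
    cudaFree(buffer);
    return 0;
}
```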

  • http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OCCUPANCY.html#group__CUDART__OCCUPANCY – talonmies Apr 23 '17 at 15:47
  • In addition to the actual API documentation, which @talonmies has pointed out, the Occupancy API is also covered briefly in [the programming guide](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#occupancy-calculator) and there is an Occupancy Calculator spreadsheet that ships with the CUDA toolkit. The `__launch_bounds__` [compiler directive](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds) is also tangentially related. – Robert Crovella Apr 23 '17 at 19:01
  • @talonmies, could you post an answer with that link so I can accept it and upvote it? As it is exactly what I need. – TheDutchDevil Apr 28 '17 at 20:32
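Following up on the links in the comments, a minimal sketch of sizing the buffer with the occupancy API; the kernel name, block size, and dynamic shared memory amount below are assumptions, not from the question:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; the occupancy query accounts for its actual
// register and shared-memory usage as compiled.
__global__ void myKernel(float *buf) { /* ... */ }

int main() {
    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

    // Maximum blocks of myKernel that can be resident per SM at this
    // block size, given registers, shared memory, and compute capability.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);

    size_t elems = (size_t)blocksPerSM * numSMs;
    printf("blocks/SM = %d, SMs = %d -> buffer elements = %zu\n",
           blocksPerSM, numSMs, elems);
    return 0;
}
```

The `__launch_bounds__` directive mentioned above approaches the same problem from the other side: rather than querying what the compiler produced, it tells the compiler the intended launch configuration so it can budget registers accordingly.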

0 Answers