Given that blocks are scheduled to run on a specific streaming multiprocessor (SM), that blocks run to completion, and that there is a maximum number of blocks that can be resident on a single SM at any given time (limited by register usage, shared memory usage, or the compute capability level), CUDA must internally have some API or formula that lets it determine at runtime how many blocks to execute concurrently. Is this API publicly available, and is it documented somewhere?
The reason for asking is that we need a buffer of size `blocks_per_sm * sm_count`, and due to memory constraints we would like to keep this buffer as small as possible. In particular, if register pressure means we can run far fewer blocks per SM than the maximum specified by the compute capability, we would like to reclaim that space.
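To make the intent concrete, here is a sketch of roughly what we would like to write. It assumes `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (which I came across in the runtime API reference of recent toolkits) is the right tool, together with `multiProcessorCount` from `cudaGetDeviceProperties`; the kernel name, block size, and one-`float`-per-block buffer layout are placeholders for our actual code:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for our real kernel.
__global__ void myKernel(float *buf) {
    // ... kernel body elided ...
}

int main() {
    int device = 0;
    cudaSetDevice(device);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Ask the runtime how many blocks of myKernel can be resident per SM
    // for a given block size and dynamic shared memory usage. As I
    // understand it, this accounts for register and shared memory
    // pressure, not just the compute-capability limit.
    int blocksPerSm = 0;
    int blockSize = 256;     // the block size we launch with
    size_t dynamicSmem = 0;  // we use no dynamic shared memory
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, myKernel, blockSize, dynamicSmem);

    // The buffer only needs one slot per concurrently resident block.
    int maxResidentBlocks = blocksPerSm * prop.multiProcessorCount;
    printf("blocks/SM: %d, SMs: %d, buffer slots: %d\n",
           blocksPerSm, prop.multiProcessorCount, maxResidentBlocks);

    float *buf = nullptr;
    cudaMalloc(&buf, maxResidentBlocks * sizeof(float));
    // ... launch myKernel, using at most maxResidentBlocks slots ...
    cudaFree(buf);
    return 0;
}
```

If that occupancy call does account for the runtime constraints listed above, it would give exactly the per-SM block count we are after; confirmation of that, or a pointer to where its guarantees are documented, would be appreciated.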