A new block can be scheduled as soon as an SM has sufficient unused resources to support it. It is not necessary for the SM to be completely drained of blocks before new blocks can be scheduled.
As pointed out in the comments, I'm not sure I can point to public documentation that supports this assertion. However, it's possible to create a test case and prove it to yourself.
In a nutshell: create a block-specialized kernel and launch it with many blocks. The first block on each SM discovers and declares itself using atomics, and these blocks "persist" until all other blocks have completed, tracked with a block-completed counter (again using atomics, similar to the threadfence reduction sample code). All other blocks, those that are not first on a given SM, simply exit. If such a code completes rather than hanging, that is proof that new blocks can be scheduled even while some blocks are still resident.
Here is a fully worked example:
$ cat t743.cu
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define NB 1000
// increase array length here if your GPU has more than 32 SMs
#define MAX_SM 32
// set HANG_TEST to 1 to demonstrate a hang for test purposes
#define HANG_TEST 0
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
// return the id of the SM this thread is running on (PTX %smid special register)
static __device__ __inline__ uint32_t __smid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
__device__ volatile int blocks_completed = 0;
// increase array length here if your GPU has more than 32 SMs
__device__ int first_SM[MAX_SM];
// launch with one thread per block only
__global__ void tkernel(int num_blocks, int num_SMs){
    int my_SM = __smid();
    // the first block to arrive on each SM claims that SM...
    int im_not_first = atomicCAS(first_SM+my_SM, 0, 1);
    if (!im_not_first){
        // ...and persists until all non-persistent blocks have completed
        while (blocks_completed < (num_blocks-num_SMs+HANG_TEST));
    }
    atomicAdd((int *)&blocks_completed, 1);
}
int main(int argc, char *argv[]){
    int my_dev = 0;
    if (argc > 1) my_dev = atoi(argv[1]);
    cudaSetDevice(my_dev);
    cudaCheckErrors("invalid CUDA device");
    int tot_SM = 0;
    cudaDeviceGetAttribute(&tot_SM, cudaDevAttrMultiProcessorCount, my_dev);
    cudaCheckErrors("CUDA error");
    if (tot_SM > MAX_SM) {printf("program configuration error\n"); return 1;}
    printf("running on device %d, with %d SMs\n", my_dev, tot_SM);
    // zero the per-SM "first block arrived" flags before the launch
    int temp[MAX_SM];
    for (int i = 0; i < MAX_SM; i++) temp[i] = 0;
    cudaMemcpyToSymbol(first_SM, temp, MAX_SM*sizeof(int));
    cudaCheckErrors("cudaMemcpyToSymbol fail");
    tkernel<<<NB, 1>>>(NB, tot_SM);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel error");
    return 0;
}
$ nvcc -o t743 t743.cu
$ ./t743 0
running on device 0, with 15 SMs
$ ./t743 1
running on device 1, with 1 SMs
$ ./t743 2
I have tested the above code on Linux with CUDA 7, on a K40c, C2075, and Quadro NVS 310 GPU. It doesn't hang. As a sanity check, setting HANG_TEST to 1 makes each persistent block wait for one more completion than the non-persistent blocks can ever supply, so the kernel hangs, which demonstrates that the test really can detect the failure case.
To answer your second question, a block generally remains on the SM on which it was first scheduled. One possible exception is CUDA dynamic parallelism.
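If you want to observe this yourself, here is a minimal sketch of how you might check (my own illustration, not part of the test case above; the kernel name and spin duration are arbitrary choices). Each block records %smid when it starts, busy-waits for a while, reads %smid again, and prints a message if the SM id changed. On the devices above I would expect no such message, since a block ordinarily runs to completion on the SM where it was scheduled, but note that the PTX ISA documents %smid as volatile, so treat this as an observation rather than a guarantee.
#include <stdio.h>
#include <stdint.h>

static __device__ __inline__ uint32_t __smid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// each block spins for a fixed number of clock cycles, then checks
// whether it is still executing on the SM it started on
__global__ void residency_check(long long spin_cycles){
    uint32_t start_sm = __smid();
    long long start = clock64();
    while (clock64() - start < spin_cycles);  // keep the block resident for a while
    uint32_t end_sm = __smid();
    if (start_sm != end_sm)
        printf("block %u moved from SM %u to SM %u\n", blockIdx.x, start_sm, end_sm);
}

int main(){
    residency_check<<<1000, 1>>>(1000000LL);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess){
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("done (no output above means no block changed SMs)\n");
    return 0;
}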