A new block can be scheduled as soon as an SM has sufficient unused resources to support it. It is not necessary for the SM to be completely drained of blocks before new blocks can be scheduled.
As pointed out in the comments, I'm not sure I can point to public documentation that supports this assertion. However, it's possible to create a test case and prove it to yourself.
In a nutshell: create a block-specialized kernel and launch it with many blocks. The first block on each SM discovers and declares itself using atomics, and these blocks "persist" until all other blocks have completed, tracked with a block-completed counter (again using atomics, similar to the threadfence reduction sample code). All other blocks, those that are not first on a given SM, simply exit. If such a code completes rather than hanging, that is proof that new blocks can be scheduled even while some blocks are still resident.
Here is a fully worked example:
$ cat t743.cu
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define NB 1000
// increase array length here if your GPU has more than 32 SMs
#define MAX_SM 32
// set HANG_TEST to 1 to demonstrate a hang for test purposes
#define HANG_TEST 0
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
// return the id of the SM this thread is running on (PTX %smid special register)
static __device__ __inline__ uint32_t __smid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
__device__ volatile int blocks_completed = 0;
// increase array length here if your GPU has more than 32 SMs
__device__ int first_SM[MAX_SM];
// launch with one thread per block only
__global__ void tkernel(int num_blocks, int num_SMs){
    int my_SM = __smid();
    // the first block to arrive on each SM claims that SM...
    int im_not_first = atomicCAS(first_SM+my_SM, 0, 1);
    if (!im_not_first){
        // ...and persists until all non-persistent blocks have completed
        while (blocks_completed < (num_blocks-num_SMs+HANG_TEST));
    }
    atomicAdd((int *)&blocks_completed, 1);
}
int main(int argc, char *argv[]){
    int my_dev = 0;
    if (argc > 1) my_dev = atoi(argv[1]);
    cudaSetDevice(my_dev);
    cudaCheckErrors("invalid CUDA device");
    int tot_SM = 0;
    cudaDeviceGetAttribute(&tot_SM, cudaDevAttrMultiProcessorCount, my_dev);
    cudaCheckErrors("CUDA error");
    if (tot_SM > MAX_SM) {printf("program configuration error\n"); return 1;}
    printf("running on device %d, with %d SMs\n", my_dev, tot_SM);
    // zero the per-SM "first block arrived" flags before the launch
    int temp[MAX_SM];
    for (int i = 0; i < MAX_SM; i++) temp[i] = 0;
    cudaMemcpyToSymbol(first_SM, temp, MAX_SM*sizeof(int));
    cudaCheckErrors("cudaMemcpyToSymbol fail");
    tkernel<<<NB, 1>>>(NB, tot_SM);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel error");
    return 0;
}
$ nvcc -o t743 t743.cu
$ ./t743 0
running on device 0, with 15 SMs
$ ./t743 1
running on device 1, with 1 SMs
$ ./t743 2
I have tested the above code on Linux with CUDA 7, on a K40c, C2075, and Quadro NVS 310 GPU. It doesn't hang. As a sanity check, setting HANG_TEST to 1 makes each persistent block wait for one more completion than the non-persistent blocks can ever supply, so the kernel hangs, which demonstrates that the test really can detect the failure case.
To answer your second question, a block generally remains on the SM on which it was first scheduled. One possible exception is CUDA dynamic parallelism.
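If you want to observe this yourself, here is a minimal sketch of how you might check (my own illustration, not part of the test case above; the kernel name and spin duration are arbitrary choices). Each block records %smid when it starts, busy-waits for a while, reads %smid again, and prints a message if the SM id changed. On the devices above I would expect no such message, since a block ordinarily runs to completion on the SM where it was scheduled, but note that the PTX ISA documents %smid as volatile, so treat this as an observation rather than a guarantee.
#include <stdio.h>
#include <stdint.h>

static __device__ __inline__ uint32_t __smid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// each block spins for a fixed number of clock cycles, then checks
// whether it is still executing on the SM it started on
__global__ void residency_check(long long spin_cycles){
    uint32_t start_sm = __smid();
    long long start = clock64();
    while (clock64() - start < spin_cycles);  // keep the block resident for a while
    uint32_t end_sm = __smid();
    if (start_sm != end_sm)
        printf("block %u moved from SM %u to SM %u\n", blockIdx.x, start_sm, end_sm);
}

int main(){
    residency_check<<<1000, 1>>>(1000000LL);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess){
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("done (no output above means no block changed SMs)\n");
    return 0;
}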