According to our textbook, a Fermi SM can hold up to 1536 resident threads.
Let's say now I call a kernel like this:
kernel<<<8, 1024>>>();
If all 8 blocks were placed on the same SM, they would not fit, since 8 * 1024 = 8192 > 1536. If I instead launch the kernel like this:
kernel<<<8, 10>>>();
Then all 8 blocks (80 threads in total) could fit on the same SM, which might save resources, though I am not sure about that. So why don't we need to specify whether the blocks run on the same SM?
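
To make the arithmetic behind my question concrete, here is a rough host-side sketch (plain C++, so it should also compile inside a .cu file) of how many blocks could be resident on one SM at a time. Only the 1536-thread limit comes from the textbook. The limit of 8 resident blocks per SM and the warp size of 32 are my assumptions for Fermi, the helper name max_resident_blocks is made up for illustration, and register and shared memory usage are ignored entirely.

```cpp
#include <algorithm>

// Per-SM limits assumed for Fermi (compute capability 2.x).
constexpr int kMaxThreadsPerSM = 1536;  // from the textbook
constexpr int kMaxBlocksPerSM  = 8;     // assumption, not from the textbook
constexpr int kWarpSize        = 32;    // assumption, not from the textbook

// Hypothetical helper: upper bound on the number of blocks that can be
// resident on one SM at once, considering only the thread and block limits.
constexpr int max_resident_blocks(int threads_per_block) {
    // Threads are allocated in whole warps, so round the block size up
    // to a multiple of the warp size.
    const int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    const int thread_slots    = warps_per_block * kWarpSize;
    // The SM is limited both by its thread slots and by its block slots.
    return std::min(kMaxBlocksPerSM, kMaxThreadsPerSM / thread_slots);
}
```

Under these assumptions, max_resident_blocks(1024) gives 1 and max_resident_blocks(10) gives 8, which matches my reasoning above: the 1024-thread blocks cannot all share one SM, while the 10-thread blocks could.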