I have been studying the CUDA programming model, and my understanding so far is this: after the blocks and threads are created, each block is assigned to one streaming multiprocessor (SM). For example, I am using a GeForce GTX 560 Ti, which has 14 streaming multiprocessors, so at any one time 14 blocks can be assigned, one to each SM. But I have been going through several online materials, such as this one:
http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/GPU+CUDA.pdf
which says that several blocks can run concurrently on one multiprocessor. I am quite confused about how threads and blocks are executed on the streaming multiprocessors. I know that the order in which blocks are assigned and threads are executed is arbitrary, but I would like to know how blocks and threads are actually mapped onto the multiprocessors so that this concurrent execution can occur.
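
To make "several blocks on one multiprocessor" concrete, here is a rough host-side C++ sketch (it needs no GPU) of how the number of blocks resident on one SM might be estimated. It is an illustration under stated assumptions, not the actual hardware scheduler: the struct `SmLimits` and both functions are hypothetical names, the default limits are the values I believe apply to compute capability 2.x devices, and the estimate ignores register and shared-memory allocation granularity.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical per-SM resource limits. The defaults are ASSUMED values for a
// compute capability 2.x (Fermi) device; check them against your own GPU.
struct SmLimits {
    int maxBlocks      = 8;      // resident blocks per SM
    int maxWarps       = 48;     // resident warps per SM (48 * 32 = 1536 threads)
    int registers      = 32768;  // 32-bit registers per SM
    int sharedMemBytes = 49152;  // shared memory per SM (48 KB configuration)
    int warpSize       = 32;     // threads per warp
};

// Estimate how many blocks of one kernel can be resident on a single SM.
// Each resource gives its own upper bound; the smallest bound wins.
int residentBlocksPerSM(const SmLimits& sm, int threadsPerBlock,
                        int regsPerThread, int sharedMemPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread >= 0 && sharedMemPerBlock >= 0);

    // Threads are scheduled in whole warps, so round up.
    int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;

    int byWarps = sm.maxWarps / warpsPerBlock;

    int regsPerBlock = regsPerThread * warpsPerBlock * sm.warpSize;
    int byRegs = regsPerBlock > 0 ? sm.registers / regsPerBlock : INT_MAX;

    int bySmem = sharedMemPerBlock > 0 ? sm.sharedMemBytes / sharedMemPerBlock
                                       : INT_MAX;

    return std::min({sm.maxBlocks, byWarps, byRegs, bySmem});
}

// Blocks that can be resident on the whole device at the same time.
int residentBlocksOnDevice(const SmLimits& sm, int smCount, int threadsPerBlock,
                           int regsPerThread, int sharedMemPerBlock) {
    return smCount * residentBlocksPerSM(sm, threadsPerBlock, regsPerThread,
                                         sharedMemPerBlock);
}
```

Under these assumed limits, a kernel launched with 256 threads per block, 20 registers per thread and no shared memory would fit 6 blocks on each SM (the warp and register limits both allow 6), so a device with 14 SMs could hold 84 blocks at once rather than 14. Any further blocks in the grid would wait until a resident block finishes and frees its resources.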