In recent NVIDIA GPU microarchitectures, a single streaming multiprocessor (SM) appears to be divided into 4 sub-units; in the block diagrams, each of these has horizontal or vertical 'bars' of 8 'squares', corresponding to different functional units: integer ops, 32-bit floating-point ops, 64-bit floating-point ops, and load/store. A single warp scheduler seems to be associated with each such "quarter-SM".
Now, in the CUDA programming model, the 32 threads of a warp are locked to the same instruction stream. However, when work is actually executed and, say, only the second half or the last quarter of a warp's threads are active, can the active sub-warp(s) be scheduled onto just 2 or 3 of the quarter-SMs, with the remaining quarter-SM(s) doing some other work?
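For concreteness, here is a minimal sketch of the kind of situation I have in mind (the kernel and names are just illustrative, not from any real codebase): only the upper half of each warp takes the expensive path, and the question is whether the idle lanes translate into quarter-SM issue slots that can be used for something else.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: lanes 0..15 of each warp exit the branch early,
// so only the upper half-warp (lanes 16..31) does real work.
__global__ void partial_warp_kernel(const float* in, float* out, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;   // lane index within the warp

    if (tid >= n) return;

    if (lane >= 16) {
        // Only half the warp's lanes are active in this branch;
        // the other half is masked off (idle).
        out[tid] = sqrtf(in[tid]) * 2.0f;
    }
}
```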