Suppose dynamic analysis was done on a CUDA program and it turned out that certain threads would be better off in the same warp.
For example, let's say we have 1024 CUDA threads and a warp size of 32. After dynamic analysis we find that threads 989, 243, 819, ..., 42 (32 threads total) should be in the same warp. We determined this because they have little to no divergence in code execution -- they were not necessarily in the same warp during the dynamic analysis run of the CUDA program.
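To make the scenario concrete, here is a minimal sketch of the kind of kernel I have in mind (the kernel name, the `flags` array, and the two path functions are all made up for illustration): a data-dependent branch whose per-thread outcome is only known after profiling a real input, which is exactly what the dynamic analysis discovers.

```cuda
// Stand-ins for two expensive, divergent code paths.
__device__ float pathA(int i) { return i * 0.5f; }
__device__ float pathB(int i) { return i * 2.0f; }

// Which branch a thread takes depends on per-thread data, so the set of
// threads that could be grouped without divergence (989, 243, 819, ..., 42)
// is only known after profiling.
__global__ void divergentKernel(const int *flags, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // 0..1023 in my example

    // If flags[] differs across the threads of a warp, both paths serialize.
    if (flags[tid])
        out[tid] = pathA(tid);
    else
        out[tid] = pathB(tid);
}

// Launched e.g. as: divergentKernel<<<1, 1024>>>(d_flags, d_out);
```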
Is there a way to control thread-to-warp scheduling in CUDA? If not, is there another GPU programming language that offers this kind of explicit warp scheduling? If not, what could be done (possibly even a very low-level approach)? I am hoping there is at least an answer to this last question, as that is probably how CUDA was implemented -- unless warp scheduling is done at the hardware level, which would be unfortunate. Thanks!
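For what it's worth, the only indirect workaround I've been able to come up with is a sketch like the one below: leave the hardware grouping alone (as far as I understand, a warp is just 32 consecutive thread IDs within a block) and permute the data indices instead, so the 32 work items my analysis grouped together end up on 32 consecutive thread IDs. The `perm` table and the kernel are hypothetical, built offline from the analysis results.

```cuda
// Hypothetical remapping sketch: perm[0..31] = {989, 243, 819, ..., 42},
// perm[32..63] = the next group found by the analysis, and so on.
__global__ void remappedKernel(const int *perm, const int *flags, float *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // hardware thread ID
    int item = perm[tid];                              // work item this thread now handles

    // All 32 threads of a warp now process items that (per the analysis)
    // take the same branch, so the branch should no longer diverge in-warp.
    if (flags[item])
        out[item] = item * 0.5f;   // stand-in for one code path
    else
        out[item] = item * 2.0f;   // stand-in for the other
}
```

But that only reorders the data, not the thread-to-warp assignment itself, so I'd still like to know whether the scheduling can be controlled directly.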