I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I identified a specific __device__
function whose performances can be strongly improved by removing all global memory accesses.
Does CUDA allow the dynamic compilation and linking of a single __device__
function (not a __global__
), in order to "override" an existing function?