
I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I have identified a specific `__device__` function whose performance can be greatly improved by removing all global memory accesses.

Does CUDA allow the dynamic compilation and linking of a single __device__ function (not a __global__), in order to "override" an existing function?

  • I agree with the answer by @talonmies. Replacing a function that is already compiled and linked into an object is not possible, I don't think. However I think it should be possible to load a device-compiled kernel at runtime which has a call to a `__device__` function, and then choose and compile and link at runtime the specific device function you will use for that call. – Robert Crovella Aug 09 '16 at 19:41
  • Unfortunately, I cannot choose a specific function, because the code of the function would be generated according to the model to be simulated. That is, loading different models implies different equations to be evaluated in the `__device__` function. This is why I could use the JIT compilation provided by NVRTC. – Marco S. Nobile Aug 09 '16 at 19:49
  • You could choose the function after the model is loaded and you know which function you want to use (or which function you have just generated). – Robert Crovella Aug 09 '16 at 19:52
  • @MarcoS.Nobile are you writing the rest of the kernel yourself or are you using a library? – Pavan Yalamanchili Aug 10 '16 at 19:56
  • @PavanYalamanchili: Besides the CURAND library for RNG, the whole kernel is written by me. No external libraries of any kind. – Marco S. Nobile Aug 11 '16 at 17:49

2 Answers


I am pretty sure the really short answer is no.

Although CUDA has dynamic/JIT device linker support, it is important to remember that the linkage process itself is still static.

So you can't delay load a particular function in an existing compiled GPU payload at runtime as you can in a conventional dynamic link loading environment. And the linker still requires that a single instance of all code objects and symbols be present at link time, whether that is a priori or at runtime. So you would be free to JIT link together precompiled objects with different versions of the same code, as long as a single instance of everything is present when the session is finalised and the code is loaded into the context. But that is as far as you can go.
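The JIT linking described above goes through the CUDA driver API's `cuLink*` entry points. The following is a minimal sketch of that flow (error checking omitted, and the two PTX strings and the helper name are placeholders): both the precompiled kernel and exactly one implementation of the device function must be added before the link is finalized. It requires a CUDA driver and device to actually run.

```cpp
// Sketch only: error checking omitted. kernelPtx holds PTX for a kernel
// with an unresolved call to a __device__ function; deviceFuncPtx holds
// PTX for the chosen implementation of that function.
#include <cuda.h>
#include <cstring>

CUmodule linkAndLoad(const char* kernelPtx, const char* deviceFuncPtx) {
    CUlinkState state;
    cuLinkCreate(0, nullptr, nullptr, &state);

    // One instance of every symbol must be present before finalization.
    cuLinkAddData(state, CU_JIT_INPUT_PTX, (void*)kernelPtx,
                  strlen(kernelPtx) + 1, "kernel.ptx", 0, nullptr, nullptr);
    cuLinkAddData(state, CU_JIT_INPUT_PTX, (void*)deviceFuncPtx,
                  strlen(deviceFuncPtx) + 1, "func.ptx", 0, nullptr, nullptr);

    // Finalize the link session and load the resulting cubin.
    void* cubin; size_t cubinSize;
    cuLinkComplete(state, &cubin, &cubinSize);

    CUmodule module;
    cuModuleLoadData(&module, cubin);  // load before destroying the state,
    cuLinkDestroy(state);              // which owns the cubin buffer
    return module;
}
```

Swapping in a different `deviceFuncPtx` on the next call gives you a different module, which is as close to "overriding" as the static linkage model allows.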

talonmies
  • Thank you for your kind answer. Just to be sure, would you be so kind to analyze this simple example that I found online? https://groups.google.com/forum/#!topic/thrust-users/GGWw_uHPfGs It should be the JIT compilation, linking and override of a device functor passed to thrust – Marco S. Nobile Aug 09 '16 at 19:52
  • I was assuming that when you said "Does CUDA allow..." you were talking about part of the defined/published functionality. The example you've now linked uses unpublished features in a way that may or may not be supported in the future. If you want to build your software on that, it may work, but it may also break. And if you go looking for documentation about how to do it, you won't find it. The process of replacing a linked function with another linked function at runtime that way is also essentially orthogonal to nvRTC. – Robert Crovella Aug 09 '16 at 23:12
  • Ok, got it. I will pursue a different approach (probably by recompiling the whole binary on every model change). I still think this functionality would be cool, btw :) Thank you for your clarification! – Marco S. Nobile Aug 10 '16 at 05:10

It looks like you have a "main" kernel with a part that is "switchable" at run time.

You can definitely do this using NVRTC. You'd need to do something like this:

  • Instead of compiling the main kernel ahead of time, store it as a string to be compiled and linked at runtime.
  • Let's say the main kernel calls "myFunc", a `__device__` function that is chosen at runtime.
  • You can generate the appropriate "myFunc" implementation from the model's equations at run time.
  • Now you can create an nvrtc program from multiple sources using `nvrtcCreateProgram`.

That's about it. The key is to delay compiling the main kernel until you need it at run time. You may also want to cache your kernels somehow so you end up compiling only once.

There is one problem I foresee: NVRTC may not find the CURAND device calls, which may cause some issues. One workaround would be to look at the header the device function call is in and use nvcc to compile the appropriate device code to PTX ahead of time. You can store the resulting PTX as text and use `cuLinkAddData` to link it with your module. You can find more information in this section.
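The workaround might look like the following sketch (file and variable names are hypothetical): the CURAND-using helpers are compiled to PTX offline with nvcc, shipped as text, and added to the same link session as the NVRTC output, so NVRTC itself never has to see the CURAND headers.

```cpp
// Sketch of the CURAND workaround; names are hypothetical.
// Offline, once:  nvcc -ptx rng_helpers.cu -o rng_helpers.ptx
#include <cuda.h>
#include <cstring>

void addPrecompiledRng(CUlinkState state, const char* rngPtx) {
    // rngPtx holds the contents of rng_helpers.ptx, loaded from disk or
    // embedded in the binary; it links alongside the NVRTC-generated PTX.
    cuLinkAddData(state, CU_JIT_INPUT_PTX, (void*)rngPtx,
                  strlen(rngPtx) + 1, "rng_helpers.ptx",
                  0, nullptr, nullptr);
}
```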

Pavan Yalamanchili
  • This is closer to what I had in mind. So, the key is that I have to compile the whole kernel - together with the new generated function - at run-time. And yes, you were right, CURAND is currently an issue and your solution is probably the most suitable approach. Thank you! – Marco S. Nobile Aug 15 '16 at 19:56