
I have a fixed kernel and I want the ability to incorporate user defined device functions to alter the output. The user defined functions will always have the same input arguments and will always output a scalar value. If I knew the user defined functions at compile time I could just pass them in as pointers to the kernel (and have a default device function that operates on the input if given no function). I have access to the user defined function's PTX code at runtime and am wondering if I could use something like NVIDIA's jitify to compile the PTX at run time, get a pointer to the device function, and then pass this device function to the precompiled kernel function.

I have seen a few postings that get close to answering this (How to generate, compile and run CUDA kernels at runtime) but most suggest compiling the entire kernel along with the device function at runtime. Given that the device function has fixed inputs and outputs I don't see any reason why the kernel function couldn't be compiled ahead of time. The piece I am missing is how to compile just the device function at run time and get a pointer to it to then pass to the kernel function.
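For context, the compile-time version of this pattern (a fixed kernel parameterised by a device function pointer, with a default implementation) might be sketched as follows; all names here are illustrative:

```cuda
// Signature shared by all user-defined functions: fixed inputs, scalar output.
typedef float (*user_fn_t)(float x, float y);

// Default behaviour when no user function is supplied.
__device__ float default_fn(float x, float y) { return x + y; }

// Device-side pointer variable: the address must be taken on the device,
// since taking default_fn's address in host code yields a host address.
__device__ user_fn_t default_fn_ptr = default_fn;

// Fixed kernel: applies whichever function it is handed.
__global__ void apply(const float *a, const float *b, float *out, int n,
                      user_fn_t fn)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fn(a[i], b[i]);
}
```

The host would read `default_fn_ptr` back with `cudaMemcpyFromSymbol` to get a device function pointer it can pass as a kernel argument. The question is how to obtain such a pointer for a function that only exists as PTX at runtime.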

talonmies
Chris Uchytil
    nvrtc is the defined method to do this. It is also straightforward with the CUDA driver API. In either case you will have to use the runtime linking facility – Robert Crovella Oct 02 '19 at 02:29
  • Thanks Robert! I am primarily using the runtime api. Are there issues if I use the driver API for compilation but use the runtime api to call the kernel? – Chris Uchytil Oct 02 '19 at 04:37
  • 1
    I don't know how to do that if by "runtime API" you mean the traditional `<<<...>>>` kernel launch syntax. If you mean one of the `cudaLaunch...` APIs, it may be possible, I haven't looked closely at that.. You might want to start by reading the documentation for nvrtc, and also studying the sample codes. – Robert Crovella Oct 02 '19 at 13:52
  • I am referring to calling kernels via `<<< >>>`. The CUDA sample code has an example using `cudaLaunch`; I'll just follow along with that. Mostly I'm trying to avoid having to manage the CUDA context, etc. with the driver API. – Chris Uchytil Oct 02 '19 at 15:07
  • You may not be able to use the runtime API for this as the image containing your kernel would not be aware of your new function. However, compiling with --keep, and reusing the cubin in a runtime compilation phase with your ptx to be jitted works (at least with cuda 10). You need to expose the jitted function pointer somehow for this to work. – Florent DUGUET Oct 03 '19 at 18:03

2 Answers


You can do that as follows:

  1. Generate your CUDA project with --keep, and look up the generated ptx or cubin for your project.
  2. At runtime, generate your ptx (in our experiment, we needed to store the function pointer in a device memory region, by declaring a global variable).
  3. Build a new module at runtime, starting with cuLinkCreate, adding first the ptx or cubin from the --keep output and then your runtime-generated ptx with cuLinkAddData.
  4. Finally, call your kernel. But you need to call it from the freshly generated module, not with the <<<>>> notation; in the latter case the call would go through the module where the function pointer is not known. This last phase should be done using the driver API (you may also want to try the runtime API's cudaLaunchKernel).

The main element is to make sure to call the kernel from the generated module, and not from the module that is magically linked with your program.
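A hedged sketch of steps 3–4 with the driver API, assuming the kernel was declared `extern "C"` as `apply` (otherwise its mangled name is needed), and that `user_ptx`/`user_ptx_size` hold the runtime-generated PTX; error checking is omitted:

```cuda
#include <cuda.h>

// Link the ahead-of-time compiled kernel with the runtime-generated PTX.
CUlinkState link;
cuLinkCreate(0, nullptr, nullptr, &link);

// 1) The precompiled kernel image from the --keep output.
cuLinkAddFile(link, CU_JIT_INPUT_CUBIN, "kernel.cubin", 0, nullptr, nullptr);

// 2) The user's device function, compiled to PTX at runtime.
cuLinkAddData(link, CU_JIT_INPUT_PTX, (void *)user_ptx, user_ptx_size,
              "user_fn.ptx", 0, nullptr, nullptr);

// Finish the link and load the resulting image as a fresh module.
void *cubin; size_t cubin_size;
cuLinkComplete(link, &cubin, &cubin_size);
CUmodule module;
cuModuleLoadData(&module, cubin);
cuLinkDestroy(link);

// Launch the kernel from *this* module, not via <<<...>>>.
CUfunction kernel;
cuModuleGetFunction(&kernel, module, "apply");
void *args[] = { &d_a, &d_b, &d_out, &n };
cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1, 0, nullptr, args, nullptr);
```

`d_a`, `d_b`, `d_out`, `n`, `blocks`, and `threads` are assumed to be set up by the caller in the usual way.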

Florent DUGUET
  • Thanks Florent. Quick question regarding the linking. If I have the ptx of the kernel and the ptx of the device function, do I need to prevent any sort of name mangling? As long as the name of the device function matches the name of the function being called in the kernel am I good to go when linking? – Chris Uchytil Oct 04 '19 at 17:19
  • 1
    Mangling should not be an issue. Both should be mangled the same way if they use same parameters. – Florent DUGUET Oct 04 '19 at 21:25
  • Really helpful! Just got it working. Not sure about your use case specifically, but you can throw your device function in a header file that the global function imports. This puts a .extern description for the device function in the global ptx file. Then you do cuLinkAddData or cuLinkAddFile for both the device.ptx and the global.ptx. No need for the function pointer in device memory. Edit: left out, I still had to make a device.cu and a global.cu for the implementation of each, with descriptors like usual in the .cuh files. – Charles Durham Jun 14 '20 at 22:16
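A sketch of the source layout described in the comment above (names are illustrative; compiling `global.cu` with relocatable device code, e.g. `-rdc=true`, is what leaves the unresolved `.extern` reference in its PTX):

```cuda
// user_fn.cuh -- declaration shared by both translation units.
__device__ float user_fn(float x, float y);

// global.cu -- the fixed kernel; compiling this alone leaves an
// unresolved .extern .func for user_fn in the generated PTX.
#include "user_fn.cuh"
extern "C" __global__ void apply(const float *a, const float *b,
                                 float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = user_fn(a[i], b[i]);
}

// device.cu -- one possible user-defined implementation, compiled
// separately (or NVRTC-generated) and linked in at runtime.
#include "user_fn.cuh"
__device__ float user_fn(float x, float y) { return x * y; }
```

The two PTX files then go through the same `cuLinkAddData`/`cuLinkAddFile` flow, and the linker resolves the `user_fn` reference.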

I have access to the user defined function's PTX code at runtime and am wondering if I could use something like NVIDIA's jitify to compile the PTX at run time, get a pointer to the device function, and then pass this device function to the precompiled kernel function.

No, you cannot do that. NVIDIA's APIs do not expose device functions, only complete kernels, so there is no way to obtain a pointer to a runtime-compiled device function.

You can perform runtime linking of a pre-compiled kernel (PTX or cubin) with device functions you runtime compile using NVRTC. However, you can only do this via the driver module APIs. That functionality is not exposed by the runtime API (and based on my understanding of how the runtime API works it probably can't be exposed without some major architectural changes to the way embedded statically compiled code is injected at runtime).
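The NVRTC half of that (producing PTX suitable for the driver-side link; the link itself is the `cuLinkAddData` flow described in the other answer) might look roughly like this, for a hypothetical user function `user_fn`; error handling is omitted:

```cuda
#include <nvrtc.h>
#include <vector>

// The user-supplied device function, received as source at runtime.
const char *src =
    "__device__ float user_fn(float x, float y) { return x * y; }\n";

nvrtcProgram prog;
nvrtcCreateProgram(&prog, src, "user_fn.cu", 0, nullptr, nullptr);

// Relocatable device code is required so the result can be linked
// against the precompiled kernel instead of forming a whole program.
const char *opts[] = { "--relocatable-device-code=true" };
nvrtcCompileProgram(prog, 1, opts);

// Retrieve the PTX; this buffer feeds cuLinkAddData as CU_JIT_INPUT_PTX.
size_t ptx_size;
nvrtcGetPTXSize(prog, &ptx_size);
std::vector<char> ptx(ptx_size);
nvrtcGetPTX(prog, ptx.data());
nvrtcDestroyProgram(&prog);
```

In a real build you would also check the `nvrtcResult` codes and dump the compilation log with `nvrtcGetProgramLog` on failure.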

talonmies
  • Hey talonmies, thanks for the response. My hope was that I could use `cuModuleLoadDataEx`, then grab the function with `cuModuleGetFunction` and pass it in as a pointer. It sounds like from what you are saying this isn't possible. If I have the ptx code for a kernel function and the ptx code for the device function do I just concatenate the strings together or something and then run `cuModuleLoadDataEx`? What is the process of runtime linking the kernel and the device code? – Chris Uchytil Oct 03 '19 at 16:22
  • It looks like `cuLinkAddData` is what I should be using here. Do I just use it for each ptx string and then use `cuLinkCreate` to link the two? – Chris Uchytil Oct 03 '19 at 16:31
  • @ChrisUchytil I was wondering if you were able to make this work. If yes, would be great if you could share some pointers about the same. Thanks! – CodeCollector Jun 12 '21 at 15:39
  • @CodeCollector unfortunately it isn't quite as simple as I hoped it would be. When generating a module you do not have access to device function pointers, only kernel function pointers. On top of that I don't believe a kernel could even call a device function from another module even if you could get access to the device function pointers meaning a recompilation is always necessary. What I had to do was use NVRTC to jit compile the kernel function which calls the device functions along with the code for the device function each time a new device function is created. – Chris Uchytil Jun 15 '21 at 18:19
  • @CodeCollector: It isn't possible. There is no dynamic device code linkage (note runtime and dynamic are not the same thing, even JIT/runtime linkage is static) and the CUDA APIs themselves don't expose device functions, only kernels and static device symbols. – talonmies Jun 16 '21 at 02:16
  • @ChrisUchytil thanks for the clarification! – CodeCollector Jun 18 '21 at 23:39