
Suppose I take a CUDA program (for example, the CUDA vectorAdd sample) and cut out the kernel's implementation, but still have the launch command:

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

and suppose that I write my own PTX, since I'm a DIY kind of guy, so that now I have vectorAdd.cu without the kernel's CUDA code, plus vectorAdd.ptx.

Can I now generate an executable which will work like the unmodified vectorAdd would, but running the code in the PTX?

(Assume for the sake of discussion that the PTX doesn't try anything funny or do anything wrong.)

Notes:

  • This question is a variant on:

    How can I create an executable to run a kernel in a given PTX file?

    Except that, in that question, the poster was willing to use the driver API to dynamically load and compile a PTX file. Here, that's not an option: the C++ code uses a triple-chevron CUDA runtime launch, and this must not change.

  • I don't mind the process of creating the executable involving the generation of other files, e.g. a cubin.

einpoklum
  • I don't understand your question (or the rationale for it) at all. Two things: the other question *isn't* using nvrtc at all (that is a runtime compiler for device C++ code), and what you seem to be describing is the basic runtime API PTX JIT functionality, which has existed since the first CUDA beta release. As far as I can tell, what you are asking is *exactly* the same as the question you linked to. What different answer are you expecting? – talonmies May 30 '21 at 01:52
  • @talonmies: See edit of last paragraph. You're right about NVRTC, the other question doesn't use it. – einpoklum Jun 02 '21 at 08:32
  • So you are asking if you can do JIT with the runtime API? Why not actually ask that question? Anyway the answer is one word -- no. – talonmies Jun 02 '21 at 09:04
  • @talonmies: No, I don't want to do any JIT. I want to compile ahead-of-time - but instead of the kernel code being C++/CUDA, the kernel code is PTX. – einpoklum Jun 02 '21 at 09:34
  • The GPU doesn't run PTX and what you want makes no sense. There must be a compilation/assembly phase at some point. So when you say "PTX" you really mean a cubin file? – talonmies Jun 02 '21 at 09:40
  • @talonmies: I literally said "I want to compile", so yes, compilation. But AOT rather than JIT. Also, no dynamically loading a cubin file at runtime: The host-side code stays as it is. – einpoklum Jun 02 '21 at 09:46
  • Compiled PTX = object file = cubin file . – talonmies Jun 02 '21 at 09:48
  • @talonmies: I don't mind a cubin file being involved in the process, but that's a detail. I want to end up with an executable file which runs the existing C++ program and invokes PTX version of vectorAdd. – einpoklum Jun 02 '21 at 09:51
  • So PTX is actually irrelevant to the question as well. You want to load precompiled object code and launch it with the runtime API. The answer is still No. – talonmies Jun 02 '21 at 09:57
  • @talonmies: 1. It's not irrelevant, in that PTX is the input for the use-case. 2. It's not "still no", because before, you could not understand what I meant by the question and closed it. If the answer is now "no", please reopen / cast a reopen vote, then answer "no". – einpoklum Jun 02 '21 at 10:00
  • Is inline PTX an option? https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html You would need a vectorAdd stub – Sebastian Jun 02 '21 at 16:48
  • @Sebastian: A stub in a separate file? Hmm. Can you write an answer based on that? – einpoklum Jun 02 '21 at 17:07
  • I could do; or would you prefer an answer based on changing the compile chain - have a look at https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory and try out the -keep option to see the generated files for your own example programs. E.g. the cudafe1.stub.c files with __cudaLaunchPrologue, __cudaSetupArgSimple, __cudaLaunch calls. If you change files between the compile phases, you could fully exchange the generated ptx files with your own. Whereas the inline ptx solution only allows you to put ptx code within a (possibly otherwise empty function). – Sebastian Jun 02 '21 at 18:12
  • And also have a look at __nv_cudaEntityRegisterCallback with __cudaRegisterEntry calls (inside the stub files) – Sebastian Jun 02 '21 at 18:16

1 Answer


EDIT/Update

I had to change some of the procedure for a newer toolkit; the old description is kept further below. Some details have changed with newer Toolkit versions, and I have added a bit more documentation along the way.

Tested on Windows with VS 2022 19.31.31104, an sm_75 Turing GPU and nvcc 11.7, doing a debug build. It should also work on Linux and/or with other host compilers.

Steps 1 and 2

We can either start with full kernels and patch the resulting PTX or SASS, or we can create stub kernels.

Here I use the example project from the VS CUDA plugin: main() and an addWithCuda() helper function for the host, plus an addKernel() __global__ device kernel, all in a single file kernel.cu.
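
For the stub-kernel variant, kernel.cu only needs a kernel with the right signature; the body can stay empty, since the real logic will later come from the hand-written PTX. A minimal sketch (assuming the standard template's addKernel signature; main() and addWithCuda() stay as the template generates them):

// kernel.cu -- stub variant (sketch only); the implementation is supplied later via the patched PTX
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

__global__ void addKernel(int *c, const int *a, const int *b)
{
    // intentionally empty -- only the signature matters for the generated host stub and PTX skeleton
}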

Step 3

Copy the command line from Visual Studio, remove the paths, and add the --keep option. We also remove -x cu and the output-file option -o kernel.cu.obj, so that less has to change later on:

nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "D:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX64\x64" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include"  -G  --keep -maxrregcount=0  --machine 64 --compile -cudart static  -g  -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdvc143.pdb /FS /Zi /RTC1 /MDd" kernel.cu

This command should be run on the command line with x64 paths for the Visual Studio compiler.

The following files are generated:

kernel.cpp1.ii (raw code after the preprocessor, with different preprocessor macros)
kernel.cpp1.ii.res (command line parameters for the preprocessor)
kernel.cpp4.ii (raw code after the preprocessor, with different preprocessor macros)
kernel.cpp4.ii.res (command line parameters for the preprocessor)
kernel.cu.obj (compiled code from C++ and Cuda)
kernel.cu.obj.res (command line parameters)
kernel.cudafe1.c (runtime stub, includes kernel.cudafe1.stub.c)
kernel.cudafe1.cpp (C++ code after the preprocessor; the CUDA code remains only as declarations (the rest is under #if 0), and kernel launches go through __cudaPushCallConfiguration)
kernel.cudafe1.gpu (Cuda code after preprocessor, C++ code removed)
kernel.cudafe1.stub.c (runtime stub to register and launch the kernel)
kernel.fatbin (fatbin of the kernel)
kernel.fatbin.c (dump of kernel.fatbin as C array)
kernel.module_id (ID identifying the kernel)
kernel.ptx (The ptx from the original .cu)
kernel.sm_75.cubin (The compiled ptx file from the original .cu)
vc143.pdb (Visual Studio program database - host stuff)

Important to keep are kernel.cudafe1.cpp and kernel.cudafe1.stub.c, which enable the registration and runtime launch of the kernel.

Step 4

The kernel.ptx file can be used as a skeleton for writing your own PTX code; alternatively, kernel.sm_75.cubin, or directly kernel.fatbin.c, can be used for patching the SASS code.
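
For illustration, a hand-written replacement for kernel.ptx could look roughly like the sketch below (it assumes the template's addKernel(int*, const int*, const int*) signature and one element per thread; the mangled entry name and parameter layout must match what the generated stub registers, so it is safest to copy them from the nvcc-generated kernel.ptx rather than trust the name shown here):

// minimal hand-written PTX sketch for addKernel: c[i] = a[i] + b[i], with i = threadIdx.x
.version 7.7          // PTX ISA version emitted by CUDA 11.7; adjust to match the generated kernel.ptx
.target sm_75
.address_size 64

.visible .entry _Z9addKernelPiPKiS1_(
    .param .u64 _Z9addKernelPiPKiS1__param_0,   // int*       c
    .param .u64 _Z9addKernelPiPKiS1__param_1,   // const int* a
    .param .u64 _Z9addKernelPiPKiS1__param_2    // const int* b
)
{
    .reg .b32 %r<5>;
    .reg .b64 %rd<11>;

    ld.param.u64        %rd1, [_Z9addKernelPiPKiS1__param_0];
    ld.param.u64        %rd2, [_Z9addKernelPiPKiS1__param_1];
    ld.param.u64        %rd3, [_Z9addKernelPiPKiS1__param_2];
    cvta.to.global.u64  %rd4, %rd1;             // c in the global address space
    cvta.to.global.u64  %rd5, %rd2;             // a
    cvta.to.global.u64  %rd6, %rd3;             // b
    mov.u32             %r1, %tid.x;            // i = threadIdx.x
    mul.wide.s32        %rd7, %r1, 4;           // byte offset = i * sizeof(int)
    add.s64             %rd8, %rd5, %rd7;       // &a[i]
    add.s64             %rd9, %rd6, %rd7;       // &b[i]
    ld.global.u32       %r2, [%rd8];
    ld.global.u32       %r3, [%rd9];
    add.s32             %r4, %r2, %r3;
    add.s64             %rd10, %rd4, %rd7;      // &c[i]
    st.global.u32       [%rd10], %r4;
    ret;
}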

Step 5

Now we assemble the .ptx file with -fatbin -dlink, changing the input file from kernel.cu to kernel.ptx:

nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "D:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX64\x64" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include"  -G  --keep -maxrregcount=0  --machine 64 -cudart static  -g  -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdvc143.pdb /FS /Zi /RTC1 /MDd " -fatbin -dlink kernel.ptx

We get:

a_dlink.fatbin
a_dlink.sm_75.cubin
a_dlink.sm_75.cubin.optf
kernel.fatbin
kernel.fatbin.c (fatbin dump from new ptx)
kernel.obj
kernel.obj.res
kernel.sm_75.cubin
vc143.pdb

Note: Instead of a .ptx file, a .cubin file can also be turned into a .fatbin.c with -fatbin -dlink, but then I had to specify a non-virtual compute architecture, e.g. with --gpu-architecture=sm_75.

Now we have the host files together: kernel.fatbin.c, kernel.cudafe1.cpp and kernel.cudafe1.stub.c.

Steps 6 to 8

We can put those through host compilation:

nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "D:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX64\x64" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include"  -G  --keep -maxrregcount=0  --machine 64 --compile -cudart static  -g  -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdvc143.pdb /FS /Zi /RTC1 /MDd" kernel.cudafe1.cpp input\kernel.fatbin.c

Step 9

and through the host linker:

link /OUT:"kernel.exe" /MANIFEST /NXCOMPAT /DYNAMICBASE "cudart_static.lib" "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" "cudart.lib" "cudadevrt.lib" /DEBUG /MACHINE:X64 /INCREMENTAL /SUBSYSTEM:CONSOLE /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /ERRORREPORT:PROMPT /NOLOGO /LIBPATH:"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\lib\x64" /TLBID:1 kernel.cudafe1.obj kernel.fatbin.obj

Finished

This creates kernel.exe.


Old Description

Perhaps still relevant with other toolkit versions.

  1. Define your function as
    __global__ void vectorAdd(void* d_A, void* d_B, void* d_C, int numElements);
    in a header visible to the caller
  2. Create a file vectorAdd.cu with an empty definition
    __global__ void vectorAdd(void* d_A, void* d_B, void* d_C, int numElements) {}
  3. Call
    nvcc --keep vectorAdd.cu
    with suitable options
  4. Replace vectorAdd.ptx with your version
  5. Call
    nvcc -fatbin -dlink
    on the .ptx file to create the fatbin and cubin files
  6. Call nvcc -link to link the .cubin file and the .cudafe1.cpp or .cudafe1.c file (depending on the language); these in turn include the .cudafe1.stub.c and the .fatbin.c files
  7. Use the resulting .obj or .o file (Windows/Linux) in your project
  8. Launch vectorAdd<<<...>>>(...) in the usual CUDA runtime way, as sketched below
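
For reference, the host side stays a plain runtime-API launch against the declaration from step 1; a minimal sketch (the launchVectorAdd() wrapper is hypothetical, and d_A, d_B, d_C are assumed to have been allocated with cudaMalloc):

#include <cuda_runtime.h>

// Declaration from the header of step 1; the definition is supplied by the linked-in fatbin/stub objects.
__global__ void vectorAdd(void* d_A, void* d_B, void* d_C, int numElements);

void launchVectorAdd(void* d_A, void* d_B, void* d_C, int numElements)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements); // unchanged triple-chevron launch
    cudaDeviceSynchronize();
}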

(As an advanced DIY guy, you will want to write SASS code in the future; SASS is the device-specific, lower-level assembly language.)

Sebastian
  • In step (5.), what file(s) am I calling nvcc for? – einpoklum Jun 03 '21 at 06:45
  • On the ptx file (added it), you can also specify the Cuda architecture – Sebastian Jun 03 '21 at 13:00
  • Ok, how about the `nvcc -link`? Which arguments will it take? Not the `.c` or `.cpp` files I would guess. – einpoklum Jun 03 '21 at 13:31
  • nvcc -link compiles and links host and device code. It can call a specified C++ compiler for host code. The .cpp and .c files were generated in step 3, the .fatbin.c (which is included by the .cudafe1.cpp) is generated in step 5. You can use the same arguments as your usual makefile/IDE does for compiling the .cu files, as input you specify the mentioned .cpp/.c files. The kernel resides in the fatbin.c and is automatically loaded by the runtime API. – Sebastian Jun 03 '21 at 13:42