10

I need to modify the PTX code and compile it directly. The reason is that I want certain specific instructions right after each other, and it is difficult to write CUDA code that results in my target PTX, so I need to modify the PTX directly. The problem is that I can compile the PTX to a .fatbin or .cubin, but I don't know how to compile those (.fatbin and .cubin) files into an "X.o" object file.

user2998135
  • If your usage of PTX is relatively narrow, to achieve specific instruction sequencing, you may also want to consider using inline PTX. There is a CUDA [sample code](http://docs.nvidia.com/cuda/cuda-samples/index.html#using-inline-ptx) as well as a supporting [reference document](http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html). These methods would allow you to avoid the driver API entirely if you wanted to. – Robert Crovella Nov 16 '13 at 01:48
  • I second Robert Crovella's recommendation to look at inline PTX. For small to medium sized pieces of code, I find that inline PTX is often the easiest and fairly painless way to achieve more control over the generated code (since PTX is compiled, complete control is not possible). Depending on your use case, consider writing a simple task-specific PTX code generator in the programming language of your choice, I have used that for a few of my own projects. – njuffa Nov 16 '13 at 18:15
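
A minimal sketch of the inline-PTX approach mentioned in the comments above (the asm() constraint syntax is documented in the linked reference document; the function name here is just illustrative):

__device__ __forceinline__ int add_via_ptx(int a, int b)
{
    int result;
    // the add.s32 instruction is emitted at exactly this point in the generated code
    asm("add.s32 %0, %1, %2;" : "=r"(result) : "r"(a), "r"(b));
    return result;
}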

6 Answers

11

There may be a way to do this with an orderly sequence of nvcc commands, but I'm not aware of it and haven't discovered it.

One possible approach, albeit messy, is to interrupt and restart the cuda compilation sequence, and edit the ptx file in the interim (before the restart). This is based on information provided in the nvcc manual, and I would not consider this a standard methodology, so your mileage may vary. There may be any number of scenarios that I haven't considered where this doesn't work or isn't feasible.

In order to explain this I shall present an example code:

#include <stdio.h>

__global__ void mykernel(int *data){

  (*data)++;
}

int main(){

  int *d_data, h_data = 0;
  cudaMalloc((void **)&d_data, sizeof(int));
  cudaMemcpy(d_data, &h_data, sizeof(int), cudaMemcpyHostToDevice);
  mykernel<<<1,1>>>(d_data);
  cudaMemcpy(&h_data, d_data, sizeof(int), cudaMemcpyDeviceToHost);
  printf("data = %d\n", h_data);
  return 0;
}

For this purpose, I am dispensing with cuda error checking and other niceties, in favor of brevity.

Ordinarily we might compile the above code as follows:

nvcc -arch=sm_20 -o t266 t266.cu 

(assuming the source file is named t266.cu)

Instead, based on the reference manual, we'll compile as follows:

nvcc -arch=sm_20 -keep -o t266 t266.cu

This will build the executable, but will keep all intermediate files, including t266.ptx (which contains the ptx code for mykernel).

If we simply ran the executable at this point, we'd get output like this:

$ ./t266
data = 1
$

The next step will be to edit the ptx file to make whatever changes we want. In this case, we'll have the kernel add 2 to the data variable instead of adding 1. The relevant line is:

    add.s32         %r2, %r1, 2;
                              ^
                              |
                          change the 1 to a 2 here
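
For context, in the unmodified ptx file the body of mykernel will look something like this (register numbering and addressing details vary with the CUDA version, so match on the add.s32 line rather than the exact registers):

    ld.global.u32   %r1, [%rd3];
    add.s32         %r2, %r1, 1;
    st.global.u32   [%rd3], %r2;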

Now comes the messy part. The next step is to capture all the intermediate compile commands, so we can rerun some of them:

nvcc -dryrun -arch=sm_20 -o t266 t266.cu --keep 2>dryrun.out

(Using linux redirection of stderr here). We then want to edit that dryrun.out file so that:

  1. we retain all the commands after the creation of the ptx file, up to the end of the file. The line that creates the ptx file will be evident as the one which specifies -o "t266.ptx"
  2. we strip out the leading #$ that each line begins with, so in effect we are creating a script (one way to do both steps with sed is sketched just below this list).
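
A rough way to automate both edits (a sketch; the match pattern assumes the file names used in this example):

sed -e '1,/-o "t266\.ptx"/d' -e 's/^#\$ *//' dryrun.out > rerun.sh
chmod +x rerun.sh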

When I perform the above 2 steps, I end up with a script like this:

ptxas  -arch=sm_20 -m64  "t266.ptx"  -o "t266.sm_20.cubin"
fatbinary --create="t266.fatbin" -64 --key="xxxxxxxxxx" --ident="t266.cu" "--image=profile=sm_20,file=t266.sm_20.cubin" "--image=profile=compute_20,file=t266.ptx" --embedded-fatbin="t266.fatbin.c" --cuda
gcc -D__CUDA_ARCH__=200 -E -x c++   -DCUDA_DOUBLE_MATH_FUNCTIONS   -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/usr/local/cuda/bin/..//include"   -m64 -o "t266.cu.cpp.ii" "t266.cudafe1.cpp"
gcc -c -x c++ "-I/usr/local/cuda/bin/..//include"   -fpreprocessed -m64 -o "t266.o" "t266.cu.cpp.ii"
nvlink --arch=sm_20 --register-link-binaries="t266_dlink.reg.c" -m64   "-L/usr/local/cuda/bin/..//lib64" "t266.o"  -o "t266_dlink.sm_20.cubin"
fatbinary --create="t266_dlink.fatbin" -64 --key="t266_dlink" --ident="t266.cu " -link "--image=profile=sm_20,file=t266_dlink.sm_20.cubin" --embedded-fatbin="t266_dlink.fatbin.c"
gcc -c -x c++ -DFATBINFILE="\"t266_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"t266_dlink.reg.c\"" -I. "-I/usr/local/cuda/bin/..//include"   -m64 -o "t266_dlink.o" "/usr/local/cuda/bin/crt/link.stub"
g++ -m64 -o "t266" -Wl,--start-group "t266_dlink.o" "t266.o"   "-L/usr/local/cuda/bin/..//lib64" -lcudart_static  -lrt -lpthread -ldl  -Wl,--end-group

Finally, execute the above script. (In linux you can make the script file executable using chmod +x dryrun.out or similar.) If you haven't made any mistakes while editing the .ptx file, the commands should all complete successfully and create a new t266 executable.

When we run that file, we observe:

$ ./t266
data = 2
$

Indicating that our changes were successful.

Robert Crovella
  • Got this error sh: 1: bin2c: not found In file included from t266.cudafe1.stub.c:1:0: t266.cudafe1.stub.c: In function ‘void __sti____cudaRegisterAll_12_t266_cpp1_ii_ea754d28()’: t266.cudafe1.stub.c:2:126: error: ‘__fatDeviceText’ was not declared in this scope #include "crt/host_runtime.h" sh: 1: bin2c: not found crt/link.stub: In function ‘void __cudaRegisterLinkedBinary(const __fatBinC_Wrapper_t*, void (*)(void**), void*)’: crt/link.stub:102:60: error: ‘__fatDeviceText’ was not declared in this scope __cudaFatCubinHandle = __cudaRegisterFatBinary((void*)&__fatDeviceText); – Ginu Jacob Jun 20 '16 at 09:49
  • The method still works for me. [Here](http://pastebin.com/aM9DQptV) is a complete console session covering every step of the process. Note that you can't use the "script" I have in the question -- that won't work. The temporary file names change with every run. You must edit your own script from the dryrun.out file for this to work. – Robert Crovella Jun 22 '16 at 00:57
  • I tried your solution once again. I use the same example as t266.cu and made sure that the file names generated are correct according to the script. The problem I am getting is "sh: bin2c: command not found" and further to this getting an error in some internal function "__cudaRegisterLinkedBinary"; the error is "error: '__fatDeviceText' was not declared in this scope". The change made to the PTX is exactly the same as that of the demo. It seems the problem is somewhere around "bin2c", which is used internally – Ginu Jacob Jun 23 '16 at 04:52
  • What CUDA version are you using? Provide a complete transcript, just as I have done. – Robert Crovella Jun 23 '16 at 04:55
  • The version used is CUDA 7.5.18 – Ginu Jacob Jun 23 '16 at 05:03
  • I don't have any trouble running `bin2c` from the command line on my machine. Perhaps your CUDA install is broken. When I do: `bin2c --version` I get sensible output: `bin2c: NVIDIA (R) File to C data converter ...` – Robert Crovella Jun 23 '16 at 05:03
  • No version of CUDA is placed as default. For example to do NVCC compilation, I use `/usr/local/cuda-7.5/nvcc ` instead of `nvcc `. Similarly in the script I had modified the path to `ptxas`, `fatbin` and `nvlink`. If I try "bin2c --version" I will get "command not found". Instead I need to do `/usr/local/cuda-7.5/bin/bin2c --version` this gives me: `bin2c: NVIDIA (R) File to C data converter Copyright (c) 2005-2015 NVIDIA Corporation Built on Tue_Aug_11_14:27:20_CDT_2015 Cuda compilation tools, release 7.5, V7.5.17` – Ginu Jacob Jun 23 '16 at 05:14
  • Somewhere in the script I may need to specify the correct path to bin2c. But the bin2c is not specified anywhere in the script – Ginu Jacob Jun 23 '16 at 05:15
  • That's why it's not working. A proper CUDA install has the path to `nvcc` placed into the PATH environment variable. Read [this](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions), where it states: "The PATH variable **needs to include** /usr/local/cuda-7.5/bin". Notice this is listed in the section entitled **Mandatory Actions**. – Robert Crovella Jun 23 '16 at 05:16
3

Usually, when working with cubin or ptx files, one uses the CUDA Driver API and not the Runtime API; you then load the ptx or cubin file manually at runtime with cuModuleLoadDataEx. If you want to stick with the Runtime API, you need to mimic manually what NVCC does, but this is not (entirely) documented. I only found this Nvidia forum entry on how to do this.

kunzmi
  • In addition, there is the CUDA [ptx JIT sample](http://docs.nvidia.com/cuda/cuda-samples/index.html#ptx-just-in-time-compilation) that demonstrates how to use the driver API to load PTX and also shows how it can interoperate with the runtime API. – Robert Crovella Nov 16 '13 at 01:46
0

You can load a cubin or fatbin at runtime using the cuModuleLoad* functions in CUDA. Here's the API.

You can use it to include PTX into your build, though the method is somewhat convoluted. For instance, suricata compiles its .cu files into PTX files for different architectures, then converts them into an .h file that contains the PTX code as a C array, and just includes that header from one of the files during the build.
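
A rough sketch of that trick (the file and symbol names here are illustrative, not suricata's actual ones): generate a header from the PTX at build time, e.g. with xxd -i mykernel.ptx > mykernel_ptx.h, then hand the NUL-terminated PTX string to the driver API:

#include <cuda.h>
#include <string>
#include "mykernel_ptx.h"   // defines unsigned char mykernel_ptx[] and unsigned int mykernel_ptx_len

// ... after cuInit(), cuDeviceGet() and cuCtxCreate() ...

// the image passed to cuModuleLoadData must be NUL-terminated PTX text;
// std::string::c_str() guarantees the terminator
std::string image(reinterpret_cast<const char *>(mykernel_ptx), mykernel_ptx_len);

CUmodule module;
cuModuleLoadData(&module, image.c_str());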

ArtemB
0

I am rather late but GPU Lynx does exactly that: take a CUDA fat binary, parse the PTX, and modify it before emitting the result to the driver for execution on a GPU. You can optionally print out the modified PTX as well.

omikun
0

This sequence of nvcc commands seems to do the trick. Please see here for more details.

Create your ptx files to modify

nvcc file1.cu file2.cu file3.cu -rdc=true --ptx

Link ptx files into an object file

nvcc file1.ptx file2.ptx file3.ptx -dlink

I did this on Windows, so it popped out a_dlink.obj. As the documentation points out, host code has been discarded by this point. Run

nvcc file1.cu file2.cu file3.cu -rdc=true --compile

to create object files. They will be .obj for Windows or .o for Linux. Then create a library output file

nvcc file1.obj file2.obj file3.obj a_dlink.obj --lib -o myprogram.lib

Then run

nvcc myprogram.lib

which will pop out an executable a.exe on Windows or a.out on Linux. This procedure works for cubin and fatbin files too. Just substitute those names in place of ptx.

  • I tried the suggested compilation pipeline, but it didn't work for me. Despite making changes to the .ptx files, they were not reflected in the final executable. – C. Flint Apr 19 '23 at 09:14
0

You can use NVRTC - it's easy!

Expanding on @ArtemB's answer:

NVIDIA offers a runtime compilation library (NVRTC). There's an example of how it's used as part of the CUDA samples; you can access it here.

The sample actually starts from CUDA code, but the intermediate step is creating the PTX code as a plain C string (`char *`). From there, this is what you do, basically:

char* ptx;
size_t ptxSize;

// ... populate ptx and ptxSize somehow ...

CUcontext context;
CUdevice cuDevice;

// These next few lines simply initialize your work with the CUDA driver,
// they're not specific to PTX compilation
cuInit(0);
cuDeviceGet(&cuDevice, 0); // or some other device on your system
cuCtxCreate(&context, 0, cuDevice);

// The magic happens here:
CUmodule module;
cuModuleLoadDataEx(&module, ptx, 0, 0, 0);

// And here is how you use your compiled PTX
CUfunction kernel_addr;
cuModuleGetFunction(&kernel_addr, module, "my_kernel_name");
cuLaunchKernel(kernel_addr, 
   // launch parameters go here
   // kernel arguments go here
);

Notes:

  • I've removed all error checks so as not to clutter the example code - but do check for errors in your code.
  • You'll need to link your program with the NVRTC library - it's separate from the main CUDA and CUDA driver libraries. On Linux, it's called libnvrtc.so; a typical link line is sketched just below this list.
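
For example (a sketch; file names and paths are illustrative, so adjust them for your CUDA install and for whether you actually call NVRTC functions):

g++ -o my_app my_app.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lnvrtc -lcuda
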
einpoklum