I have some kernels that I have written in both OpenCL and CUDA. When running OpenCL programs in the AMD profiler, it allows me to view the assembly code of the kernel. I would like to compare this with the assembly code of the CUDA kernels to compare compiler optimizations between the two languages. I have been playing around with the Nvidia Profiler, but am still at a loss on how to get the assembly code of my kernels. How does one go about doing this?
-
I'm not familiar enough with GPGPU to make this an answer, but I suspect there's no useful comparison to be made here. AMD and Nvidia parts are sufficiently different that they probably don't even use the same assembly language. – Dec 09 '13 at 23:07
-
I realize there will be significant differences, but there is a specific optimization I'm looking to compare. I just really need to look over the assembly, regardless of differences, and should be able to identify relative similarities. – PseudoPsyche Dec 09 '13 at 23:26
-
$.02 says you won't get any official info even when signing away an arm and a leg in an NDA. See also http://stackoverflow.com/questions/7353136/is-there-an-assembly-language-for-cuda and http://stackoverflow.com/questions/9798258/what-is-sass-short-for – nos Dec 09 '13 at 23:32
-
@PseudoPsyche: Even if you can get CUDA to emit some assembly, the differences between that and ATI's assembly are likely to be so vast that you won't be able to identify any similarities at all. – Dec 10 '13 at 00:05
-
@nos Thanks for that second link! Turns out that's what I was looking for! – PseudoPsyche Dec 10 '13 at 00:31
-
@duskwuff turns out PTX is what I was looking for. It gives me enough information to make the comparison I was looking for. – PseudoPsyche Dec 10 '13 at 00:31
-
Please move the solution to an answer and remove it from the question to respect the site conventions. – einpoklum Dec 10 '13 at 06:16
-
In CUDA 5.5 and above you can use nvdisasm (replacing cuobjdump) to get the SASS for kernels. Nsight VSE >= 3.1 and Visual Profiler >= 5.5 can also show the SASS as well as collect per-instruction statistics. – Greg Smith Jul 01 '14 at 00:00
2 Answers
As mentioned by turboscrew, the closest thing to assembly for CUDA is the PTX code. I thought it would be more useful to add to this answer the method of actually generating the PTX code.
This can be generated in the following way:
nvcc -ptx -o kernel.ptx kernel.cu
where kernel.cu is your source file and kernel.ptx is the destination PTX file.
Also, here is a link to NVidia's PTX documentation:
http://docs.nvidia.com/cuda/parallel-thread-execution/index.html
If you have some assembly knowledge, most of it is fairly straightforward. Some special functions do appear, though, and it can be worth looking those up in the documentation for details.
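For completeness, here is a minimal sketch of both output levels side by side, assuming a CUDA toolkit with nvcc and cuobjdump on the PATH (sm_30 is only a placeholder architecture; substitute your GPU's compute capability):

```shell
# PTX: the virtual, forward-compatible intermediate ISA
nvcc -ptx -o kernel.ptx kernel.cu

# SASS: the actual machine code for one concrete architecture.
# Compile to a cubin for that target, then disassemble it.
nvcc -arch=sm_30 -cubin -o kernel.cubin kernel.cu
cuobjdump -sass kernel.cubin
```

PTX is the better level for comparing compiler output across sources, since SASS changes between GPU generations.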

-
"the closest thing to assembly for CUDA is the PTX code" is wrong. The assembly can be inspected directly using [the cuobjdump tool](http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#axzz36AnCbaAh). – Robert Crovella Jul 01 '14 at 04:06
-
@RobertCrovella Is there a way to go directly from the .cu source to assembly for the current GPU with nvcc? I'm mostly interested in register usage, but if my understanding is correct PTX is an SSA format. – Todd Sewell Mar 29 '22 at 13:08
-
Yes, that is what `nvcc` does. It compiles .cu source to SASS. Compile to a binary format (such as an executable) then use the `cuobjdump` utility on the executable. There are numerous questions here on the `cuda` tag discussing this. If you want to see register usage, there are other [binary utilities](https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html) that can help with that. – Robert Crovella Mar 29 '22 at 13:35
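A short sketch of the workflow from the comment above, for checking register usage specifically (flag names are from the nvcc and CUDA binary-utilities documentation; the sm_70 target is only an example):

```shell
# Print per-kernel register and shared-memory usage while compiling
# (-Xptxas -v forwards the verbose flag to the ptxas assembler):
nvcc -arch=sm_70 -c -Xptxas -v kernel.cu

# Or inspect an already-built binary after the fact:
cuobjdump -res-usage a.out

# And dump the SASS itself:
cuobjdump -sass a.out
```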
You want to read PTX? You don't get any closer to the assembly. Nvidia hasn't published the assembly of its GPUs. The "assembly" is PTX, and it's "pseudo assembly" executed by a bytecode interpreter in the driver.

-
Thanks! Yes, the PTX code was exactly what I wanted! I had read about PTX but didn't realize that it was actually what I was looking for. I thought there was another level down that was actual assembly or something. – PseudoPsyche Dec 10 '13 at 00:29
-
Err, CUDA has been shipping with an official tool called cuobjdump which will disassemble the actual binary machine code emitted by the assembler. Any object file, cubin, library or application can be processed in this way. Also PTX isn't executed by a "byte code interpreter" in the driver, there is no such thing. PTX is assembled into microcode using a traditional assembler (called ptxas, also shipping in every toolkit) and runs "on the metal" in the GPU. Just about everything in this answer is wrong, I am afraid. – talonmies Dec 10 '13 at 01:31
-
From the answer to my question some time back, I had a different understanding: JIT-compilation. https://devtalk.nvidia.com/default/topic/551214/gpu-assembly/ To my understanding, the machine code differs from GPU to GPU so much that it makes no sense trying to learn it. Looks like nVidia is not promising any machine code level compatibilities. – turboscrew Dec 10 '13 at 08:10
-
Jit compilation is just the driver running the assembler on PTX code at runtime. There isn't anything like Android Dalvik or the Java VM here. And NVIDIA *ship* a document describing the machine code for the most recent architectures. It is true that the original Tesla instruction set differs a little from Fermi and Kepler, but the latter two (representing about 4 years worth of hardware designs) are rather evolutionary, even if the silicon itself has changed a lot – talonmies Dec 10 '13 at 08:27
-
"And NVIDIA ship a document describing the machine code" - Darn. I once specifically asked for the machine code specs, but I got the reply: "no can do". AMD has published its machine code of the main architectures. – turboscrew Dec 10 '13 at 12:11