
I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures.

From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the programmer's application, and also the minimum device compute architecture that NVCC's JIT compiler will compile PTX code for.

I also understand that the "code" parameter of -gencode is the compute architecture which NVCC completely compiles the application for, such that no JIT compilation is necessary.

After inspection of various CUDA project Makefiles, I've noticed the following occur regularly:

-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21

and after some reading, I found that multiple device architectures can be compiled for in a single binary file - in this case sm_20 and sm_21.

My questions are: why are so many arch/code pairs necessary? Are all of the "arch" values above actually used?

What is the difference between that and, say:

-arch compute_20
-code sm_20
-code sm_21

Is the earliest virtual architecture in the "arch" fields selected automatically, or is there some other obscure behaviour?

Is there any other compilation and runtime behaviour I should be aware of?

I've read the manual, http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation and I'm still not clear regarding what happens at compilation or runtime.

James Paul Turner

2 Answers


Roughly speaking, the code compilation flow goes like this:

CUDA C/C++ device code source --> PTX --> SASS

The virtual architecture (e.g. compute_20, whatever is specified by -arch compute...) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is actually executable object code for a GPU (machine language). An executable can contain multiple versions of SASS and/or PTX, and there is a runtime loader mechanism that will pick appropriate versions based on the GPU actually being used.

As you point out, one of the handy features of GPU operation is JIT-compile. JIT-compile will be done by the GPU driver (does not require the CUDA toolkit to be installed) anytime a suitable PTX code is available but a suitable SASS code is not. The definition of a "suitable PTX" code is one which is numerically equal to or lower than the GPU architecture being targeted for running the code. To pick an example, specifying arch=compute_30,code=compute_30 would tell nvcc to embed cc3.0 PTX code in the executable. This PTX code could be used to generate SASS code for any future architecture that the GPU driver supports. Currently this would include architectures like Pascal, Volta, Turing, etc. assuming the GPU driver supports those architectures.

One advantage of including multiple virtual architectures (i.e. multiple versions of PTX), then, is that you have executable compatibility with a wider variety of target GPU devices (although some devices may trigger a JIT-compile to create the necessary SASS).

One advantage of including multiple "real GPU targets" (i.e. multiple SASS versions) is that you can avoid the JIT-compile step, when one of those target devices is present.
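Combining both points, a build line for a "fat binary" might look like the following sketch (file names hypothetical): SASS for two real architectures, plus PTX for forward compatibility.

```shell
# Hypothetical build line for a "fat binary": SASS for sm_20 and
# sm_21 (no JIT needed on those GPUs), plus cc2.0 PTX that the
# driver can JIT-compile for any newer architecture.
nvcc app.cu -o app \
     -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_20,code=sm_21 \
     -gencode arch=compute_20,code=compute_20
```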

If you specify a bad set of options, it's possible to create an executable that won't run (correctly) on a particular GPU.

One possible disadvantage of specifying a lot of these options is code size bloat. Another possible disadvantage is compile time, which will generally be longer as you specify more options.

It's also possible to create executables that contain no PTX, which may be of interest to those trying to obscure their IP.

Creating PTX suitable for JIT should be done by specifying a virtual architecture for the code switch.
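For example (file names hypothetical), the following embeds only cc3.0 PTX and no SASS at all, leaving machine-code generation entirely to the driver's JIT compiler:

```shell
# Hypothetical: code=compute_30 (a virtual architecture) embeds
# cc3.0 PTX rather than SASS; any cc3.0-or-newer GPU can run it
# after a JIT-compile step at load time.
nvcc app.cu -o app -gencode arch=compute_30,code=compute_30
```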

Robert Crovella
    Apologies for the late reply, and thanks for yours. I understand the purpose of having PTX to JIT compile for many real architectures, but is it necessary to include all such older PTX architecturesk, or just the minimum specification PTX? For example, if I wanted the code to be run on as many GPU's as possible, would I include, say, -arch compute_11, 12 13 ... 30, 35, or simply include -arch compute_11? Best, James. – James Paul Turner Jul 13 '13 at 11:12
  • You could specify just `-arch compute_11` and you would generate cc 1.1 PTX code. All GPUs now and in the future should be able to JIT-compile from this version of PTX to some useful machine code (with the exception of cc 1.0 devices). However, by specifying additional PTX versions, you may, by adding a "newer" PTX, provide an opportunity to take better advantage of a newer architecture, and thus your code might run faster on, say, a cc3.0 device, if you also specified `compute_30`. It's a tradeoff between code size/compile time and best perf. Your mileage may vary. – Robert Crovella Jul 13 '13 at 19:49
  • Unfortunately my comment above was not clear on how to generate PTX. Please refer to my answer which I have edited to reflect how to generate PTX suitable for JIT. – Robert Crovella Mar 03 '14 at 07:57
  • Sometimes I see arch=compute_xx followed by code=compute_xx. What does it mean? –  Nov 10 '15 at 11:26
  • It means you are requesting nvcc to embed that version of *PTX* (instead of that version of *SASS*) in the executable object. – Robert Crovella Nov 10 '15 at 13:53

The purpose of multiple -arch flags is to use the __CUDA_ARCH__ macro for conditional compilation (i.e., using #ifdef) of differently-optimized code paths.

See here: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-identification-macro
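A minimal sketch of that pattern (the kernel name and the cc3.5 threshold are illustrative, not from the answer):

```cuda
// Hypothetical kernel: the preprocessor selects a code path per
// compiled architecture, since __CUDA_ARCH__ is defined (e.g. as
// 350 for compute_35) only during device-code compilation.
__global__ void scale(float *data, float factor)
{
#if __CUDA_ARCH__ >= 350
    // cc3.5+ path: use __ldg() for a read-only cached load.
    data[threadIdx.x] = __ldg(&data[threadIdx.x]) * factor;
#else
    // Fallback for older compute capabilities (and the host pass,
    // where __CUDA_ARCH__ is undefined and evaluates to 0).
    data[threadIdx.x] *= factor;
#endif
}
```

Compiling with several -gencode pairs then bakes a differently-optimized variant of the kernel into each embedded PTX/SASS version.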

Aleksandr Dubinsky