
I am on Ubuntu 16.04. Suppose I am given a random libtestcuda.so file. Is there any way I can check which CUDA compute capability the library was compiled for?

I have tried

ll libtestcuda.so

It doesn't show much.

I want to know this because if I compile my code with

-gencode arch=compute_30,code=sm_30;

It compiles and runs fine for a small CUDA program I wrote, but when I run deviceQuery on my GPU it actually shows compute capability 3.5, so I am curious whether this code is executed on the 3.0 or the 3.5 architecture.

If I compile and run it with

-gencode arch=compute_20,code=sm_20;

or

-gencode arch=compute_50,code=sm_50;

it fails as expected.

If I compile and run it with

-gencode arch=compute_35,code=sm_35;

it runs fine as expected.
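For reference, the full compile command looks roughly like this (test.cu stands in for my actual source file):

nvcc -gencode arch=compute_30,code=sm_30 test.cu -o test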

  • The [CUDA binary utilities](http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#axzz4MjwJknqL) can help with this. You can figure out what PTX and SASS versions are contained within a compiled library. It's also not clear whether you understand general concepts around PTX, SASS, and what that means for compatibility with a particular GPU. That topic is covered in a number of SO questions as well as the [nvcc documentation](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#axzz4MjwJknqL). – Robert Crovella Oct 11 '16 at 16:25
  • @RobertCrovella I tried out cuobjdump, but when I compile the code with both 3.0 and 3.5, cuobjdump shows both, so it's still unclear to me which architecture this code runs on once it is executed. If you could point me to which SO questions have this answered I will happily delete this post. Thanks! – user3667089 Oct 11 '16 at 16:37
  • I am actually surprised that sm_20 code fails on sm_35 machine. I have run quite big pieces of sm_20 code on sm_35 machine without a problem, albeit it was not very efficient to do so. – CygnusX1 Oct 11 '16 at 17:09
  • 2
    Your question (at least the title) is how to find out which compute capability a library is compiled with. I think cuobjdump will tell you that. The CUDA runtime will select: 1. an exact SASS match for the architecture, if one exists. 2. a compatible SASS match for the architecture, if one exists. 2. The highest available PTX which is less than or equal to the compute capbility of the device, JIT-compiled to an appropriate SASS. – Robert Crovella Oct 11 '16 at 17:11
  • Also, note that a lib can include *multiple* versions of kernel for different machines, as well as machine-independent PTX code, compiled on-the-fly. – CygnusX1 Oct 11 '16 at 17:11
  • The only way sm_20 code will run on an sm_35 device is if PTX is available. The particular compiler switches selected in the question produce SASS but no PTX. For sm_20, that would fail on an sm_35 device, and it has always been that way. If you can run "sm_20 code" on an sm_35 device, it means you have PTX included. – Robert Crovella Oct 11 '16 at 17:12
  • @RobertCrovella Thanks Robert, if you will write up answer I will accept it – user3667089 Oct 11 '16 at 17:14
  • Related questions are [here](http://stackoverflow.com/questions/17599189/what-is-the-purpose-of-using-multiple-arch-flags-in-nvidias-nvcc-compiler) and [here](http://stackoverflow.com/questions/35656294/cuda-how-to-use-arch-and-code-and-sm-vs-compute). – Robert Crovella Oct 11 '16 at 17:16

1 Answer


For general background on the use of flags to tell nvcc which architectures to compile for, I would suggest this question and this question (see the links in the comments above), as well as the nvcc documentation.

After discussion in the comments, there appear to be two questions. (Although these questions have libraries in view, most of the discussion applies equally to executable objects as well.)

How can I discover which architectures (PTX, SASS) a particular library has been compiled for?

This can be discovered using the CUDA binary utilities, e.g. cuobjdump. In particular, the -ptx switch will list all PTX objects contained, and the -sass switch will list all SASS objects contained. A library that is compiled for the "real architecture" sm_30, for example, will contain sm_30 SASS code, and this will be evident in the cuobjdump output. A library that is compiled for the "virtual architecture" compute_50, for example, will contain compute_50 PTX code, and this will be evident in the cuobjdump output. Note that a library (or any CUDA fatbin object) may contain code for multiple architectures, both PTX and SASS, or multiple SASS versions.
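For example, something along these lines (a sketch; the exact output format varies by CUDA version, and libtestcuda.so stands in for any library of interest):

# list the SASS (machine code) entries and their architectures
cuobjdump -sass libtestcuda.so | grep arch

# list the PTX entries and their architectures
cuobjdump -ptx libtestcuda.so | grep arch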

If a library contains multiple architectures, how do I know what will actually execute on the device?

At application launch, the CUDA runtime inspects the binary object for the application, and will use, roughly speaking, the following heuristic to determine what will run on the GPU:

  1. If an exact SASS match exists in the binary object, then the runtime will use that for the GPU. This means, for example, that if your object (executable, or library) contains an entry for sm_35 SASS code, and you are running on an sm_35 (i.e. a compute capability 3.5) GPU, then the CUDA runtime will select that.

  2. If item 1 is not satisfied, the CUDA runtime will next choose a "compatible" SASS entry, if one exists. This is not well defined/specified AFAIK, but in general an sm_30 SASS object should be usable on any sm_3x device, and likewise for sm_20 SASS on an sm_2x device, or sm_50 SASS on any sm_5x device. For other questions (e.g. will sm_32 SASS be usable directly on an sm_35 device?) I don't have a complete table that specifies compatibility. Specific cases can be tested using the methodology exposed in the question: build an object containing only a particular SASS type, and see if it will run on the intended GPU.

  3. If items 1 and 2 are not satisfied, the CUDA runtime will search for a compatible PTX entry. For a given GPU type of compute capability x.y, a compatible PTX entry is defined to be PTX for architecture z.w, where z.w is less than or equal to x.y. cc2.0 PTX is compatible with a cc3.5 device, for example. cc5.0 PTX is not compatible with a cc3.5 device. Once the highest numbered PTX entry is found that meets this criterion, it will be JIT-compiled by the GPU driver to produce a necessary SASS object, on-the-fly, at runtime.

If none of items 1, 2, or 3 is satisfied, any and all calls into the CUDA runtime library will return a runtime error (no binary for GPU, or similar).
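As a sketch of the two extremes (test.cu is illustrative):

# SASS only, no PTX: served by item 1 or item 2; fails with a
# "no binary for GPU" style error on devices with no compatible SASS
nvcc -gencode arch=compute_30,code=sm_30 test.cu -o test

# PTX only, no SASS: always falls through to item 3; the driver
# JIT-compiles the PTX on any device of compute capability >= 3.0
nvcc -gencode arch=compute_30,code=compute_30 test.cu -o test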

I've glossed over a number of concepts related to "real" and "virtual" architectures. This is a complex topic and I recommend reading the nvcc documentation linked above for background. For example, it is not correct that any given compute capability has the same numerical architectures available for both real (SASS) and virtual (PTX) code. For cc2.0, for example, both the real (sm_20) and virtual (compute_20) architectures exist. For cc2.1, however, only the real architecture (sm_21) exists; the virtual architecture (compute_21) does not exist, and the compute_20 architecture should be specified instead. This will be readily evident if you attempt to compile for compute_21, for example.
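For example (illustrative commands):

# valid: cc2.1 SASS is generated from the compute_20 virtual architecture
nvcc -gencode arch=compute_20,code=sm_21 test.cu -o test

# invalid: there is no compute_21 virtual architecture; nvcc rejects this
nvcc -gencode arch=compute_21,code=sm_21 test.cu -o test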

One might also ask, given all this: what architectures should I compile for?

This question has been answered in many previous SO questions, and is somewhat a matter of opinion. As a useful reference point, I suggest following the strategy used by the build projects for the CUDA sample codes, sketched below.
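As a sketch, the CUDA 8 era sample makefiles embed one SASS entry per supported real architecture, plus PTX for the newest virtual architecture to cover future GPUs, roughly like this (the exact architecture list depends on the toolkit version):

nvcc -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_60,code=compute_60 \
     test.cu -o test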
