
This code behaves differently depending on the compute capability it is compiled for:

#include <cuda.h>
#include <stdio.h>

__managed__ int m;

int main() {
    printf("hi 1\n");
    m = -123;
    printf("hi 2\n");
}

Device with compute capability 6.0:

$ nvcc main.cu -gencode arch=compute_60,code=sm_60 -rdc=true && ./a.out
hi 1
hi 2

Device with compute capability 7.0:

$ nvcc main.cu -gencode arch=compute_60,code=sm_60 -rdc=true && ./a.out
hi 1
Segmentation fault

Device with compute capability 7.0:

$ nvcc main.cu -gencode arch=compute_70,code=sm_70 -rdc=true && ./a.out
hi 1
hi 2

Why do I get a segmentation fault when I build for compute capability 6.0 and run on a GPU with compute capability 7.0?

  • Because it does not support it? Not all features are supported by all compute capabilities. In fact, this is the only reason why knowing your compute capability is important! – Ander Biguri Oct 25 '19 at 09:04
  • @AnderBiguri According to the CUDA docs, unified memory is supported starting with compute capability 6.0. I mentioned that it actually works on a GPU with compute capability 6.0. The problem is that it doesn't work on a GPU which supports compute capability 7.0 when I compile the binary for compute capability 6.0. –  Oct 25 '19 at 09:26
  • Ah, I misunderstood. But notice that when you compile for a single compute capability, only code for that particular `cc` is generated. If you want to use that code on 7.0 then you need to compile for 7.0. You can compile for several at once, e.g.: `nvcc main.cu -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -rdc=true && ./a.out` – Ander Biguri Oct 25 '19 at 09:28
  • @AnderBiguri Yes, that is actually a good way to avoid this problem. But why does it appear at all? I thought that devices with a higher compute capability support binaries built for a lower compute capability. Isn't that correct? –  Oct 25 '19 at 09:30
  • 1
  • I am not sure about that. In my tool I generally get "no kernel image is available to execute on the device" if I run CUDA code compiled for an older `cc`, even just for the previous version. You may not be getting that because you do not have a kernel – Ander Biguri Oct 25 '19 at 09:33
  • @AnderBiguri I tried putting a kernel in, and the host code runs perfectly, but the kernel itself is not executed (all host code before and after the kernel call works, and cudaDeviceSynchronize returns cudaSuccess). It seems that the CUDA installation in my environment is silent about using an incorrect compute capability. –  Oct 25 '19 at 09:53
  • 1
  • Every CUDA installation is silent about using an incorrect compute capability unless you [explicitly check for errors](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). – tera Oct 25 '19 at 10:03
  • @tera I explicitly check cudaDeviceSynchronize after the kernel call. It returns cudaSuccess, but none of the `printf` calls inside the kernel show up when I use the wrong compute capability. –  Oct 25 '19 at 10:06
  • 2
  • Read the linked answer closely, because error checking of kernel launches is a bit quirky. You need to check the return value of `cudaPeekAtLastError()` or `cudaGetLastError()` before calling `cudaDeviceSynchronize()` to check for launch errors like this, otherwise the error gets reset before you notice it. – tera Oct 25 '19 at 10:22
  • @tera Thank you, I didn't know that! I checked it and it works as it should. –  Oct 25 '19 at 10:28

1 Answer


According to the discussion in the comments, I have this problem because the compute capability I build for must match the one my GPU has (but see tera's comment below about embedding PTX). The reason I didn't receive any errors is that I have to check for them manually ([What is the canonical way to check for errors using the CUDA runtime API?](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api)).

If I extend the piece of code like this:

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>  // for abort()

__managed__ int m;

__global__ void foo() {
    printf("from foo: %d %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    foo<<<2,2>>>();
    printf("001\n");
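    // Check for kernel launch errors (e.g. no kernel image for this GPU) right
    // after the launch; cudaDeviceSynchronize alone may still return cudaSuccess.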
    if (cudaPeekAtLastError() != cudaSuccess) abort();
    printf("002\n");
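    // cudaDeviceSynchronize reports errors that occur during kernel execution.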
    if (cudaDeviceSynchronize() != cudaSuccess) abort();
    printf("hi 1\n");
    m = -123;
    printf("hi 2\n");
}

Device with compute capability 7.0:

$ nvcc main.cu -gencode arch=compute_70,code=sm_70 -rdc=true && ./a.out
001
002
from foo: 0 0
from foo: 0 1
from foo: 1 0
from foo: 1 1
hi 1
hi 2

Device with compute capability 7.0:

$ nvcc main.cu -gencode arch=compute_60,code=sm_60 -rdc=true && ./a.out
001
Aborted
  • You don't have to compile for the exact compute capability of the GPU running the code. It is sufficient to include PTX code that can be compiled just-in-time on program loading (which is the default if you don't specify anything). But if you disable that by explicitly specifying to produce a binary for a real architecture, then that has to match. – tera Nov 05 '19 at 00:08
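
Following up on tera's comment, one way to avoid the mismatch is to build a fat binary. The following is only a sketch, not part of the original answer; the particular set of -gencode flags is an assumption matching the two GPUs discussed above:

$ nvcc main.cu -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_60,code=compute_60 -rdc=true && ./a.out

The code=sm_60 and code=sm_70 entries embed machine code (SASS) for those exact architectures, while code=compute_60 embeds PTX that a newer GPU, for example one with compute capability 7.0, can JIT-compile when the program loads.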