
I have a relatively simple CUDA kernel and I immediately call the kernel in the main method of my program in the following way:

__global__ void block() {
    for (int i = 0; i < 20; i++) {
        printf("a");
    }
}

int main(int argc, char** argv) {
    block<<<1, 1>>>();
    cudaError_t cudaerr = cudaDeviceSynchronize();
    printf("Kernel executed!\n");
    if (cudaerr != cudaSuccess)
        printf("kernel launch failed with error \"%s\".\n",
                cudaGetErrorString(cudaerr));
}

This program is compiled and launched from Visual Studio 2015. The project was generated with CMake using the following CMakeLists.txt file:

project (Comparison)

cmake_minimum_required (VERSION 2.6)
find_package(CUDA REQUIRED)

set(
    CUDA_NVCC_FLAGS
    ${CUDA_NVCC_FLAGS};
   -arch=compute_30  -code=sm_30 -g -G
    )


cuda_add_executable(Comparison kernel.cu)

I expected this program to print 20 a's to the console and then print "Kernel executed!". However, the a's are never printed, and the line "Kernel executed!" shows up immediately, even if I replace the for loop with a while(true) loop.

Even when I run the code with the Nsight debugger attached and a breakpoint in the kernel's for loop, nothing happens, which leads me to believe that the kernel is never actually launched. Does anyone know how to make this kernel behave as expected?

TheDutchDevil
  • What happens if you add a call to cudaDeviceReset at the end of your main()? – talonmies Mar 27 '17 at 19:23
  • Nothing, it just returns a cudasuccess code without doing anything. – TheDutchDevil Mar 27 '17 at 19:34
  • What GPU is in your system? What GPU driver version are you using? Which CUDA version? – Robert Crovella Mar 28 '17 at 01:11
  • The code as you have published it, works perfectly for me. If the `cudaDeviceReset()` at the end of the main makes no difference, then we can rule out buffer flushing issues, in which case the problem lies in something you haven't described, or you have a CUDA installation issue of some sort. Very hard to solve a problem which can't be reproduced.... – talonmies Mar 28 '17 at 08:35
  • I fear it has something to do with CMake, as I tried the same code in a project not generated with CMake and it works there. However, I'm not enough of a CMake expert to be able to analyze what goes wrong. – TheDutchDevil Mar 28 '17 at 13:33
  • I find that very hard to believe. What happens if you just compile that code from the command line using nvcc and then run it? – talonmies Mar 28 '17 at 15:06
  • @TheDutchDevil: (1) Maybe your output is being redirected somewhere and you just don't see it? (2) Remote chance that this might be the problem, but - are you including `` properly? (3) Shout out to a fellow GPU programmer in the Netherlands :-) ... come to CWI some time. – einpoklum Mar 28 '17 at 22:30
  • Since you don't seem to want to answer my questions, I would suggest implementing proper error checking after the kernel call. What you have shown so far is incomplete. An "invalid device function"/NO_BINARY_FOR_GPU error on the kernel launch would not be caught by the error checking you have shown. – Robert Crovella Mar 30 '17 at 02:26
  • @RobertCrovella, Sorry, I've been away from my desktop for most of the past week. The GPU I'm using is a GTX 970, the driver version is 376.53 and my CUDA version is 8.0.60. Furthermore, compiling the code from the command line using nvcc does make the code work as expected, as it compiles and then prints 10 a's. – TheDutchDevil Apr 04 '17 at 18:21
  • This is a problem then: `-arch=compute_30 -code=sm_30` as it does not match your GTX 970, and also it does not force the inclusion of PTX, which would allow for forward-compatibility of the code (JIT compilation). If you change that to match your GPU (e.g. just `-arch=sm_52`, or `-arch=compute_52 -code=sm_52`) then it should start working for you. And if you used [proper CUDA error checking](http://stackoverflow.com/questions/14038589/) after your kernel call, you would at least have gotten an indication of what the problem is. – Robert Crovella Apr 04 '17 at 18:28
  • @RobertCrovella, Yep, that did the trick, thanks so much for the resources on proper error checking in CUDA and correct answer! If you can post this as an actual answer I'll accept it. :) – TheDutchDevil Apr 04 '17 at 18:55

1 Answer

The kernel did not run when compiled with the given CMakeLists.txt file because of these flags:

-arch=compute_30  -code=sm_30  

combined with the GPU that was being used (GTX 970, a cc 5.2 GPU).

Those flags specify the generation of cc 3.0 SASS code only, and such code is not compatible with a cc 5.2 device. The fix would be to modify the flags to something like:

-arch=sm_30

or

-arch=sm_52

or

-arch=compute_52 -code=sm_52

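I would recommend the first or second approach, as it will include PTX support for future devices. The single-argument form `-arch=sm_XY` is shorthand for `-arch=compute_XY -code=sm_XY,compute_XY`, so it embeds PTX alongside the SASS, and the driver can JIT-compile that PTX for a newer GPU. That is why even `-arch=sm_30` works on a cc 5.2 device, whereas the third form embeds cc 5.2 SASS only.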
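If you are unsure which compute capability your GPU has, you can ask the CUDA runtime. The following is a minimal sketch, not part of the original project: it assumes the GPU of interest is device 0, and `archFlagFor` is a hypothetical helper that only formats the flag string.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): formats the nvcc flag that
// matches a given compute capability, e.g. (5, 2) -> "-arch=sm_52".
std::string archFlagFor(int major, int minor) {
    return "-arch=sm_" + std::to_string(major) + std::to_string(minor);
}

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    // Ask the runtime for the device properties, including major/minor
    // compute capability.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("%s: compute capability %d.%d, matching flag: %s\n",
           prop.name, prop.major, prop.minor,
           archFlagFor(prop.major, prop.minor).c_str());
    return 0;
}
```

On the GTX 970 from the question this should report compute capability 5.2. Whether your installed toolkit can still target a given architecture depends on the CUDA version, so treat the printed flag as a starting point.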

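The kernel error was not evident because the error checking after the kernel launch was incomplete: checking only the return value of cudaDeviceSynchronize() does not catch a launch failure such as "invalid device function". Refer to the canonical question and answer on proper CUDA error checking (http://stackoverflow.com/questions/14038589/).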
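As a rough illustration of more complete checking, here is a minimal sketch based on the kernel from the question. `checkCuda` is a hypothetical helper introduced only for this example. The key addition is the call to `cudaGetLastError()` right after the launch, which is where a launch failure is reported.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): prints a message when a CUDA
// call failed and returns true only on success.
bool checkCuda(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void block() {
    for (int i = 0; i < 20; i++) {
        printf("a");
    }
}

int main() {
    block<<<1, 1>>>();

    // Launch-time errors (for example, no binary for this GPU) are
    // reported by cudaGetLastError() right after the launch.
    if (!checkCuda(cudaGetLastError(), "kernel launch")) return 1;

    // Errors that occur while the kernel runs are reported when the
    // host synchronizes with the device.
    if (!checkCuda(cudaDeviceSynchronize(), "kernel execution")) return 1;

    printf("\nKernel executed!\n");
    return 0;
}
```

With the mismatched flags from the question, the first check is where a failure would be expected to show up, instead of the program silently printing "Kernel executed!".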

Robert Crovella