I have the following minimal .cu file:

#include <cuda_runtime_api.h>
#include <cublas_v2.h>
#include <cstdio>

__global__ void test()
{
    cublasHandle_t handle = nullptr;
    cublasCreate(&handle);
}

int main(int, char**)
{
    void * data = nullptr;
    auto err = cudaMalloc(&data, 256);
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}

As you can see, the test kernel is never even called, yet cudaMalloc returns error 30 (unknown error). The file is compiled with separable compilation (required for dynamic parallelism) for compute capability 5.2 (I also tried 3.5 and 5.0, which changed nothing). Removing the call to cublasCreate makes cudaMalloc return 0 (no error).

What could be the cause, and how can I fix it? I need to call CUBLAS from a kernel using dynamic parallelism, which is theoretically supported, so "just remove the call" is not an option.

Here is the corresponding CMakeLists.txt:

cmake_minimum_required(VERSION 3.3 FATAL_ERROR)
project(CublasError)

find_package(CUDA REQUIRED)

set(CUDA_SEPARABLE_COMPILATION ON)
set(CUDA_NVCC_FLAGS --gpu-architecture=compute_52 -Xptxas=-v)
list(APPEND CUDA_NVCC_FLAGS_DEBUG -G -keep -O0)

cuda_add_executable(${PROJECT_NAME} main.cu)
cuda_add_cublas_to_target(${PROJECT_NAME})

# FindCUDA.cmake does not automatically add (or find) cudadevrt which is required when separable compilation is on
if(CUDA_SEPARABLE_COMPILATION)
    get_filename_component(CUDA_LIB_PATH ${CUDA_CUDART_LIBRARY} DIRECTORY)
    find_library(CUDA_cudadevrt_LIBRARY cudadevrt PATHS ${CUDA_LIB_PATH})
    target_link_libraries(${PROJECT_NAME} ${CUDA_cudadevrt_LIBRARY})
endif()

Here is a roughly equivalent set of compile commands (they at least produce the same result):

nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas.lib cublas_device.lib
  • Unless I am mistaken, it would be impossible to compile that source you have posted into an executable and run it. Could you edit your question to explain how you compile and run that source to reproduce the problem? – talonmies Sep 19 '16 at 09:27
  • Why would it be impossible? It compiles just fine using `nvcc` – Joe Sep 19 '16 at 09:28
  • Please edit the exact compilation command, and compiler version you are using into your question. AFAIK the separate compilation trajectory cannot have `main` and the separately compiled kernel code in the same file – talonmies Sep 19 '16 at 09:34
  • I added the `CMakeLists.txt` that I use. Dynamic parallelism in general works fine with this setup. – Joe Sep 19 '16 at 09:35
  • That isn't helpful without the actual compiler statements used to build this. If this is compiling just fine, it should be possible to just write down the single nvcc line which compiles the executable – talonmies Sep 19 '16 at 09:39
  • It's not that easy as it turns into 3-4 calls (not all of which are displayed when verbose is turned on). I'll see what I can do. – Joe Sep 19 '16 at 09:42
  • @talonmies I added (approximate) compile commands which at least have the same result. I'm not 100% certain they are exactly equivalent, but they should be close enough. – Joe Sep 19 '16 at 11:38

1 Answer

It turns out that nvcc -dlink does not report missing dependencies; it silently continues without emitting any errors. The fix is to link cublas_device.lib during both device linking and host linking, so the compile commands should look as follows:

nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas.lib cublas_device.lib

Also, nvcc -dlink is order-dependent, but in the opposite way from what one is used to with ld: -lcublas_device must appear before the object files that require it.

On the CMake side of things, cuda_add_cublas_to_target fails to add cublas_device.lib to the device link command and only adds it to the host link command. As a workaround, add the dependency explicitly to the list of nvcc flags:

list(APPEND CUDA_NVCC_FLAGS -lcublas_device)
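
As a rough sketch of where that line might sit in the CMakeLists.txt from the question: this assumes FindCUDA reads CUDA_NVCC_FLAGS at the point where cuda_add_executable is called, so the flag is appended before that call. The guard on CUDA_SEPARABLE_COMPILATION is an illustrative addition, not something the module requires.

```cmake
set(CUDA_SEPARABLE_COMPILATION ON)
set(CUDA_NVCC_FLAGS --gpu-architecture=compute_52 -Xptxas=-v)

# Workaround: cuda_add_cublas_to_target only adds cublas_device to the host
# link, so pass it to nvcc directly to make it reach the device link step too.
# Assumption: this must run before cuda_add_executable to take effect.
if(CUDA_SEPARABLE_COMPILATION)
    list(APPEND CUDA_NVCC_FLAGS -lcublas_device)
endif()

cuda_add_executable(${PROJECT_NAME} main.cu)
cuda_add_cublas_to_target(${PROJECT_NAME})
```

The cudadevrt lookup from the question stays as it is after these lines.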
  • With the Linux toolchain, your instructions do fail with a missing dependency at the device link phase. I think you actually only need a device compile and global link (so no device link) to make a working executable. My confusion over whether this would work comes from the device compile phase. I am 99% sure that in earlier toolchain versions, that wouldn't emit any device code, so `main` would be missing, and then you should have had conflicts trying to compile main without duplicates. But device linking is subtle and it is easy to get wrong, as you have seen here – talonmies Sep 19 '16 at 12:00
  • Yes, it is possible to do the last two steps in a single call, but CMake creates two separate calls. It should be possible on Linux too. – Joe Sep 19 '16 at 12:01