
I have a library with some __host__ __device__ functions. I also have an #ifdef __CUDACC__ gadget which makes sure that a regular C++ compiler doesn't see the __host__ __device__ specifiers and can thus compile those functions.
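
(For illustration only, the guard can equivalently be centralized in a small convenience macro - the name MY_LIB_HD below is made up, not something from my actual code:)

#ifdef __CUDACC__
#define MY_LIB_HD __host__ __device__
#else
#define MY_LIB_HD
#endif

MY_LIB_HD void foo(int* x, int* y);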

Now, I want to put the compiled host-side versions of my library's functions into a plain-vanilla C++ static library file (.a on Linux) - and I would even like that library to be buildable when CUDA is unavailable; and I want the compiled device-side versions in a separate static library.

I am almost there (I think), but am stuck with a linking error. Here are toy sources for such a library, a test program (which calls both the device-side and the host-side version of a function) and the build commands I use.

What am I getting wrong?


  • my_lib.hpp (Library header):
#ifdef __CUDACC__
__host__ __device__
#endif
void foo(int*x, int* y);
int bar();
  • my_lib.cu (Library source):
#include "my_lib.hpp"

#ifdef __CUDACC__
__host__ __device__
#endif
void foo(int*x, int* y)  { *x = *y; }

int bar() { return 5; }
  • main.cu (test program):
#include "my_lib.hpp"

__global__ void my_kernel() {
  int z { 78 };
  int w { 90 };
  foo(&z,&w);
}

int main() {
  int z { 123 };
  int w { 456 };
  foo(&z,&w);
  my_kernel<<<1,1>>>();
  cudaDeviceSynchronize();
  cudaDeviceReset();
}

My build commands:

c++ -c -x c++ -o my_lib-noncuda.o my_lib.cu
ar qc my_lib-noncuda.a my_lib-noncuda.o
ranlib my_lib-noncuda.a
nvcc -dc -o my_lib-cuda.o my_lib.cu
ar qc my_lib-cuda.a my_lib-cuda.o
ranlib my_lib-cuda.a
nvcc -dc -o main.rdc.o main.cu
nvcc -dlink -o main.o main.rdc.o my_lib-cuda.a
c++ -o main main.o my_lib-noncuda.a -lcudart

And the errors I get - on the last, linking, command:

/usr/bin/ld: main.o: in function `__cudaRegisterLinkedBinary_39_tmpxft_00003f88_00000000_6_main_cpp1_ii_e7ab3416':
link.stub:(.text+0x5a): undefined reference to `__fatbinwrap_39_tmpxft_00003f88_00000000_6_main_cpp1_ii_e7ab3416'
/usr/bin/ld: main.o: in function `__cudaRegisterLinkedBinary_41_tmpxft_00003f69_00000000_6_my_lib_cpp1_ii_ab44b3f6':
link.stub:(.text+0xaa): undefined reference to `__fatbinwrap_41_tmpxft_00003f69_00000000_6_my_lib_cpp1_ii_ab44b3f6'
collect2: error: ld returned 1 exit status

Notes:

  • I use CUDA 10.1 and g++ 9.2.1 on Devuan GNU/Linux.
  • This is a "follow-up" to a deleted question; @talonmies commented that I had better show exactly what I did, and that changed the question somewhat.
  • Somewhat-related question: this one.
einpoklum
  • @talonmies: Now I'm showing you what I'm doing exactly. – einpoklum Dec 16 '19 at 15:32
  • That example is broken beyond the errors you describe. There should be no main emitted anywhere in that build sequence either and there should be a main not found error as well, unless I read this wrongly – talonmies Dec 16 '19 at 22:05
  • @talonmies: Well, the errors are what they are, although I see what you mean about `main()` and `-dc`. The thing is, if I drop the `-dc`, I get a compilation error: `nvcc -o main.o main.cu` results in `Unresolved extern function '_Z3fooPiS_'`. – einpoklum Dec 16 '19 at 22:27
  • I'm confused about what you want to do. I thought you want to link a "plain-vanilla C++ static library file" to a C++ program. But as I see it you would like to use a C++ compiler in the last step for linking only and still have all the CUDA stuff present. Maybe this is what you are looking for: https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/ - Section "Advanced Usage: Using a Different Linker" – BlameTheBits Dec 17 '19 at 12:22
  • @Shadow: Not quite. I want to have a "split library": One `.a` file of host-side versions of functions, which I could use in regular C++ linking regardless of CUDA; and another `.a` file, of device-side versions of functions, which I can link with my CUDA code that calls these functions from within a kernel. The test program exemplifies the second kind of use. – einpoklum Dec 17 '19 at 12:30

2 Answers


Let us modify your example into what I think your actual usage case would be. The modification places main() in a .cpp file, to be compiled by g++, and the CUDA code in a separate .cu file, to be compiled by nvcc. This is important for making your two-library setup work, and it is justifiable, because "a main() that contains CUDA kernels requiring separate compilation and linkage" is a peculiar corner case of the nvcc compilation model.

The restructured code:

main.cu:

include "my_lib.hpp"

__global__ void my_kernel() {
  int z { 78 };
  int w { 90 };
  foo(&z,&w);
}

int cudamain()
{
  my_kernel<<<1,1>>>();
  return 0;
}

main.cpp:

#include <cuda_runtime_api.h>
#include "my_lib.hpp"

extern int cudamain();

int main() {
  int z { 123 };
  int w { 456 };
  foo(&z,&w);
  cudamain();
  cudaDeviceSynchronize();
  cudaDeviceReset();
}

All other files remain as in the question.

The commands required to build the program are now:

c++ -c -x c++ -o my_lib-noncuda.o my_lib.cu
ar qc my_lib-noncuda.a my_lib-noncuda.o
ranlib my_lib-noncuda.a

nvcc -std=c++11 -dc -o my_lib-cuda.rdc.o my_lib.cu
ar qc my_lib-cuda.a my_lib-cuda.rdc.o
ranlib my_lib-cuda.a

# Until this line - identical to what you have tried in your question

nvcc -std=c++11 -c -rdc=true main.cu -o main.cu.o 
nvcc -dlink -o main.o main.cu.o my_lib-cuda.a

c++ -std=c++11 -o main main.cpp main.o main.cu.o -I/path/to/cuda/include \
    -L/path/to/cuda/lib64 my_lib-cuda.a my_lib-noncuda.a -lcudart -lcudadevrt

The important thing to keep in mind is that there are host-side components which need to be carried forward in the build. Thus you must pass the nvcc output of the CUDA host code (main.cu.o) to the main linkage, and you must also add your CUDA-side library to the main linkage; otherwise the host-side runtime API support for your code will be missing. Note also that you must link the device runtime library (-lcudadevrt) to make this work.

talonmies
  • Linking against `my_lib-noncuda.a` is not necessary here, or is it? – BlameTheBits Dec 17 '19 at 15:55
  • Yes. The main calls the host version of the host/device function – talonmies Dec 17 '19 at 15:59
  • I just copied the exact code and compilation commands and only removed the link against `my_lib-noncuda.a`. I got no error. – BlameTheBits Dec 17 '19 at 16:08
  • I didn't look at the host side of things to be honest, just the missing dependencies and lack of main – talonmies Dec 17 '19 at 17:16
  • @talonmies; There is still duplication with the libraries - which is a problem in my own build attempt in the question. Like @Shadow suggests - `my_lib-cuda.a` has both the host-side and device-side versions of `foo()`, as well as the host-side version of `bar()`. And if you switch the linking order in your last build command - first the `-cuda.a`, then the `-noncuda.a` (which is quite likely to happen inadvertently with build system generators or just naively) - you get errors: – einpoklum Dec 17 '19 at 21:35
  • `/usr/bin/ld: my_lib-cuda.a(my_lib-cuda.rdc.o): in function \`foo(int*, int*)': tmpxft_00001859_00000000-5_my_lib.cudafe1.cpp:(.text+0x16): multiple definition of \`foo(int*, int*)'; my_lib-noncuda.a(my_lib-noncuda.o):my_lib.cu:(.text+0x0): first defined here /usr/bin/ld: my_lib-cuda.a(my_lib-cuda.rdc.o): in function \`bar()': tmpxft_00001859_00000000-5_my_lib.cudafe1.cpp:(.text+0x31): multiple definition of \`bar()'; my_lib-noncuda.a(my_lib-noncuda.o):my_lib.cu:(.text+0x1b): first defined here collect2: error: ld returned 1 exit status` – einpoklum Dec 17 '19 at 21:35
  • @talonmies; Also - you're right about my intended use, so the modification is basically acceptable :-) – einpoklum Dec 17 '19 at 21:41
  • Your TLDR comment edit was incorrect and I have removed it. The fix isn't caused by the separation of main and cuda; it is caused by the additional objects that are required to be linked. If you split the main and compile and link as you were doing, the error will still persist. My only point was that you would need to do something very different if you want kernels to link in the same translation unit as your main. And I [don't get duplicate symbol errors using this exact recipe on this exact code](https://pastebin.com/f2JUUtRp) – talonmies Dec 18 '19 at 11:55

Here is how you could create two libraries, one containing only CUDA device functions and the other containing only host functions. You could omit the "complicated" #if and the #ifndef guard, but then you would also have the "non-CUDA" code in your library my_lib-cuda.a.

For the other issues, see @talonmies' community wiki answer, or refer to the link I already posted in the comments: https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/ - Section "Advanced Usage: Using a Different Linker".

my_lib.cu

#include "my_lib.hpp"

#ifdef __CUDA_ARCH__
__device__
#endif
#if (defined __CUDA_ARCH__) || (not defined __CUDACC__)
void foo(int*x, int* y)  { *x = *y; }
#endif

#ifndef __CUDACC__
int bar() { return 5; }
#endif

The build process of the libraries stays the same (I only changed ar qc to ar rc, so that existing members are replaced rather than appended; otherwise, rebuilding without deleting the library beforehand leaves stale duplicate members in it):

c++ -c -x c++ -o my_lib-noncuda.o my_lib.cu
ar rc my_lib-noncuda.a my_lib-noncuda.o
ranlib my_lib-noncuda.a
nvcc -dc -o my_lib-cuda.o my_lib.cu
ar rc my_lib-cuda.a my_lib-cuda.o 
ranlib my_lib-cuda.a 

Building a CUDA program: (simplified by using only nvcc rather than c++; alternatively, have a look at @talonmies' community wiki answer)

nvcc -dc main.cu -o main.o
nvcc main.o my_lib-cuda.a my_lib-noncuda.a -o main

The link to my_lib-noncuda.a can be omitted if you also omit the #if and #ifndef in my_lib.cu as described above.

Building a C++ program: (given that there are #ifdef __CUDACC__ guards around the CUDA code in main.cu)

c++ -x c++ -c main.cu -o main.o
c++ main.o my_lib-noncuda.a -o main
BlameTheBits
  • The thing is that if I take this approach, I'll get clashes. That is, suppose my application directly uses my_lib on the device side and other_lib on the host side. Now, other_lib in turn uses my_lib on the host side. When I try to link everything together, the host-side functions will be found in two conflicting places, won't they? – einpoklum Dec 17 '19 at 17:02
  • If you do it as I have done, my_lib would consist of only device or only host code. So your example would not make sense. And conflicts are always something you have to look out for and with what I have done there are no additional potential conflicts. – BlameTheBits Dec 17 '19 at 20:54
  • So, this fails. That is, the second of the two lines for building the CUDA program fails, with: `nvlink error : Multiple definition of '_Z3fooPiS_' in 'my_lib-cuda.a:my_lib-cuda.rdc.o', first defined in 'my_lib-cuda.a:my_lib-cuda.o' nvlink fatal : merge_elf failed` - unless your changes are relative to what @talonmies wrote. – einpoklum Dec 17 '19 at 21:11
  • Oh. You're right. But only if I repeat some compilation steps. On the first compilation everything works fine for me. But that seems to be a problem with how the library is created: on the second run, my_lib-cuda.a is not rewritten but appended to. – BlameTheBits Dec 17 '19 at 22:09
  • Indeed, the problem is with the way I create the two libraries in the question. I need to do something else... but remember that the separation of host and device code is the crux of the question, not the compilation of the example program. – einpoklum Dec 17 '19 at 22:17
  • Yes. But isn't that exactly what I'm doing? The library `my_lib-cuda.a` only contains device code and `my_lib-noncuda.a` only contains host code. – BlameTheBits Dec 17 '19 at 22:20
  • I have added the missing parts. But building the libraries really stays the same. Of course you could also split the main file like @talonmies did, but I left that out, as I understood your question as "how to have an only-device and an only-host library". And in your example you chose not to include the kernel in the device library, so I thought that the content and the structure of the main file were not of importance. – BlameTheBits Dec 17 '19 at 22:43
  • Ok, now that it works for me, +1. But I don't quite understand _why_ this works: (1.) How can the switches to the `ar` commands affect the linking in a later command? There's no need for replacement until later. (2.) how come nvcc is willing to accept a `__device__`-only function in the `.cu`, when the header says `__host__ __device__`? – einpoklum Dec 17 '19 at 22:45
  • (1) The library is kind of just a collection of implementations. It does not care if it holds two implementations for the same function. But when you actually need to call the function, that is, when you link against the library, you will get a linking error. Therefore you need to make sure not to just blindly append files (that is, "implementations of functions") but to only append them if they are not already present. I honestly don't know when the "append" option, instead of "replace if already present, or add if not present", is ever a useful thing. – BlameTheBits Dec 17 '19 at 22:54
  • (2) You can actually have two completely different implementations of the function (see the sketch after these comments, and my first or second version of the answer). And of course it is always OK to have some declarations without definitions, as long as you add the definitions before you actually use them (which is done in the linking step). In summary you could say: the host version and the device version of `foo` are two completely different and independent functions. It may just be convenient for you to have a host and a device function with the same name (and maybe also save a few lines of duplicate code). – BlameTheBits Dec 17 '19 at 22:58
  • And won't I have a problem writing `__host__ __device__` functions later, which call `bar()`? – einpoklum Dec 17 '19 at 23:19
  • Only a host function can call `bar` (which is a host function) of course. But that's always the case. Or do you mean `__host__ __device__` functions which call `foo`? I see no problem with that. – BlameTheBits Dec 17 '19 at 23:26
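
To illustrate BlameTheBits' point (2) above (a sketch only, not code from either answer): under this scheme the device-side and host-side definitions of foo() are compiled by different compilers into different libraries, so nothing forces their bodies to match. The bodies below are deliberately different just to show that independence.

#include "my_lib.hpp"

#ifdef __CUDA_ARCH__
// seen only by nvcc's device compilation pass; ends up in my_lib-cuda.a
__device__ void foo(int* x, int* y) { *x = *y; }
#elif (not defined __CUDACC__)
// seen only by a plain C++ compiler; ends up in my_lib-noncuda.a
void foo(int* x, int* y) { *y = *x; }
#endif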