0

From the NVIDIA documentation, when PTX, CUBIN or FATBIN is generated, the host code gets discarded from the file. Now I have my host code (main.cu) and the device code (shared.cu). When compiling each file to *.o using the nvcc option nvcc -c main.cu shared.cu or even with nvcc -dc main.cu shared.cu and linking them with the option nvcc -link main.o shared.o, I can generate the executable. But when shared.cu is compiled to shared.cubin and further to *.o, then the linking fails with an error tmpxft_00001253_00000000-4_main.cudafe1.cpp:(.text+0x150): undefined reference to <KERNEL FUNCTION>

Here I wonder shared.cu contains only device code and even if the host code is removed why the linking should fail.

The source code files are main.cu

#include <stdio.h>
#include <cuda_runtime_api.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include "shared.h"
 int main()
{
        int a[5]={1,2,3,4,5};
        int b[5]={1,1,1,1,1};
        int c[5];
        int i;

        int *dev_a;
        int *dev_b;
        int *dev_c;

        cudaMalloc( (void**)&dev_a, 5*sizeof(int) );
        cudaMalloc( (void**)&dev_b, 5*sizeof(int) );
        cudaMalloc( (void**)&dev_c, 5*sizeof(int) );

        cudaMemcpy(dev_a, a , 5 * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b , 5 * sizeof(int), cudaMemcpyHostToDevice);

        add<<<1,5>>>(dev_a,dev_b,dev_c);

        cudaMemcpy(&c,dev_c,5*sizeof(int),cudaMemcpyDeviceToHost);

        for(i = 0; i < 5; i++ )
        {
                printf("a[%d] + b[%d] = %d\n",i,i,c[i]);
        }
        cudaFree( dev_a);
        cudaFree( dev_b);
        cudaFree( dev_c);
        return 0;
}

shared.cu

#include<stdio.h>

__global__  void add(int *dev_a, int *dev_b, int *dev_c){

        //allocate shared memory
        __shared__ int a_shared[5];
        __shared__ int b_shared[5];
        __shared__ int c_shared[5];
        {
                //get data in shared memory
                a_shared[threadIdx.x]=dev_a[threadIdx.x];
                __syncthreads();

                b_shared[threadIdx.x]=dev_b[threadIdx.x];
                __syncthreads();

                //perform the addition in the shared memory space
                c_shared[threadIdx.x]= a_shared[threadIdx.x] + b_shared[threadIdx.x];
                __syncthreads();

                //shift data back to global memory
                dev_c[threadIdx.x]=c_shared[threadIdx.x];
                __syncthreads();
        }
}

shared.h

#ifndef header
#define header

extern __global__  void add(int *dev_a, int *dev_b, int *dev_c);

#endif
Ginu Jacob
  • 1,588
  • 2
  • 19
  • 35

1 Answers1

2

I believe you are assuming that a "device code only" file (such as your shared.cu) contains no host code. This is not actually correct.

A kernel function generates both host and device code constructs, and these constructs are generated by the CUDA preprocessor (cudafe) and split apart. Refer to the CUDA compilation trajectory in the documentation.

Note that there is an initial separation of host and device code, followed by the creation later of the .cudafe1.stub.c file, which is then passed over to the host side (i.e. separated from the cubin path, effectively to enable linking).

As indicated in that diagram, the .cudafe1.stub.c file does not become part of the cubin, but enters the host-side processing stream, ultimately becoming part of the fatbinary file.

If you process only to cubin, you are discarding this .cudafe1.stub.c, and this is necessary for final link to create an executable fat binary.

Therefore attempting to create a fat binary link with with just a cubin will fail, due to the missing reference in the stub file, which reference is indicated in the error output in your question.

Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
  • Is it possible to generate .cudafe1.stub.c separately and use it with the cubin/fatbin file to do the final linking.... Or is there an option to avoid discarding host constructs when generating fatbin/cubin.... – Ginu Jacob Jun 22 '16 at 00:17
  • 1
    It's not possible using standard toolchain approaches. You can experiment with the nvcc options to keep intermediate files and using verbose output you can essentially reproduce the assembly sequence yourself. I'm not sure why you would want to. The toolchain will already provide the necessary linking if you create a fat binary using standard methods. Using PTX and cubin are primarily for supporting the CUDA driver API methods. – Robert Crovella Jun 22 '16 at 00:25
  • The only idea is to modify the PTX a little bit. With the post http://stackoverflow.com/questions/20012318/how-to-compile-ptx-code I was unable to succeed and got the following error : – Ginu Jacob Jun 22 '16 at 00:32
  • sh: 1: bin2c: not found In file included from t266.cudafe1.stub.c:1:0: t266.cudafe1.stub.c: In function ‘void __sti____cudaRegisterAll_12_t266_cpp1_ii_ea754d28()’: t266.cudafe1.stub.c:2:126: error: ‘__fatDeviceText’ was not declared in this scope #include "crt/host_runtime.h" sh: 1: bin2c: not found crt/link.stub: In function ‘void __cudaRegisterLinkedBinary(const __fatBinC_Wrapper_t*, void ()(void*), void*)’: crt/link.stub:102:60: error: ‘__fatDeviceText’ was not declared in this scope __cudaFatCubinHandle = __cudaRegisterFatBinary((void*)&__fatDeviceText); – Ginu Jacob Jun 22 '16 at 00:33
  • 1
    I guess you did something wrong then? I just ran through the recipe there and it still works for me. Did you try to use the script I posted or did you actually edit your own script from the dryrun.out file? Anyway I have just responded to your comment on that question with a link to a full console session demonstrating every step of the process. If you still have trouble, please don't try to sort it out in the comments of an unrelated question -- post a new question. – Robert Crovella Jun 22 '16 at 00:58