I'm running into an error when I compile CUDA code with relocatable device code enabled (-rdc=true). I'm using Visual Studio 2013 as the host compiler with CUDA 7.5. Below is a small example that reproduces the error. To clarify, the code runs fine with -rdc=false; the error only shows up with -rdc=true.

The error simply says: CUDA error 11 [\cuda\detail\cub\device\dispatch/device_radix_sort_dispatch.cuh, 687]: invalid argument

Then I found this, which says:

When invoked with primitive data types, thrust::sort, thrust::sort_by_key, thrust::stable_sort, thrust::stable_sort_by_key may fail to link in some cases with nvcc -rdc=true.

Is there some workaround to allow separate compilation?

main.cpp:

#include <stdio.h>
#include <exception>   // std::exception, caught in main()
#include <vector>
#include "cuda_runtime.h"
#include "RadixSort.h"

typedef unsigned int uint;
typedef unsigned __int64 uint64;

int main()
{
   RadixSort sorter;

   uint n = 10;
   std::vector<uint64> test(n);
   for (uint i = 0; i < n; i++)
      test[i] = i + 1;

   uint64 * d_array;
   uint64 size = n * sizeof(uint64);

   cudaMalloc(&d_array, size);
   cudaMemcpy(d_array, test.data(), size, cudaMemcpyHostToDevice);

   try
   {
      sorter.Sort(d_array, n);
   }
   catch (const std::exception & ex)
   {
      printf("%s\n", ex.what());
   }
}

RadixSort.h:

#pragma once
typedef unsigned int uint;
typedef unsigned __int64 uint64;

class RadixSort
{
public:
   RadixSort() {}
   ~RadixSort() {}

   void Sort(uint64 * input, const uint n);
};

RadixSort.cu:

#include "RadixSort.h"

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

void RadixSort::Sort(uint64 * input, const uint n)
{
    thrust::device_ptr<uint64> d_input = thrust::device_pointer_cast(input);
    thrust::stable_sort(d_input, d_input + n);
    cudaDeviceSynchronize();
}
RobbinMarcus
  • Regarding this: `Is there some workaround to allow separate compilation?` which GPU are you running on? – Robert Crovella May 30 '16 at 10:04
  • Currently the GTX 760. – RobbinMarcus May 30 '16 at 10:12
  • Try compiling with the architecture set to match your GTX 760, which should be cc3.0 I believe. – Robert Crovella May 30 '16 at 10:13
  • Thanks, compiling with compute_30 and sm_30 was indeed the solution. Any idea as to why _20 is not supported? – RobbinMarcus May 30 '16 at 11:09
  • @Spectrallic: Could you please add a short answer describing your solution for the next person who comes along with the same problem? – talonmies May 30 '16 at 12:30
  • This code, if compiled as described in the question (i.e. for cc2.0) and if actually run on a cc2.0 device, will run correctly. Therefore I think it's possible that the behavior here is simply another manifestation of a behavior I describe [here](http://stackoverflow.com/questions/29450778/why-intersection-of-thrust-library-is-returning-unexpected-result/29573827#29573827). If that is the case, the correct solution is simply to compile for the architecture of the GPUs you are running on, which is a general recommendation I make when compiling thrust codes. – Robert Crovella May 31 '16 at 15:49
  • Some changes were made in CUDA 8 which should allow this to compile and run correctly, even if sm_20 is specified on a cc3.0 device, for example. – Robert Crovella Oct 23 '16 at 02:18

1 Answer

As mentioned in the comments by Robert Crovella:

Changing the CUDA architecture to match the GPU solves this problem. In my case (a GTX 760) I changed it to compute_30 and sm_30 under CUDA C++ -> Device -> Code Generation.

Edit:

The general recommendation is to compile for the architecture that matches your specific GPU. See the link in the comments for additional information.
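
If you are not sure which architecture your GPU needs, one way to find out is to ask the CUDA runtime for the device's compute capability. The following is only a minimal sketch: `ccSuffix` and `printArchForDevice` are hypothetical helper names introduced here for illustration, not part of CUDA or Thrust; only `cudaGetDeviceProperties`, `cudaDeviceProp`, and `cudaGetErrorString` come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: turns a compute capability such as 3.0
// into the suffix used in architecture names, e.g. "30".
std::string ccSuffix(int major, int minor)
{
    return std::to_string(major) + std::to_string(minor);
}

// Hypothetical helper: prints the compute_XX / sm_XX pair that matches
// the given device. Returns false if the runtime reports an error.
bool printArchForDevice(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess)
    {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    // prop.major / prop.minor hold the compute capability of the device.
    std::string cc = ccSuffix(prop.major, prop.minor);
    printf("%s: compute_%s, sm_%s\n", prop.name, cc.c_str(), cc.c_str());
    return true;
}
```

Calling `printArchForDevice(0)` from a small test program reports the values for the first GPU. On a cc3.0 card such as the GTX 760 it should report compute_30 and sm_30, which are the values to enter in the Code Generation setting.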

RobbinMarcus