
I developed an algorithm using thrust. My office computer has one CUDA enabled card with architecture:

--- General information about Device 0 Name: Quadro 2000 Compute Capability: 2.1 Clock Rate: 1251000 kHz Device Overlap: Enabled Kernel Execution Timeout: Disabled

On this machine, my algorithm runs with no errors. However, a clean build on a lab machine throws a nasty thrust::system::system_error when attempting to generate a device_vector. Both machines are running RedHat 6 and are configured identically, with the exception of multiple graphics cards. This lab machine contains three CUDA enabled cards with the following architectures:

--- General information about Device 0 Name: Tesla C2050 Compute Capability: 2.0 Clock Rate: 1147000 kHz Device Overlap: Enabled Kernel Execution Timeout: Disabled

--- General information about Device 1 Name: Quadro 2000 Compute Capability: 2.1 Clock Rate: 1251000 kHz Device Overlap: Enabled Kernel Execution Timeout: Disabled

--- General information about Device 2 Name: Quadro 2000 Compute Capability: 2.1 Clock Rate: 1251000 kHz Device Overlap: Enabled Kernel Execution Timeout: Enabled

I know that thrust code needs to be compiled for the target architecture in order to work, so I set the CUDA device to 1 with cudaSetDevice(). However, the error persists.

As a debugging measure, I placed a cudaGetDevice() call immediately before the device_vector allocation; it correctly reports device 1.

int device;
CUDA_CHECK_RETURN(cudaGetDevice(&device), __FILE__, __LINE__);
std::cout << "Operating on device " << device << std::endl; // <-- device 1

// copy the turns to the runtime
thrust::device_vector<MalfunctionTurn> d_turns = turns; // <-- error here

I'm at my wits' end trying to debug this. Has anyone seen an error like this before? More importantly, is there a limitation of cudaSetDevice() that I'm not aware of? I'm concerned because two identical cards on different machines cannot run the same code.

Thanks in advance.


EDIT

Compile command line: `nvcc -rdc=true -arch=sm_21 -O3 file`
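(Aside: device 0 on the lab machine is a compute capability 2.0 card, while this build targets sm_21 only. If the binary ever needed to run on all three cards, my understanding is that a fat binary covering both architectures could be built roughly like this; untested sketch:)

    nvcc -rdc=true -O3 \
         -gencode arch=compute_20,code=sm_20 \
         -gencode arch=compute_21,code=sm_21 \
         file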

Here is a minimal example that reproduces the error:

#define DEVICE __device__
#define HOST __host__

#include <iostream>
#include <cstddef>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

template <typename T, std::size_t N>
class Container {
public:

    DEVICE HOST
    Container() {

    }

private:
    T data[N];
};

typedef Container<double, 7> double7;

template <std::size_t N = 10 >
class History {
public:

    DEVICE HOST
    History() {

    }

    DEVICE HOST
    virtual ~History() {

    }

private:

    double7 history[N];
};

int main() {

    try {

        thrust::host_vector<History<> > histories(1);
        thrust::device_vector<History<> > d_histories = histories;
    } catch (const thrust::system_error &) {
        std::cerr << "boo boo" << std::endl;
    }

    return 0;
}
Tyler Jandreau
  • Post a short, complete code that demonstrates the problem, along with your compile command line. If it is as you describe, it should take about 10 lines of code. Can you run other codes normally (e.g. CUDA samples like `deviceQuery` and `vectorAdd`) on the machine that is failing with the thrust code? – Robert Crovella Sep 24 '14 at 19:58
  • Yes, I'm able to use the same machine for non-thrust CUDA stuff. I'll write a small program and append it to this question (if this question does not get closed first) – Tyler Jandreau Sep 24 '14 at 20:00
    what is `JARSS.h` ? Can you make a self-contained example, rather than one that depends on a header file I don't have? This code runs correctly on one machine but not the other? – Robert Crovella Sep 24 '14 at 21:23
  • JARSS.h in this case only defines aliases for `__device__` and `__host__` Yes, this code throws that `thrust::system_error` on one machine and not the other. – Tyler Jandreau Sep 24 '14 at 21:56

1 Answer


As near as I can tell, your code is (under the hood) violating a CUDA restriction on the use of classes with virtual functions:

It is not allowed to pass as an argument to a __global__ function an object of a class with virtual functions.

If I take your code and remove:

#include <JARSS.h>

and replace it with:

#define HOST   __host__
#define DEVICE __device__

I can compile it. However, under the hood, thrust performs this line:

    thrust::device_vector<History<> > d_histories = histories;

by launching a kernel that takes as parameters the host objects to be copied to the device. (You can verify the kernel launch with, for example, nvprof.) This is common behavior for thrust. The problem arises because objects of classes with virtual destructors cannot be passed to a kernel this way.

Contrary to what your question states, this code should not run correctly on any CUDA machine.

You can "fix" your code by commenting out the virtual destructor. If you need polymorphic behavior, the fix to your actual code will probably be more complicated than that. It should still be possible to use polymorphic object behavior with thrust, see this answer for a worked example.

Robert Crovella
  • Very impressive answer. Thank you for taking the time. You are correct: the code should not be running on either machine. That's a problem on my end. I didn't even think of this while debugging. Much appreciated. – Tyler Jandreau Sep 24 '14 at 22:02