I want to use CUDA 5.0 separate compilation and linking to write re-usable CUDA objects. I've set up the simple test below, but my kernel fails silently (it runs without error or exception and outputs junk).

My simple test (below) allocates an array of integers in CUDA device memory. The CUDA kernel should populate the array with sequential entries (0, 1, 2, ..., 9). The device array is copied to CPU memory and output to the console.

Currently, this code outputs "0,0,0,0,0,0,0,0,0," instead of the desired "0,1,2,3,4,5,6,7,8,9,". It is compiled with VS2010 and CUDA 5.0 (with compute_35 and sm_35 set), running on 64-bit Windows 7 with a GeForce GTX 580.

In Test.h:

class Test
{
public:
    Test();
    ~Test();
    void Run();
private:
    int* cuArray;
};

In Test.cu:

#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>

#include "Test.h"

#define ARRAY_LEN 10


__global__ void kernel(int *p)
{
    int elemID = blockIdx.x * blockDim.x + threadIdx.x;
    p[elemID] = elemID;
}

Test::Test() 
{
    cudaMalloc(&cuArray, ARRAY_LEN * sizeof(int));
}


Test::~Test() 
{
    cudaFree(cuArray);
}


void Test::Run()
{
    kernel<<<1,ARRAY_LEN>>>(cuArray);
    // Copy the array contents to CPU-accessible memory
    int cpuArray[ARRAY_LEN];
    cudaMemcpy(static_cast<void*>(cpuArray), static_cast<void*>(cuArray), ARRAY_LEN * sizeof(int), cudaMemcpyDeviceToHost);

    // Write the array contents to console
    for (int i = 0; i < ARRAY_LEN; ++i)
        printf("%d,", cpuArray[i]);
    printf("\n");
}

In main.cpp:

#include <iostream>
#include "Test.h"
int main()
{

    Test t;
    t.Run();
}

I've experimented with the DECLs (__device__ __host__) as suggested by @harrism but to no effect.

Can anyone suggest how to make this work? (The code works when it isn't inside a class.)

axon
  • The GTX580 does not support sm_35, so I'm guessing you are running this on a GTX680 (Kepler). – Oguz Meteer Apr 02 '13 at 10:00
  • Thanks. I'll check the docs and see what I can do (highest sm my card supports). I've def got a 580. Will post the result tomorrow. – axon Apr 02 '13 at 10:05
  • @OguzMeteer, You can't just guess. Also, GTX 680 is Compute 3.0, not 3.5. – sgarizvi Apr 02 '13 at 11:03
  • You say your code runs without error or exception, but you are not doing any CUDA error checking, so you wouldn't know. – Robert Crovella Apr 02 '13 at 11:48
  • @robert I do, because exceptions are reported from the Visual Studio console... and for simplicity I took out the CHECKED_CALL() macro from the posted version of the code. Thanks anyway. – axon Apr 02 '13 at 22:47
  • There's something wrong with your error checking then. You should probably [review how to do it](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). A kernel that does not run because it is compiled for the wrong device will throw an error when you try to launch it. You cannot use a CHECKED_CALL() type macro directly on a kernel launch to catch this type of error. – Robert Crovella Apr 03 '13 at 13:54
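
For reference, the launch error checking suggested in the last comment amounts to something like the sketch below. It reuses kernel, cuArray and ARRAY_LEN from the question; the printf-based reporting is just illustrative.

// In Test::Run(), immediately after the kernel launch:
kernel<<<1,ARRAY_LEN>>>(cuArray);
cudaError_t err = cudaGetLastError();   // catches launch failures, e.g. a kernel built for the wrong architecture
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();      // catches errors raised while the kernel executes
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));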

1 Answer


The device you are using is a GTX 580, whose compute capability is 2.0. If you compile the code for any architecture greater than 2.0, the kernel will not run on your device and the output will be garbage. Compile the code for compute capability 2.0 or lower and it will run fine.
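
For illustration, assuming a plain nvcc command line (the asker builds through VS2010, where the equivalent setting is the Code Generation field in the project's CUDA C/C++ properties), targeting compute capability 2.0 would look something like this; -rdc=true is only there to keep the separate compilation/linking the question mentions:

nvcc -gencode arch=compute_20,code=sm_20 -rdc=true -c Test.cu -o Test.obj

With only compute_35/sm_35 in the build, the resulting binary contains no code a Fermi-class GPU can execute, so the launch fails, and without error checking it fails silently.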

sgarizvi
  • Spot on. I checked the NVCC documentation and this was the cause of my troubles. Thanks to OguzMeteer for spotting this. – axon Apr 02 '13 at 22:46