I am trying to run basic getting-started examples for CUDA/OpenCL GPU computing on Ubuntu 14 with a GeForce GTX 660M graphics card.

Although I managed to compile and run the sample code, it seems that either the GPU isn't computing anything or the cudaMemcpy operation doesn't work: my result values are not updated after invoking the kernel and performing the device-to-host copy.

I wonder whether I need to install a particular native NVIDIA driver on Ubuntu in order to use CUDA or OpenCL.

This is my basic getting-started code (for CUDA):

#include <iostream>

using namespace std;

// global constants
#define THREADS 4

const int N = 100;

int fill_content = 1;

__global__ void sum(int* a, int* b, int* c)
{
    int i = blockIdx.x * blockDim.x * threadIdx.x;
    c[i] = a[i] + b[i];
}

void check( int* a, int N )
{
    cout << endl;

    for(int i = 0; i < N; ++i)
    {
        int num = a[i];
        cout << i << ": " << num << endl;
    }

    cout << endl;
}

void fill_vectors(int*p , int size)
{
    for(int i = 0; i < size; ++i)
    {
        p[i] = fill_content;
    }
}

int main(int argc, char **argv)
{
    int host_a[N], host_b[N], host_c[N];
    size_t s_a,s_b,s_c;
    s_a = s_b = s_c = sizeof(int) * N;
    int *dev_a, *dev_b, *dev_c;


    // allocate memory on the device for calculation input and results
    cudaMalloc(&dev_a, s_a);
    cudaMalloc(&dev_b, s_b);
    cudaMalloc(&dev_c, s_c);

    fill_content = 1;
    fill_vectors(host_a, N);

    fill_content = 2;
    fill_vectors(host_b, N);

    fill_content = 0;
    fill_vectors(host_c, N);

    // copy the input values to the gpu-memory
    cudaMemcpy(dev_a, host_a, s_a, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, host_b, s_b, cudaMemcpyHostToDevice);

    // invokes kernel-method sum on device using device-memory dev_a, dev_b, dev_c
    //sum<<<N/THREADS, THREADS,1>>>(dev_a, dev_b, dev_c);

    // copy the result values back from the device_memory to the host-memory
    cudaMemcpy(host_c, dev_c, s_c, cudaMemcpyDeviceToHost);

    // free memory allocated on device (for input and result values)
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);

    // expected to print out 3
    check(host_c,N);
}

I compile it with:

nvcc -o vector-sum2 vector-sum2.cu

I have nvidia-cuda-toolkit installed.

As explained above, it only outputs 0 for each array element:

0: 0
1: 0
2: 0
3: 0
4: 0
5: 0

... continuing.

Do you know what I need to change for this example to work?

ArchLinuxTux

1 Answer

First of all, your kernel call is commented out:

//sum<<<N/THREADS, THREADS,1>>>(dev_a, dev_b, dev_c);

So your output is all zero because you're not actually running the kernel.

If you uncomment the kernel launch, other problems appear. Any time you're having trouble with CUDA code, you should use proper CUDA error checking and run your code with cuda-memcheck.
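As a minimal sketch of what such error checking might look like: the macro name CUDA_CHECK and the empty kernel noop below are hypothetical names chosen for illustration and are not part of the CUDA toolkit. Only the runtime calls (cudaMalloc, cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString, cudaFree) are real API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (an assumption, not part of the CUDA toolkit):
// evaluate a runtime API call and abort with a readable message if it
// does not return cudaSuccess.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                         __FILE__, __LINE__, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Placeholder kernel used only to demonstrate checking a launch.
__global__ void noop() {}

int main()
{
    int *dev_buf = 0;
    CUDA_CHECK(cudaMalloc(&dev_buf, 100 * sizeof(int)));

    noop<<<1, 1>>>();
    // A kernel launch returns no status itself, so query it explicitly.
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch time
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaFree(dev_buf));
    std::printf("all CUDA calls succeeded\n");
    return 0;
}
```

Wrapping every runtime call in the question's code the same way should make a driver or setup problem show up as an explicit error message rather than as silently unchanged output.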

Uncommenting the kernel and running with cuda-memcheck reveals lots of out-of-bounds accesses by the kernel. These are ultimately due to this line of code:

int i = blockIdx.x * blockDim.x * threadIdx.x;

That is not the correct way to create a unique thread index. Instead we want:

int i = blockIdx.x * blockDim.x + threadIdx.x;
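Put together, a corrected version of the program might look like the sketch below. The bounds guard (the extra parameter n), the rounded-up block count, and the helper name run_sum are my own additions for illustration. They are not strictly required here, since N = 100 is divisible by THREADS = 4, but they keep the kernel safe if N changes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int N = 100;
const int THREADS = 4;

__global__ void sum(const int *a, const int *b, int *c, int n)
{
    // Unique global index: block offset PLUS the thread's position in the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard (an addition): skip threads past the end of the arrays
        c[i] = a[i] + b[i];
}

// Hypothetical helper: returns the number of wrong results (0 on success),
// or -1 if any CUDA call failed.
int run_sum()
{
    int host_a[N], host_b[N], host_c[N];
    for (int i = 0; i < N; ++i) { host_a[i] = 1; host_b[i] = 2; host_c[i] = 0; }

    int *dev_a = 0, *dev_b = 0, *dev_c = 0;
    size_t bytes = N * sizeof(int);

    cudaError_t err = cudaMalloc(&dev_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dev_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dev_c, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dev_a, host_a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dev_b, host_b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int blocks = (N + THREADS - 1) / THREADS;  // round up
        sum<<<blocks, THREADS>>>(dev_a, dev_b, dev_c, N);
        err = cudaGetLastError();
    }
    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(host_c, dev_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);  // freeing a null pointer is a no-op

    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (host_c[i] != 3) ++mismatches;
    return mismatches;
}

int main()
{
    int result = run_sum();
    std::printf("run_sum returned %d\n", result);
    return result == 0 ? 0 : 1;
}
```

It can be built with the same nvcc command as in the question and then run as cuda-memcheck ./vector-sum2 to confirm that the out-of-bounds accesses are gone.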

With those changes, your code runs correctly for me. If it still doesn't work for you, you may have a problem with machine setup, in which case the proper cuda error checking will likely give you some clues about it.
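To check the machine setup on its own, one option is a small device query like the sketch below. The helper name count_devices is hypothetical. The calls it uses (cudaGetDeviceCount, cudaGetDeviceProperties, cudaDriverGetVersion, cudaRuntimeGetVersion) are standard runtime API. If the NVIDIA driver is missing or too old for the installed toolkit, I would expect the first call to fail with a descriptive error.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of CUDA-capable devices, or -1 on error.
int count_devices()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    return count;
}

int main()
{
    int count = count_devices();
    if (count <= 0) {
        std::fprintf(stderr, "no usable CUDA device found\n");
        return 1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) == cudaSuccess)
            std::printf("device %d: %s (compute capability %d.%d)\n",
                        d, prop.name, prop.major, prop.minor);
    }

    // Versions are encoded as 1000 * major + 10 * minor.
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);
    cudaRuntimeGetVersion(&runtime);
    std::printf("driver supports CUDA %d, runtime is CUDA %d\n", driver, runtime);
    return 0;
}
```

If the driver version printed is lower than the runtime version, the installed driver is older than the toolkit expects, which would point to a setup problem rather than a bug in the code.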

Robert Crovella