I am writing a program in CUDA C. I have solved the problem the classic (sequential) way, but I need to parallelise the code using CUDA. The problem is: print all vectors of length n in which each element can take a value in [0 ... K] and for which the sum of all elements is SUM.

I've written the program in CUDA C, and it should return the number of vectors that satisfy the condition. The problem is that I can't find any error in the code, I don't know how to debug on Ubuntu, and the output is always 0. I think that the global function doesn't execute. This is the code; I hope that someone can help me:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>

#define MIN(x, y) (((x) < (y)) ? (x) : (y))
#define MYASSERT(condition) if(!(condition)) { return; }

__device__ void distribute2 (int vec[], int n, int k, int sum)
{
    int i;
    for (i =  blockIdx.x * blockDim.x + threadIdx.x;; i < n; i+=blockDim.x * gridDim.x)
    {
        vec[i]=MIN(sum, k);
        sum = sum- vec[i];
    }
    MYASSERT (sum == 0);
}
__global__ void moveUp (int vec[], int n, int k, int *res)
{
    int i;
    int collected = 0;
    for(i=blockIdx.x * blockDim.x + threadIdx.x; i<n;i+=blockDim.x * gridDim.x)
    {
        if (collected == 0)
            collected = vec[i];
        else
        {
            if (vec[i] < k)
            {
                vec[i] =vec[i]+1;
                distribute2 (vec, i, k, collected-1);
                __synchthreads();
                res[0]=res[0]+1;
            }
            else
            {
                collected += k;
            }
        }

    }
    MYASSERT(collected != 0);
}
int main()
{
    int n=5;
    int vec[n];
    int k=5;
    int sum=10;

    int *res_h, *res_d;
    size_t size = 1 * sizeof(int);
    res_h = (int *)malloc(size);
    cudaMalloc((void **) &res_d, size);
    res_h[0] = 0;
    cudaMemcpy(res_d, res_h, size, cudaMemcpyHostToDevice);

    cudaDeviceProp devProp;
    cudaGetDeviceProperties(&devProp, 0);
    unsigned maxbytes = devProp.totalGlobalMem / 3;
    unsigned max_samples = maxbytes / sizeof(int);

    if (n > max_samples) n = max_samples;

    printf("Using %d samples to estimate pi\n", n);

    moveUp<<<256, 256>>>(vec, n, k, res_d);
    cudaMemcpy(res_h, res_d, size, cudaMemcpyDeviceToHost);
    printf("%d\n", res_h[0]);
    return 0;
}
Dragon
  • Your code as posted does not compile and has syntax errors, which leads me to believe you have never run it or even successfully compiled it. Once you have fixed those errors, then add CUDA error checking as described [here](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) and you will discover an error that you should focus on next. After you have discovered that error, run your code through `cuda-memcheck` to get some clues as to what is going wrong. Is this homework? – Robert Crovella Apr 14 '13 at 22:18
  • It may have some syntax errors because I retyped the code to post it here as an example. I am running the code on Ubuntu, logged in through a shell. This is part of my master's studies, but I am new to CUDA and this is really strange to me. – Dragon Apr 15 '13 at 06:23
  • After you fix the syntax errors this code will run, but the kernel will not. If you instrument your code with cuda error checking, it will confirm that. If you run cuda-memcheck, it will show an out-of-bounds access in your kernel (`moveUp`) which should help you with how to fix it. – Robert Crovella Apr 15 '13 at 14:10
  • Can you show me an example of, or a link on, how to use error checking and `cuda-memcheck`? – Dragon Apr 17 '13 at 09:59
  • CUDA error checking is described [here](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). `cuda-memcheck` is described [here](https://developer.nvidia.com/cuda-memcheck). To run `cuda-memcheck` on Linux, just type `cuda-memcheck` followed by the command to run your program, such as `cuda-memcheck ./myprogram` – Robert Crovella Apr 17 '13 at 15:18
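
For readers following the comments above, here is a minimal sketch of what runtime error checking could look like. It is not taken from the linked answer: the macro name `CUDA_CHECK`, the kernel `fill`, and the file name used below are assumptions for illustration. It only relies on `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` from the CUDA runtime API.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (the name CUDA_CHECK is an assumption):
// evaluates a CUDA runtime call and aborts with a message if it failed.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: writes `value` into every element of `data`.
__global__ void fill(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main()
{
    const int n = 8;
    int host[n];
    int *dev = NULL;

    CUDA_CHECK(cudaMalloc((void **)&dev, n * sizeof(int)));

    fill<<<1, 32>>>(dev, n, 7);
    CUDA_CHECK(cudaGetLastError());       // reports errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));

    printf("%d\n", host[0]);
    return 0;
}
```

Assuming the file is saved as `check.cu`, it could be built with `nvcc check.cu -o check` and run under the memory checker with `cuda-memcheck ./check`. On recent CUDA toolkits `compute-sanitizer` takes the place of `cuda-memcheck`.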

1 Answer

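One problem in the code you have posted is that `vec` is a host pointer (a stack array declared in `main`) that you are dereferencing on the device. This causes your kernel to abort.

You must handle `vec` the same way you handle `res_h` and `res_d`: allocate a device buffer with `cudaMalloc`, copy the host data into it with `cudaMemcpy`, and pass the device pointer to the kernel.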
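
As an illustration of that pattern, here is a minimal sketch rather than a fix of the program in the question. The kernel `addOne` and the names `vec_h` and `vec_d` are hypothetical, chosen to mirror the `res_h`/`res_d` naming; error checking is left out to keep it short.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every element of vec.
__global__ void addOne(int *vec, int n)
{
    // Grid-stride loop, the same indexing style used in the question.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
    {
        vec[i] += 1;
    }
}

int main()
{
    const int n = 5;
    int vec_h[n] = {0, 1, 2, 3, 4};  // host copy
    int *vec_d = NULL;               // device copy
    size_t size = n * sizeof(int);

    // Allocate device memory and copy the host data into it.
    cudaMalloc((void **)&vec_d, size);
    cudaMemcpy(vec_d, vec_h, size, cudaMemcpyHostToDevice);

    // Pass the device pointer to the kernel, never the host array.
    addOne<<<1, 32>>>(vec_d, n);

    // This blocking copy on the default stream runs after the kernel finishes.
    cudaMemcpy(vec_h, vec_d, size, cudaMemcpyDeviceToHost);
    cudaFree(vec_d);

    for (int i = 0; i < n; i++) printf("%d ", vec_h[i]);
    printf("\n");
    return 0;
}
```

On a machine with a working CUDA device this should print `1 2 3 4 5`. Passing `vec_h` to the kernel instead of `vec_d` would reproduce the kind of invalid device access described above.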

Robert Crovella