
I've created a simple but complete program based on this tutorial: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

//Kernel definition
__global__ void VecAdd(float* A, float* B, float* C,int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if(i < N)
        C[i] = A[i] + B[i];
}

//Host code
int main()
{
    int N = 1000;
    int i;
    FILE *f;
    size_t size = N * sizeof(float);

    //allocate input vectors h_A and h_B in host memory
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);  

    //Initialize input vectors
    f = fopen("A.txt","r");
    for(i=0;i<N;i++)
        fscanf(f,"%f ",&h_A[i]);
    fclose(f);  
    f = fopen("B.txt","r");
    for(i=0;i<N;i++)
        fscanf(f,"%f ",&h_B[i]);
    fclose(f);
    //Allocate vectors in device memory
    float *d_A;
    gpuErrchk(cudaMalloc(&d_A,size));
    float *d_B;
    cudaMalloc(&d_B,size);
    float *d_C;
    cudaMalloc(&d_C,size);

    gpuErrchk(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    //invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    f = fopen("C.txt","w");
    printf("%f \n",h_C[i]); 
    for(i=0;i<1000;i++)
        fprintf(f,"%f ",h_C[i]); 
    fclose(f);
    printf("Zakonczono obliczenia\n"); // "Computation finished"
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    //Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}

It should read two vectors from files, add them on the device and then print the output into the 'C.txt' file. However, it prints a thousand zeros.

After a little debugging, I found the culprit: the cudaMalloc function.

(cuda-gdb) n
42      cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
(cuda-gdb) n
43      cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
(cuda-gdb) print d_A[0]
$1 = 0
(cuda-gdb) print h_A[0]
$2 = 3.66192293

I wonder why it doesn't work; this part of the code was copied verbatim from the tutorial.
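One thing the listing above still doesn't check is the kernel launch itself, since a launch returns no status directly. A sketch of the usual pattern, assuming the `gpuErrchk` macro defined above (CUDA; requires a device to actually run):

```cuda
// After the launch: cudaGetLastError() catches launch-time errors
// (e.g. an invalid configuration), while cudaDeviceSynchronize()
// surfaces errors that occur while the kernel executes.
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
gpuErrchk(cudaGetLastError());
gpuErrchk(cudaDeviceSynchronize());
```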

0x6B6F77616C74
  • What is `cudaMemcpyHostToDevice`? A callback? – Fiddling Bits Apr 23 '15 at 23:44
  • @FiddlingBits No, it's a member of 'cudaMemcpyKind' enum. It describes the type of memory transfer. – 0x6B6F77616C74 Apr 23 '15 at 23:47
  • 1
    add [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to your code, and run your code with `cuda-memcheck`. Your thread index variable `i` in the kernel will also not let you scale to more than one threadblock's worth of data. You don't seem to have copied that from the "tutorial" correctly. – Robert Crovella Apr 24 '15 at 00:20
  • @RobertCrovella Kernel definition and invocation updated. – 0x6B6F77616C74 Apr 24 '15 at 00:36
  • 2
    Your code works for me as posted. The only thing that is a little squirrely is this line: `printf("%f \n",h_C[i]);` I think it should be something like this: `printf("%f \n",h_C[0]);` But I created `A.txt` and `B.txt` files composed of 1000 lines of `0.2` and the resultant `C.txt` file had a single line of 1000 iterations of `0.400000`. And `cuda-memcheck` reports no errors (in my case). So if your `C.txt` has garbage then I suspect a machine configuration issue (CUDA not functioning). If you add the proper error checking I linked to, it will probably shed some light on that. – Robert Crovella Apr 24 '15 at 03:55
  • I added the assertions as you advised. There are no errors on stderr. – 0x6B6F77616C74 Apr 24 '15 at 11:18

1 Answer


It seems that I had to restart my computer after installing CUDA, not just log out and back in. After that, my program works, but I can no longer debug it. There is

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
fatal: All CUDA devices are used for display and cannot be used while debugging. (error code = 24).

just after 'run'.

0x6B6F77616C74
  • If you cannot solve this error, you might be able to by-pass it, by using the onboard (integrated) gpu (if you have any) for display, and your device for computations. – Aperture Laboratories Apr 27 '15 at 09:32