
I am using CUDA to add two matrices and store their sum in a third matrix. I wish to make use of the shared memory feature, and to that end I wrote the following:

#include <stdio.h>
#include <cuda.h>
#define grid 1024
#define BSZ 16

    __global__ void addition(int *dev_a, int *dev_b, int *dev_c)
    {
        __shared__ int as[BSZ][BSZ];
        __shared__ int bs[BSZ][BSZ];

        int by = blockIdx.y;
        int bx = blockIdx.x;

        int cvalue;

        int ty = threadIdx.y;
        int tx = threadIdx.x;

        int row = by * BSZ + ty;
        int col = bx * BSZ + tx;

        as[ty][tx] = dev_a[row*grid + col];
        bs[ty][tx] = dev_b[row*grid + col];
        __syncthreads();

        cvalue = as[ty][tx] + bs[ty][tx];
        __syncthreads();

        dev_c[row*grid + col] = cvalue;
    }

int main ()
{
    int a[grid][grid], b[grid][grid], c[grid][grid];
    //c = a + b
    for(int i=0;i<grid;i++)
    {
      for(int j=0;j<grid;j++)
      {
        a[i][j]=2;
        b[i][j]=1;
      }
    }

    printf("Working fine here");
    int *dev_a;
    int *dev_b;
    int *dev_c;
    int size = grid * grid * sizeof(int);

    printf("Working fine");
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );

    cudaMemcpy(dev_a,a,size,cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b,b,size,cudaMemcpyHostToDevice);

    dim3 dimBlock(BSZ,BSZ);
    dim3 dimGrid(grid/dimBlock.x,grid/dimBlock.y);

    //Kernel launch
    addition<<<dimGrid, dimBlock>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c,dev_c,size,cudaMemcpyDeviceToHost);

    for (int i=0; i<grid; i++)
    {
      for(int j=0;j<grid;j++)
        {
          printf( "%d + %d = %d\n", a[i][j], b[i][j], c[i][j] );
        }
    }
}

I am getting a segmentation fault, and I am not able to understand why. Could someone please help me with this?

Tanmay Agrawal
    How many "working fine"s did you get? It would be right civilized of you if you could tell us how far you got in your troubleshooting, instead of leaving us guessing. – Robert Harvey May 21 '13 at 04:00
  • How about if you start by doing proper [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api)? Sort out any errors reported there. Then, if you are still getting a seg fault, at least narrow the problem down to which line is producing it. This is standard debugging, not CUDA-specific, and you can do it with printf, or a debugger, or whatever method you like. – Robert Crovella May 21 '13 at 04:10
  • I didn't get any "working fine"! The program compiled with nvcc, but when I executed it, it showed a seg fault. I am new to CUDA, so I don't know much about macros. I tried to use HANDLE_ERROR, but that is not working here on the GPU machine in the laboratory. – Tanmay Agrawal May 21 '13 at 04:12
  • I just noticed that when I changed the grid size to 512, the program worked! Any clue why this is happening? – Tanmay Agrawal May 21 '13 at 04:20
  • @TanmayAgrawal: if you're on Linux, `gdb` (or `cuda-gdb`) and `valgrind` can help you track the error. – BenC May 21 '13 at 04:35

1 Answer

int a[1024][1024], b[1024][1024], c[1024][1024];

The size of these objects is astronomical: each array is 1024 × 1024 × sizeof(int) = 4 MiB, so the three together need 12 MiB, while a typical default stack is only around 8 MiB. You're almost certainly overflowing the stack. I think you'll find the segfaults vanish if you reduce their sizes, increase the size of your stack however your implementation permits (e.g. `ulimit -s` on Linux), or, better, allocate them with dynamic storage duration (e.g. `malloc`, or in your case `cudaMalloc`) rather than automatic storage duration.

autistic
  • Yeah, something like that is happening! I reduced the grid size to 512 and it worked! Any clue how to make it work for 1024? I am using cudaMalloc for the device pointers. Should I use malloc for the host pointers too? – Tanmay Agrawal May 21 '13 at 04:29
  • @TanmayAgrawal: You want to allocate memory on the heap (with `malloc()` for instance), not on the stack. You should read this answer: http://stackoverflow.com/a/79936/1043187. Note that your problem has nothing to do with CUDA, this is actually a C problem. – BenC May 21 '13 at 04:32
  • Use `malloc` or `cudaMalloc` to allocate such large objects for `a`, `b` and `c`... and make sure `sizeof (int) * 1024 * 1024` doesn't overflow `size_t`, otherwise you'll get much smaller objects than you wanted anyway... – autistic May 21 '13 at 04:34
  • @BenC Exactly, I got it. Thanks! – Tanmay Agrawal May 21 '13 at 04:35
  • @undefinedbehaviour Right! I will keep this in mind from now on. Thanks buddy! – Tanmay Agrawal May 21 '13 at 04:36