vector addition CUDA

Question

I am new to CUDA and working on the first exercise which is vector addition

#include<stdio.h>
#include<stdlib.h>
#include<math.h>

// Compute vector sum C = A+B
//CUDA kernel. Each thread performes one pair-wise addition

__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
//Get our global thread ID
int i = blockDim.x*blockIdx.x+threadIdx.x;

if (i<n) C[i] = A[i] + B[i];
}

int main(int argc, char* argv[])
{

//Size of vectors
int n = 100000;

int size = n * sizeof(float);


//Host input vectors
float *h_A, *h_B;
//Host output vector
float *h_C;

//Device input vectors
float *d_A, *d_B;
//Device output vector
float *d_C;

//Allocate memory for each vector on host
h_A = (float*)malloc(sizeof(size));
h_B = (float*)malloc(sizeof(size));
h_C = (float*)malloc(sizeof(size));

//Allocate memory for each vector on GPU
cudaMalloc( (void **) &d_A, size);
cudaMalloc( (void **) &d_B, size);
cudaMalloc( (void **) &d_C, size);

//Copy host vectors to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

int blockSize, gridSize;

//Number of threads in each block
blockSize = 1024;//Execute the kernel
vecAddKernel<<<gridSize,blockSize>>>(d_A, d_B, d_C, n);

//Synchronize threads
cudaThreadSynchronize();

//Copy array back to host
cudaMemcpy( h_C, d_C, size, cudaMemcpyDeviceToHost );


//Release device memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

//Release host memory
free(h_A);
free(h_B);
free(h_C);

return 0;
}

The compilation was succeeded, but while running the code I get: `Segmentation fault (core dumped). I do not see where the issue is. I've tried to use nvprof, but it's not helpful in any fashion. Can anyone help me to figure out where I made a mistake?

score 2 · Accepted Answer · answered Sep 05 '18 at 02:38

These statements are not correct:

h_A = (float*)malloc(sizeof(size));
h_B = (float*)malloc(sizeof(size));
h_C = (float*)malloc(sizeof(size));

They should be:

h_A = (float*)malloc(size);
h_B = (float*)malloc(size);
h_C = (float*)malloc(size);

malloc takes a parameter which is the size in bytes to be allocated. So when you pass a value of size you are passing a value of 400000, which is correct. When you pass a value of sizeof(size) you are passing a value which is the size of the size variable in bytes (i.e. how much space does the size variable itself occupy, to store its number). That is a value of 4 (an int is 4 bytes).

So when you do cudaMemcpy operations targetting say, h_A and asking to transfer 400000 bytes to/from whatever h_A points to, and you've only provided an allocation of 4 bytes there (for whatever it points to) you are going to overrun that and get a seg fault.

This has little to do with CUDA.

As an aside, nvprof would not be the right tool to tackle this. nvprof is a profiler, and it expects that the code to be profiled is functionally correct. You don't use nvprof to debug problems like this. It is not a debugger.

Since this is purely a host code issue (as all seg faults are, even in CUDA) you could just use a host debugger like gdb to debug this. Of course you could similarly use a tool like cuda-gdb to debug this. Likewise if you are on windows, there are debuggers built into eg. visual studio.

Finally, it's not going to help much with this particular issue, but I always recommend that anytime you are having trouble with a CUDA code, that you use proper CUDA error checking and run your code with cuda-memcheck.

Thank you so much for your answer, Robert. It helps a lot! – Monica Sep 05 '18 at 02:47 — Monica, Sep 05 '18 at 02:47

vector addition CUDA

1 Answers1