I am writing a basic CUDA program to get a better understanding of the language. It does something very simple: it adds two vectors in parallel and writes the results to a PPM file. Right now the values in the vectors are irrelevant; I plan to adjust them later to produce some kind of interesting image. The issue is that the resolution of the image (which is really just the size of the result vector) causes the program to crash almost instantly if I make it too large. Consider the program as it is now:
#include <stdio.h>
#define cols 500
#define rows 50
#define arraySize (rows * cols)
__global__ void addOnGPU(int *a, int *b, int *c) {
    // Only use data at this index
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < arraySize) c[tid] = a[tid] + b[tid];
}
int main()
{
    FILE *ppm_fp;
    int a[arraySize], b[arraySize], c[arraySize];
    int *dev_a, *dev_b, *dev_c;
    int i;
    int threadsperblock = 256;
    int blocks = (arraySize + threadsperblock - 1) / threadsperblock;
    printf("1\n");
    // Allocate memory on GPU for the three vectors
    cudaError_t cudaStatus = cudaMalloc((void **) &dev_a, arraySize * sizeof(int));
    cudaStatus = cudaMalloc((void **) &dev_b, arraySize * sizeof(int));
    cudaStatus = cudaMalloc((void **) &dev_c, arraySize * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Unable to allocate memory on the GPU!");
        return 1;
    }
    printf("2\n");
    // Assign values to input vectors
    for (i = 0; i < arraySize; i++) {
        a[i] = i;
        b[i] = i * i;
    }
    printf("3\n");
    // Copy input values to allocated vectors in GPU memory
    cudaStatus = cudaMemcpy(dev_a, a, arraySize * sizeof(int), cudaMemcpyHostToDevice);
    cudaStatus = cudaMemcpy(dev_b, b, arraySize * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Unable to copy input vectors to the GPU!");
        return 1;
    }
    printf("before\n");
    // Add vectors in parallel and save results in dev_c
    addOnGPU<<<blocks, threadsperblock>>>(dev_a, dev_b, dev_c);
    printf("after\n");
    // Copy results from dev_c to local c vector
    cudaStatus = cudaMemcpy(c, dev_c, arraySize * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Unable to copy results back from the GPU!");
        return 1;
    }
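    // Write the result vector out as a (placeholder) PPM image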
    ppm_fp = fopen("image.ppm", "wb");
    fprintf(ppm_fp, "P3\n%d %d\n255\n", cols, rows);   // P3 header, since the samples are written as ASCII text
    for (i = 0; i < arraySize; i++) {
        if (i % (3 * cols) == 0) fprintf(ppm_fp, "\n");
        fprintf(ppm_fp, "%d ", c[i]);
    }
    fclose(ppm_fp);
    // Display contents of output vector
    for (i = 0; i < arraySize; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    printf("\n");
    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    return 0;
}
As it stands, the program runs fine with those values of cols and rows. If I increase rows to 500, the program crashes. I have included a few debug print statements in an attempt to find where it crashes, but it crashes as soon as I run it. I am building and running it in Visual Studio 2013 (which I am a novice at using; I am much more familiar with vi, Linux, and compiling by hand). I have the 3 GB version of the GTX 580, if that matters. I know there is no way I am going over any memory limits, and I am not exceeding the 65536 (or is it 65535?) limit on the number of blocks that can be created, or the 512 limit of threads per block. Any ideas on what is going wrong?
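For reference, here is the back-of-the-envelope arithmetic I am basing those claims on for the rows = 500 case. It is a separate little check program, not output from the code above; it just mirrors the same launch-configuration math with 256 threads per block:

#include <stdio.h>
// Quick sanity check of the launch configuration and device memory use for
// rows = 500, cols = 500 (not part of the program above)
int main(void)
{
    const int rows = 500, cols = 500;
    const int arraySize = rows * cols;                               // 250,000 elements
    const int threadsperblock = 256;
    const int blocks = (arraySize + threadsperblock - 1) / threadsperblock;

    printf("elements           : %d\n", arraySize);                  // 250000
    printf("blocks of %d      : %d\n", threadsperblock, blocks);     // 977, well under 65535
    printf("device memory (MB) : %.2f\n",
           3.0 * arraySize * sizeof(int) / (1024.0 * 1024.0));       // ~2.86 MB for dev_a, dev_b, dev_c
    return 0;
}

So with 256 threads per block I end up with 977 blocks and roughly 3 MB of device memory for the three vectors, which is why I don't think I am hitting the block-count, threads-per-block, or GPU memory limits.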
Thank you