
I am writing a basic CUDA program to get a better understanding of the language. I have written something very simple that just adds two vectors in parallel and prints the results to a PPM file. Right now, the values in the vectors are irrelevant; I plan on adjusting them later to produce some kind of interesting image. The issue is that the resolution of the image (which is really just the size of the result vector) causes the program to crash almost instantly if I make it too large. Consider the program as it is now:

#include <stdio.h>

#define cols 500
#define rows 50
#define arraySize rows * cols

__global__ void addOnGPU(int *a, int *b, int *c) {
    // Only use data at this index
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < arraySize) c[tid] = a[tid] + b[tid];
}

int main()
{
    FILE *ppm_fp;
    int a[arraySize], b[arraySize], c[arraySize];
    int *dev_a, *dev_b, *dev_c;
    int i, j;
    int threadsperblock = 256;
    int blocks = (arraySize + threadsperblock - 1) / threadsperblock;

    printf("1\n");
    // Allocate memory on GPU for the three vectors
    cudaError_t cudaStatus = cudaMalloc((void **) &dev_a, arraySize * sizeof(int));
    cudaStatus = cudaMalloc((void **) &dev_b, arraySize * sizeof(int));
    cudaStatus = cudaMalloc((void **) &dev_c, arraySize * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Unable to allocate memory on the GPU!");
        return 1;
    }

    printf("2\n");
    // Assign values to input vectors
    for (i = 0, j = 0; i < arraySize; i++, j++) {
        a[i] = i;
        b[i] = i * i;
    }

    printf("3\n");
    // Copy input values to allocated vectors in GPU memory
    cudaStatus = cudaMemcpy(dev_a, a, arraySize * sizeof(int), cudaMemcpyHostToDevice);
    cudaStatus = cudaMemcpy(dev_b, b, arraySize * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Unable to copy input vectors to the GPU!");
        return 1;
    }

    printf("before\n");
    // Add vectors in parallel and save results in dev_c
    addOnGPU<<<blocks, threadsperblock>>>(dev_a, dev_b, dev_c);
    printf("after\n");

    // Copy results from dev_c to local c vector
    cudaStatus = cudaMemcpy(c, dev_c, arraySize * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Unable to copy input vectors to the GPU!");
        return 1;
    }

    ppm_fp = fopen("image.ppm", "wb");
    fprintf(ppm_fp, "P6\n%d %d\n255\n", cols, rows);
    for (i = 0; i < arraySize; i++) {
        if (i % (3 * cols) == 0) fprintf(ppm_fp, "\n");
        fprintf(ppm_fp, "%d ", c[i]);
    }

    // Display contents of output vector
    for (i = 0; i < arraySize; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    printf("\n");

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

As it stands, the program runs fine with those values of cols and rows. If I increase rows to 500, the program crashes. I have included a few debug print statements in an attempt to find where it crashes, but it crashes as soon as I run it. I am running it in Visual Studio 2013 (which I am a novice at using; I am much more familiar with vi, Linux, and compiling manually). I have a GTX 580 3GB version, if that matters. I know there is no way I am going over any memory limits, and I am not exceeding the 65536 (or is it 65535?) limit on the number of blocks that can be created, or the 512-thread limit per block. Any ideas on what is going wrong?

Thank you

halexh
  • I suspect that you are reaching the limits of the statically allocatable memory. Try changing the static allocations `a[arraySize]`, `b[arraySize]` and `c[arraySize]` to dynamic ones using `malloc`. – Vitality Nov 21 '13 at 17:34
  • I agree with @JackOLantern. Your code worked fine for `500` rows and columns on my Ubuntu machine, but failed when I increased the rows to `5000` – pQB Nov 21 '13 at 17:37
  • @JackOLantern if you post an answer I will upvote. – Robert Crovella Nov 21 '13 at 17:53

1 Answer


The crash you are observing is not related to CUDA; it is due to hitting memory limits with the C/C++ static array allocations

int a[arraySize], b[arraySize], c[arraySize];

Statically allocated arrays are placed on the stack, which in general has size restrictions. Arrays dynamically allocated with the syntax

int* a = (int*)malloc(arraySize*sizeof(int));

are placed on the heap, which in general can grow during program execution as more memory is required. On the other hand, heap allocation is slower than stack allocation due to the overhead of managing dynamic memory.
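As an example (just a minimal sketch, assuming nothing else in your program changes), the three host arrays in your code could be moved to the heap like this:

int *a = (int*)malloc(arraySize * sizeof(int));
int *b = (int*)malloc(arraySize * sizeof(int));
int *c = (int*)malloc(arraySize * sizeof(int));
if (a == NULL || b == NULL || c == NULL) {
    fprintf(stderr, "Unable to allocate host memory!");
    return 1;
}

// ... same cudaMalloc / cudaMemcpy / kernel launch as before ...

free(a);
free(b);
free(c);

The rest of the program can stay as it is, since a, b and c are still usable with the same indexing syntax.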

You can find plenty of useful material on the web explaining the differences between stack and heap memory; see for example

Memory : Stack vs Heap

and the StackOverflow protected question

What and where are the stack and heap?

As a closing remark, let me say that it is always a good idea to do proper CUDA error checking, in the sense of the post

What is the canonical way to check for errors using the CUDA runtime API?

This is now also mentioned in the CUDA tag wiki. It would probably have helped you rule out CUDA errors by yourself.
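Just as an illustration (a minimal sketch along the lines of that post; the names gpuErrchk and gpuAssert are only an illustrative choice), such a check can be wrapped in a small helper:

#include <stdio.h>
#include <stdlib.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        // Print the error string along with the file and line of the failing call
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

You would then wrap every runtime API call, e.g. gpuErrchk(cudaMalloc((void **) &dev_a, arraySize * sizeof(int)));, and after a kernel launch check gpuErrchk(cudaPeekAtLastError()); followed by gpuErrchk(cudaDeviceSynchronize());.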

Vitality