
My Monte Carlo pi calculation CUDA program causes my NVIDIA driver to crash when I exceed around 500 trials and 256 full blocks. The crash seems to happen in the monteCarlo kernel function. Any help is appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>


#define NUM_THREAD 256
#define NUM_BLOCK 256



///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////

// Kernel to sum an array using a tree reduction in shared memory
__global__ void reduce0(float *g_odata) {
    extern __shared__ float sdata[];  // must match the data type (float, not int)

    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_odata[i];
    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) { // stride doubles each step
        if (tid % (2*s) == 0) { // only thread IDs divisible by the stride participate
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int incircle, k;
    float x, y, z;
    incircle = 0;

    curand_init(1234, i, 0, &states[i]);

    for(k = 0; k < trials; k++){
        x = curand_uniform(&states[i]);
        y = curand_uniform(&states[i]);
        z = (x*x + y*y);
        if (z <= 1.0f) incircle++;
    }
    __syncthreads();
    g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {

    float* solution = (float*)calloc(100, sizeof(float));
    float *sumDev, *sumHost, total;
    const char *error;
    int trials; 
    curandState *devStates;

    trials = 500;
    total = trials*NUM_THREAD*NUM_BLOCK;

    dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
    dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
    size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
    sumHost = (float*)calloc(NUM_BLOCK*NUM_THREAD, sizeof(float));

    cudaMalloc((void **) &sumDev, size); // Allocate array on device
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);


    cudaMalloc((void **) &devStates, (NUM_THREAD*NUM_BLOCK)*sizeof(curandState));
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);


    // Do calculation on device by calling CUDA kernel
    monteCarlo <<<dimGrid, dimBlock>>> (sumDev, trials, devStates);
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);

        // call reduction function to sum
    reduce0 <<<dimGrid, dimBlock, (NUM_THREAD*sizeof(float))>>> (sumDev);
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);

    dim3 dimGrid1(1,1,1);
    dim3 dimBlock1(256,1,1);
    reduce0 <<<dimGrid1, dimBlock1, (NUM_THREAD*sizeof(float))>>> (sumDev);
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);

    // Retrieve result from device and store it in host array
    cudaMemcpy(sumHost, sumDev, sizeof(float), cudaMemcpyDeviceToHost);
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);


    *solution = 4*(sumHost[0]/total);
    printf("%.*f\n", 1000, *solution);
    free (solution);
    free(sumHost);
    cudaFree(sumDev);
    cudaFree(devStates);
    //*solution = NULL;
    return 0;
}
zetatr

2 Answers


If smaller numbers of trials work correctly, and if you are running on MS Windows without the NVIDIA Tesla Compute Cluster (TCC) driver and/or the GPU you are using is attached to a display, then you are probably exceeding the operating system's "watchdog" timeout. If the kernel occupies the display device (or any GPU on Windows without TCC) for too long, the OS will kill the kernel so that the system does not become non-interactive.

The solution is to run on a non-display-attached GPU and if you are on Windows, use the TCC driver. Otherwise, you will need to reduce the number of trials in your kernel and run the kernel multiple times to compute the number of trials you need.

EDIT: According to the CUDA 4.0 CURAND docs (page 15, "Performance Notes"), you can improve performance by copying the state for a generator to local storage inside your kernel, then storing the state back (if you need it again) when you are finished:

curandState state = states[i]; // copy generator state to a local variable

for(k = 0; k < trials; k++){
    x = curand_uniform(&state);
    y = curand_uniform(&state);
    z = (x*x + y*y);
    if (z <= 1.0f) incircle++;
}
states[i] = state; // store the state back if you need it again

Next, the docs mention that generator setup is expensive, and suggest moving curand_init into a separate kernel. This can keep the cost of your MC kernel down so that you don't run up against the watchdog.

I recommend reading that section of the docs; there are several useful guidelines.
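A sketch of that split, assuming the same grid/block configuration as the question (this is my paraphrase, not the exact code from the docs):

```cuda
// Run once: the expensive generator setup lives in its own kernel.
__global__ void initRNG(curandState *states, unsigned long seed) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, i, 0, &states[i]);
}

// Run per batch: only cheap generation, with state in a register.
__global__ void monteCarlo(float *g_odata, int trials, curandState *states) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state = states[i];      // local copy lives in registers
    unsigned int incircle = 0;
    for (int k = 0; k < trials; k++) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) incircle++;
    }
    states[i] = state;                  // store back for the next launch
    g_odata[i] = (float)incircle;
}

// Host side:
//   initRNG <<<dimGrid, dimBlock>>> (devStates, 1234);
//   monteCarlo <<<dimGrid, dimBlock>>> (sumDev, trials, devStates);
```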

harrism
  • I am running Windows with my GPU attached to the display. I'm still surprised it would take so long for the kernel to complete. Could the curand_init and curand_uniform calls be the cause? – zetatr May 31 '11 at 02:16
  • Should be easy to find out -- replace the calls to `curand_uniform` with `1.0f`, and comment out `curand_init`. BTW, you don't need that `__syncthreads()`. – harrism May 31 '11 at 02:26
  • Thanks for notifying me about the sync. Also, yes, curand_uniform seems to make the kernel take significantly longer to finish. It's a shame, too, since I am not even getting good convergence with the current number of trials. Running more kernels would let me get better precision, but the program would take much longer for an unsatisfying number of correct digits. – zetatr May 31 '11 at 02:38
  • I added some performance tips from the docs to my answer -- I bet you can get the time down; this shouldn't be an expensive kernel. curand_uniform is only a few flops, and if you keep the state in a local variable, it will be kept in a register. I'm guessing the real expense is curand_init(), which the compiler probably dead-code-eliminated when you commented out curand_uniform(), making curand_uniform() seem expensive. Move curand_init into a separate kernel, keep the state in a local variable, and you should be much better off. You may want separate state for x and y, though... – harrism May 31 '11 at 03:11
  • Thanks! Those tips helped a lot. Putting curand_init into a separate kernel let me increase the number of trials by a couple of orders of magnitude. I also created a separate state array for y, with a different seed value and curand_init call. This increased the runtime a little but gave me at least one extra digit over before. Still, this Monte Carlo seems extremely slow to converge: I only have 4 correct digits with over 1.3 billion total trials. – zetatr May 31 '11 at 03:56
  • I'm no expert on MC, but have you experimented with different generators, such as one of the quasirandom generators? If you found my answer helpful, please accept it. – harrism May 31 '11 at 04:09
  • I'd just like to add that when running Linux, in addition to running on a non-display-attached GPU, the X server must also be shut down. In my case I was using lightdm on Ubuntu, so the following command was needed: `sudo service lightdm stop` – Adam27X Jan 23 '14 at 15:51
  • I don't believe that shutting down the X server is a requirement if you are running CUDA on a non-display GPU (such as a Tesla). – harrism Jan 23 '14 at 21:25

For those of you with a GeForce GPU that does not support the TCC driver, there is another solution, based on:

http://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx

  1. start regedit,
  2. navigate to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers,
  3. create a new DWORD value named TdrLevel and set it to 0,
  4. restart the PC.
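If you prefer the command line, the same registry change can be made from an elevated command prompt (this disables TDR entirely; a reboot is still required):

```shell
reg add "HKLM\System\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /t REG_DWORD /d 0 /f
```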

Now your long-running kernels should not be terminated. This answer is based on:

Modifying registry to increase GPU timeout, windows 7

I just thought it might be useful to provide the solution here as well.

Michal Hosala