
Ok, I'm pretty new to CUDA, and I'm kind of lost, really lost.

I'm trying to calculate pi using the Monte Carlo Method, and at the end I just get one add instead of 50.

I don't want to wrap the kernel call in a "do while" loop on the host, since that's too slow. My issue is that my code doesn't loop; the kernel body executes only once.

I'd also like all the threads to access the same niter and pi, so that when one thread hits the counter limit, all the others stop.

#define SEED 35791246

__shared__ int niter;
__shared__ double pi;

__global__ void calcularPi(){

    double x;
    double y;
    int count;
    double z;

    count = 0;
    niter = 0;

    //keep looping
    do{

        niter = niter + 1;

        //Generate random number
        curandState state;
        curand_init(SEED,(int)niter, 0, &state);
        x = curand(&state);
        y = curand(&state);

        z = x*x+y*y;
        if (z<=1) count++;
     pi =(double)count/niter*4;

    }while(niter < 50);

}

int main(void){

    float tempoTotal;
    //Start timer
    clock_t t;
    t = clock();

    //call kernel
    calcularPi<<<1,32>>>();

    //wait while kernel finish
    cudaDeviceSynchronize();

    typeof(pi) piFinal;
    cudaMemcpyFromSymbol(&piFinal, "pi", sizeof(piFinal),0, cudaMemcpyDeviceToHost);

    typeof(niter) niterFinal;
    cudaMemcpyFromSymbol(&niterFinal, "niter", sizeof(niterFinal),0, cudaMemcpyDeviceToHost);

    //Ends timer
    t = clock() - t;
    tempoTotal = ((double)t)/CLOCKS_PER_SEC;
    printf("Pi: %g \n", piFinal);
    printf("Adds: %d \n", niterFinal);
    printf("Total time: %f \n", tempoTotal);

}
GAltelino
  • I do not want to be rude but I must ask if you understand that all the threads execute the kernel? And that when you use shared memory it is up to you to avoid race conditions and other synchronisation problems? – terence hill Nov 21 '15 at 21:07
    every thread (of the 32 you are launching) is doing essentially the same thing. This means as threads write to `niter` and `pi` they will potentially be stepping on each other. Furthermore, since cuda 5.0 we don't use string parameters in the cudaMemcpy..Symbol operations. You're launching 32 threads. What do you want each one of them to do? Do you want a separate `niter` variable for each thread, or are they all supposed to update the same one? Do you want them to compute separate `pi` variables? Or do you want them all to work on the same one? – Robert Crovella Nov 21 '15 at 21:08
    Here is a tutorial on the pi calculation in CUDA: https://www.olcf.ornl.gov/tutorials/cuda-monte-carlo-pi/ – terence hill Nov 21 '15 at 21:12
    also, `cudaMemcpy..Symbol` is not used on `__shared__` variables. – Robert Crovella Nov 21 '15 at 21:16
  • You also need to seed your random number generator _outside_ of the loop. – void_ptr Nov 21 '15 at 21:19
  • @RobertCrovella, I want that all the threads work on the same pi and the same niter. – GAltelino Nov 21 '15 at 21:31
  • Btw, I don't care about the pi result, just need to show that the work is being done in all the threads. – GAltelino Nov 21 '15 at 21:33

1 Answer


There are a variety of issues with your code.

  1. I suggest using proper CUDA error checking and running your code with cuda-memcheck to spot any runtime errors. I've omitted proper error checking in the code below for brevity of presentation, but I've run it with cuda-memcheck to confirm there are no runtime errors.

  2. Your usage of curand() is probably not correct: it returns pseudorandom unsigned integers over a large range. For this code to work correctly, you want a floating-point quantity between 0 and 1. The correct call for that is curand_uniform().

  3. Since you want all threads to work on the same values, you must prevent those threads from stepping on each other. One way to do that is to use atomic updates of the variables in question.

  4. It should not be necessary to re-run curand_init on each iteration. Once per thread should be sufficient.

  5. We don't use cudaMemcpy..Symbol operations on __shared__ variables. For convenience, and to preserve something that resembles your original code, I've elected to convert those to __device__ variables.
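
As an aside on point 1: one common error-checking pattern (just a sketch, not the only way to do it; the cudaCheck macro name is my own) is to wrap every runtime API call so that failures are reported with file and line:

#define cudaCheck(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

/* usage: */
cudaCheck(cudaDeviceSynchronize());
cudaCheck(cudaMemcpyFromSymbol(&count_final, count, sizeof(int)));

Note that kernel launches themselves don't return a cudaError_t; a common convention is to follow the launch with cudaCheck(cudaGetLastError()).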

Here's a modified version of your code that has most of the above issues fixed:

$ cat t978.cu
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>

#define ITER_MAX 5000
#define SEED 35791246

__device__ int niter;
__device__ int count;

__global__ void calcularPi(){

    double x;
    double y;
    double z;
    int lcount;
    curandState state;
    curand_init(SEED,threadIdx.x, 0, &state);
    //keep looping
    do{

        lcount = atomicAdd(&niter, 1);

        //Generate random number
        x = curand_uniform(&state);
        y = curand_uniform(&state);

        z = x*x+y*y;
        if (z<=1) atomicAdd(&count, 1);

    }while(lcount < ITER_MAX);

}

int main(void){

    float tempoTotal;
    //Start timer
    clock_t t;
    t = clock();
    int count_final = 0;
    int niter_final = 0;
    cudaMemcpyToSymbol(niter, &niter_final, sizeof(int));
    cudaMemcpyToSymbol(count, &count_final, sizeof(int));
    //call kernel
    calcularPi<<<1,32>>>();

    //wait while kernel finish
    cudaDeviceSynchronize();
    cudaMemcpyFromSymbol(&count_final, count, sizeof(int));
    cudaMemcpyFromSymbol(&niter_final, niter, sizeof(int));

    //Ends timer
    double pi = count_final/(double)niter_final*4;
    t = clock() - t;
    tempoTotal = ((double)t)/CLOCKS_PER_SEC;
    printf("Pi: %g \n", pi);
    printf("Adds: %d \n", niter_final);
    printf("Total time: %f \n", tempoTotal);

}
$ nvcc -o t978 t978.cu -lcurand
$ cuda-memcheck ./t978
========= CUDA-MEMCHECK
Pi: 3.12083
Adds: 5032
Total time: 0.558463
========= ERROR SUMMARY: 0 errors
$

I've increased the iteration count, but you can set ITER_MAX to 50 if you prefer.

Note that there are many criticisms that could be levelled against this code. My aim here, since this is clearly a learning exercise, is to point out the minimum number of changes needed to get functional code using the algorithm you've outlined. As just one example, you might want to change your kernel launch configuration (<<<1,32>>>) to other, larger numbers, in order to more fully utilize the GPU.
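
For example, one way to use more threads while also reducing atomic contention (just a sketch; calcularPi_v2, iters_per_thread, and the launch numbers are illustrative names I'm introducing, reusing the __device__ variables from the code above) is to give each thread a fixed number of iterations and a private count, with a single atomicAdd per thread at the end:

__global__ void calcularPi_v2(int iters_per_thread){

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    curandState state;
    curand_init(SEED, idx, 0, &state);
    int lcount = 0;
    for (int i = 0; i < iters_per_thread; i++){
        double x = curand_uniform(&state);
        double y = curand_uniform(&state);
        if (x*x + y*y <= 1.0) lcount++;   // point inside the quarter circle
    }
    // one atomic update per thread, instead of one per iteration
    atomicAdd(&count, lcount);
    atomicAdd(&niter, iters_per_thread);
}

You could launch it with e.g. calcularPi_v2<<<64,256>>>(100); and compute pi on the host exactly as before.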

Robert Crovella