
My motivation: I am using an algorithm to model population dynamics and I wish to use CUDA so that I can consider a large number of nodes in the numerical simulations. Although this is the first time I am running code on a GPU, results look promising so far.

Context: I need to account for stochastic noise, which plays a crucial role in the evolution of the complex systems I aim to study. As far as I understand, random number generation in CUDA can be quite troublesome compared to the same operation on a CPU. In the documentation I see that one has to store the states of the RNG and keep feeding these states to the kernel (global function) that needs to (generate and) use the random numbers. I found these examples quite enlightening; is there anything else you would recommend I read about this?
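
For concreteness, this is roughly the pattern I mean — a minimal sketch based on my reading of the device API docs (`setup_states` and `draw_uniform` are just placeholder names, not code I am running):

    #include <curand_kernel.h>

    // One RNG state per thread, stored in global memory and initialized once.
    __global__ void setup_states(curandState *states, unsigned long long seed, int n){

        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx<n){
            curand_init(seed, idx, 0, &states[idx]);
        }
    }

    // A kernel that draws on the fly: load the state, generate, write it back.
    __global__ void draw_uniform(curandState *states, float *out, int n){

        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx<n){
            curandState local = states[idx];   // load this thread's state
            out[idx] = curand_uniform(&local); // uniform in (0, 1]
            states[idx] = local;               // store the advanced state back
        }
    }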

Question: What is the advantage of generating n seed values, storing them in an array in device global memory, and feeding them to the kernel, which then generates a couple of random numbers to use, as opposed to generating 2n random numbers, storing them in device global memory, and feeding them directly to the kernel that needs to use them? I must be missing something truly important here, because it certainly looks to me like one would save resources in the second case (which is never used in examples). It also seems that one would be safer regarding the distribution of the generated numbers.

My code is rather long but I tried to make a short example of what I need. Here it is:

My code:

    #include <cstdlib>
    #include <stdio.h>
    #include <cuda.h>
    #include <curand.h>
    #include <math.h>

    __global__ void update (int n, float *A, float *B, float p, float q, float *rand){

        int idx = blockIdx.x*blockDim.x + threadIdx.x;

        int n_max=n*n;

        int i, j;
        i=idx/n; //col
        j=idx-i*n; //row

        float status;

        //A, B symmetric
        //diagonal untouched, only need 2 random numbers per thread
        //i.e. n*(n-1) random numbers in total
        int idx_rand = (2*n-1-i)*i/2+j-1-i;

        if(idx<n_max && j>i){

            if(rand[idx_rand]<p){

                status=A[idx];

                if(status==1){
                    if(rand[idx_rand+n*(n-1)/2] < q){
                        B[idx]=-1.0f;
                        B[i+n*j]=-1.0f;

                    }
                }
                else if(status==0){
                    if(rand[idx_rand+n*(n-1)/2] < q){
                        B[idx]=1.0f;
                        B[i+n*j]=1.0f;

                    }
                }
            }

        }   
    }

    __global__ void fill(float *A, int n, float num){

        int idx = blockIdx.x*blockDim.x + threadIdx.x;

        if(idx<n){
            A[idx]=num;
        }
    }

    void swap(float** a, float** b) {

        float* temp = *a;
        *a = *b;
        *b = temp;
    }

    int main(int argc, char* argv[]){

        int t, n, t_max, seed;

        seed    = atoi(argv[1]);
        n   = atoi(argv[2]);
        t_max   = atoi(argv[3]);

        int blockSize = 256;
        int nBlocks = n*n/blockSize + ((n*n)%blockSize == 0?0:1);

        curandGenerator_t prng;
        curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(prng, (unsigned long long) seed);

        float *h_A  = (float *)malloc(n * n * sizeof(float));
        float *h_B  = (float *)malloc(n * n * sizeof(float));

        float *d_A, *d_B, *d_rand;  

        cudaMalloc(&d_A, n * n * sizeof(float));
        cudaMalloc(&d_B, n * n * sizeof(float));
        cudaMalloc(&d_rand, n * (n-1) * sizeof(float));

        fill <<< nBlocks, blockSize >>> (d_A, n*n, 0.0f);
        fill <<< nBlocks, blockSize >>> (d_B, n*n, 0.0f);

        for(t=1; t<t_max+1; t++){

            //generate random numbers
            curandGenerateUniform(prng, d_rand, n*(n-1));
            //update B
            update <<< nBlocks, blockSize >>> (n, d_A, d_B, 0.5f, 0.5f, d_rand);

            //do more stuff

            swap(&d_A, &d_B);

        }   

        cudaMemcpy(h_A, d_A, n*n*sizeof(float),cudaMemcpyDeviceToHost);
        //print stuff

        curandDestroyGenerator(prng);

        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_rand);
        free(h_A);
        free(h_B);

        return 0;
    }

I would like you to tell me what is wrong with it (and give some hints on how to fix it). If the experts could also tell me how much running time I can expect to save in the best-case scenario, after all the performance tweaks they can think of, that would be great: I have several tasks on my hands right now, so cost-benefit in terms of "studying time" is very important to me.

And this is it, thanks for reading!

Just for the record, my hardware specs are below. I plan to use Amazon EC2 for this at some point, though.

My (current) hardware:

    Device 0: "GeForce 8800 GTX"
    CUDA Driver Version / Runtime Version          5.5 / 5.5
    CUDA Capability Major/Minor version number:    1.0
    Total amount of global memory:                 768 MBytes (804978688 bytes)
    (16) Multiprocessors, (  8) CUDA Cores/MP:     128 CUDA Cores
    GPU Clock rate:                                1350 MHz (1.35 GHz)
    Memory Clock rate:                             900 Mhz
    Memory Bus Width:                              384-bit
    Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
    Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
    Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
    Total amount of constant memory:               65536 bytes
    Total amount of shared memory per block:       16384 bytes
    Total number of registers available per block: 8192
    Warp size:                                     32
    Maximum number of threads per multiprocessor:  768
    Maximum number of threads per block:           512
    Max dimension size of a thread block (x,y,z): (512, 512, 64)
    Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
    Maximum memory pitch:                          2147483647 bytes
    Texture alignment:                             256 bytes
    Concurrent copy and kernel execution:          No with 0 copy engine(s)
    Run time limit on kernels:                     Yes
    Integrated GPU sharing Host Memory:            No
    Support host page-locked memory mapping:       No
    Alignment requirement for Surfaces:            Yes
    Device has ECC support:                        Disabled
    Device supports Unified Addressing (UVA):      No
    Device PCI Bus ID / PCI location ID:           7 / 0
dd_rlwll

2 Answers


In general, random number generation is a process that is amenable to parallelization on the GPU, and can therefore take advantage of the GPU to generate numbers more quickly, in many cases, than they can be generated on the CPU. This is the principal motivation for using an API/library like CURAND.

What is the advantage of generating n seed values, storing them in an array in device global memory, and feeding them to the kernel, which then generates a couple of random numbers to use, as opposed to generating 2n random numbers, storing them in device global memory, and feeding them directly to the kernel that needs to use them?

Both are valid approaches and can take advantage of GPU acceleration: either generate the numbers up front and store them, or else generate them on the fly.

Some of the reasons you might want to consider one approach over the other are:

  1. Generating the numbers up front is only useful if you know how many (or an upper bound on how many) you will need. If your algorithm is quite variable (perhaps in the presence of different data sets), this might be difficult to determine (see the sketch after this list).
  2. Storage for the numbers generated could be an issue. For some types of algorithms (e.g. Monte Carlo simulation) it might be necessary to generate so many random numbers, that doing it up front and storing them all might be prohibitive. In these cases, generating them on the fly might allow you to bypass the need for large amounts of random number storage.
  3. It might be possible to get slightly better machine utilization by generating the numbers on the fly, so as to avoid the overhead of an additional kernel call to generate the numbers prior to them being used.
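
A minimal sketch of the first point, assuming a `curandState` array already initialized with `curand_init` (the `rejection_sample` kernel and its 0.5 threshold are purely illustrative, not part of the question's algorithm): each thread keeps drawing until a value is accepted, so the total number of uniforms consumed cannot be known up front.

    #include <curand_kernel.h>

    // Each thread draws an unpredictable number of uniforms (a simple rejection loop),
    // which is easy with the device API but awkward to pre-generate and store.
    __global__ void rejection_sample(curandState *states, float *out, int n){

        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx<n){
            curandState local = states[idx];   // per-thread state, initialized elsewhere
            float x;
            do {
                x = curand_uniform(&local);    // draw as many numbers as it takes
            } while (x >= 0.5f);               // until one lands in the accepted region
            out[idx] = x;
            states[idx] = local;               // write the advanced state back
        }
    }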

Again, CURAND's principal benefit is performance. If random number generation is a small part of the overall computational cost of your application, it may not matter which approach you use, or even whether you use CURAND at all (e.g. in lieu of an ordinary CPU-based method of RNG).

Robert Crovella
  • The random number generation is not a small part of the overall computational cost of my application; I hope I did not give that impression in my post. However, I know exactly how many random numbers I need per time step. This number scales with n^2, with typical n < 5000. This is why I decided (naively?) to implement the algorithm in this way. I also thought that using the GPU would save me a lot of time, because it writes to global device memory faster than the host does, and the update() kernel runs on the device (bear in mind that the whole block of random numbers is generated once per time step). – dd_rlwll Oct 30 '13 at 20:21
  • I didn't say there was anything wrong with your method. I'm trying to give general guidance. If the number of random numbers needed is well-defined, I don't think there will be much perf difference between generating them up front or on-the-fly. Certainly the GPU writes to GPU memory faster than the host, and so RNG on the GPU makes sense for numbers that will be consumed in GPU code. – Robert Crovella Oct 30 '13 at 20:28
  • "I'm trying to give general guidance." Which I appreciated very much, thanks! :) – dd_rlwll Oct 30 '13 at 20:33

When using the cuRAND host API as shown in your code, you have to first generate some random numbers, store them in global memory, and then load them in the working kernel. Storing and loading will take some bandwidth, and thus may hurt performance.

However, if you use the device API as shown in your link, you can save the bandwidth spent storing and loading those random numbers.

In both cases, you will have to load and store the same amount of RNG state data.
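
For example, a rough sketch of how the question's `update` kernel could draw its two numbers with the device API instead of reading a precomputed `rand` array (this assumes one `curandState` per upper-triangular entry, initialized elsewhere with `curand_init`; it is a sketch, not a tested drop-in replacement):

    #include <curand_kernel.h>

    __global__ void update_devapi (int n, float *A, float *B, float p, float q, curandState *states){

        int idx = blockIdx.x*blockDim.x + threadIdx.x;

        int n_max=n*n;

        int i, j;
        i=idx/n;
        j=idx-i*n;

        if(idx<n_max && j>i){

            // same upper-triangle packing as idx_rand in the question,
            // but indexing per-thread RNG states instead of precomputed floats
            int idx_state = (2*n-1-i)*i/2+j-1-i;

            curandState local = states[idx_state]; // load the state once
            float r1 = curand_uniform(&local);     // replaces rand[idx_rand]

            if(r1<p){

                float r2 = curand_uniform(&local); // replaces rand[idx_rand+n*(n-1)/2]
                float status=A[idx];

                if(status==1 && r2<q){
                    B[idx]=-1.0f;
                    B[i+n*j]=-1.0f;
                }
                else if(status==0 && r2<q){
                    B[idx]=1.0f;
                    B[i+n*j]=1.0f;
                }
            }

            states[idx_state] = local;             // store the advanced state back
        }
    }

Whether this actually moves fewer bytes per time step depends on how large `curandState` is compared to the two floats it replaces, so it is worth measuring rather than assuming.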

kangshiyin
  • "Storing and loading will take some bandwidth, thus may hurt the performance." If the code generates all the numbers it needs upfront, one seed value is enough. Otherwise, this does not apply and one has to constantly feed a new seed and status to the kernel. Isn't this seed-feeding process coming from global memory and bandwidth consuming as well? I am sorry if I am confusing everything here. Thanks a lot for your reply. – dd_rlwll Oct 30 '13 at 20:10
  • Sorry, pressed the wrong button and did not add my actual comment. Please read the edit. :) – dd_rlwll Oct 30 '13 at 20:15
  • There is no difference in seeding or state-generation characteristics between the two methods. Using a multithreaded approach, most GPU RNGs require one state structure per thread, which is updated for each new random number generated by that thread. The initial generation of the state structure consumes the random seed for that thread. This is true whether you are using the host API or the device API, and whether you are generating and storing, or generating on the fly. – Robert Crovella Oct 30 '13 at 20:23
  • The exact same seed-feeding process exists in the host API call. That's why there is a variable of the type `curandGenerator_t`: it holds the whole RNG state, rather than a single seed value, but it can be initialized by a single seed value. And one seed value is not enough for parallel random number generation. – kangshiyin Oct 30 '13 at 20:24
  • "That's why there's a variable of the type curandGenerator_t. It holds the whole RNG state, rather than a single seed value." This is what I was missing. Thanks a lot Eric and Robert! How can I mark both your replies as answering my question? I upvoted both already. – dd_rlwll Oct 30 '13 at 20:26