My motivation: I am using an algorithm to model population dynamics, and I wish to use CUDA so that I can handle a large number of nodes in the numerical simulations. Although this is the first time I am running code on a GPU, results look promising so far.
Context: I need to account for stochastic noise, which plays a crucial role in the evolution of the complex systems I aim to study. As far as I understand, random number generation in CUDA can be quite troublesome compared to the same operation on a CPU. From the documentation I gather that one has to store the states of the RNG and keep feeding them to the kernel (__global__ function) that needs to generate and use the random numbers. I found these examples quite enlightening; is there anything else you would recommend I read about this?
Question: What is the advantage of generating n seed values, storing them in an array in device global memory, and feeding them to the kernel, which in turn generates a couple of random numbers per thread to use, as opposed to generating 2n random numbers, storing them in device global memory, and feeding them directly to the kernel that needs to use them? I must be missing something truly important here, because it certainly looks to me like one would save resources in the second case (which is never used in examples). It also seems that one would be safer regarding the distribution of the generated numbers.
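To make the first approach concrete, here is a minimal sketch of the pattern I mean, based on my reading of the documentation (the kernel names, the states array, and the out buffer are just placeholders I made up; the per-thread state lives in global memory between kernel calls):

#include <curand_kernel.h>

__global__ void init_states(unsigned long long seed, curandState *states, int m){
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(idx < m){
        // same seed, one subsequence per thread, so the streams do not overlap
        curand_init(seed, idx, 0, &states[idx]);
    }
}

__global__ void use_states(curandState *states, float *out, int m){
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(idx < m){
        curandState local = states[idx];   // copy the state to registers
        float r1 = curand_uniform(&local); // first uniform in (0,1]
        float r2 = curand_uniform(&local); // second uniform
        out[2*idx]   = r1;
        out[2*idx+1] = r2;
        states[idx] = local;               // write the advanced state back for the next call
    }
}

// host side (sketch): cudaMalloc(&states, m*sizeof(curandState));
// launch init_states once, then use_states at every time step

Below, by contrast, I use the host API to generate all the numbers for a time step in one call, and the kernel just reads them.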
My code is rather long, but I tried to condense it into a short example of what I need.
My code:
#include <cstdlib>
#include <stdio.h>
#include <cuda.h>
#include <curand.h>
#include <math.h>
__global__ void update(int n, float *A, float *B, float p, float q, float *rand){
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int n_max = n*n;
    int i, j;
    i = idx/n;     // row index
    j = idx - i*n; // column index
    float status;
    // A, B symmetric
    // diagonal untouched, only need 2 random numbers per thread
    // i.e. n*(n-1) random numbers in total
    // idx_rand packs the strict upper triangle (j > i) into a linear index,
    // e.g. for n = 4: (0,1)->0, (0,2)->1, (0,3)->2, (1,2)->3, (1,3)->4, (2,3)->5
    int idx_rand = (2*n-1-i)*i/2 + j-1-i;
    if(idx < n_max && j > i){
        if(rand[idx_rand] < p){
            status = A[idx];
            if(status == 1){
                // second half of the buffer holds the other n*(n-1)/2 numbers
                if(rand[idx_rand + n*(n-1)/2] < q){
                    B[idx] = -1.0f;
                    B[i+n*j] = -1.0f; // mirror element, keeps B symmetric
                }
            }
            else if(status == 0){
                if(rand[idx_rand + n*(n-1)/2] < q){
                    B[idx] = 1.0f;
                    B[i+n*j] = 1.0f;
                }
            }
        }
    }
}
__global__ void fill(float *A, int n, float num){
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(idx < n){
        A[idx] = num;
    }
}
void swap(float** a, float** b) {
    float* temp = *a;
    *a = *b;
    *b = temp;
}
int main(int argc, char* argv[]){
    if(argc < 4){
        fprintf(stderr, "usage: %s seed n t_max\n", argv[0]);
        return 1;
    }
    int t, n, t_max, seed;
    seed = atoi(argv[1]);
    n = atoi(argv[2]);
    t_max = atoi(argv[3]);
    int blockSize = 256;
    int nBlocks = n*n/blockSize + ((n*n)%blockSize == 0 ? 0 : 1);
    // host-API generator: random numbers are produced in bulk on the device
    curandGenerator_t prng;
    curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(prng, (unsigned long long) seed);
    float *h_A = (float *)malloc(n * n * sizeof(float));
    float *h_B = (float *)malloc(n * n * sizeof(float));
    float *d_A, *d_B, *d_rand;
    cudaMalloc(&d_A, n * n * sizeof(float));
    cudaMalloc(&d_B, n * n * sizeof(float));
    cudaMalloc(&d_rand, n * (n-1) * sizeof(float));
    fill <<< nBlocks, blockSize >>> (d_A, n*n, 0.0f);
    fill <<< nBlocks, blockSize >>> (d_B, n*n, 0.0f);
    for(t = 1; t <= t_max; t++){
        // generate the n*(n-1) random numbers needed for this time step
        curandGenerateUniform(prng, d_rand, n*(n-1));
        // update B
        update <<< nBlocks, blockSize >>> (n, d_A, d_B, 0.5f, 0.5f, d_rand);
        // do more stuff
        swap(&d_A, &d_B);
    }
    cudaMemcpy(h_A, d_A, n*n*sizeof(float), cudaMemcpyDeviceToHost);
    // print stuff
    curandDestroyGenerator(prng);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_rand);
    free(h_A);
    free(h_B);
    return 0;
}
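In case it matters, I compile with something like
nvcc my_code.cu -o my_code -lcurand
(as far as I can tell, the host API requires linking against libcurand; my_code.cu is just a placeholder file name).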
I would like you to tell me what is wrong with it (and give me some hints on how to fix it). It would also be great if the experts could estimate how much running time I could save in the best-case scenario, after all the performance tweaks they can think of, because I have several tasks on my hands right now, so the cost-benefit in terms of "studying time" is very important to me.
And this is it, thanks for reading!
Just for the record, my hardware specs are below. I plan to use Amazon EC2 for this at some point, though.
My (current) hardware:
Device 0: "GeForce 8800 GTX"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(16) Multiprocessors, ( 8) CUDA Cores/MP: 128 CUDA Cores
GPU Clock rate: 1350 MHz (1.35 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 384-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 7 / 0