1

Is it possible to generate random numbers within a device function without preallocate all the states? I would like to generate and use them in "realtime". I need to use them for Monte Carlo simulations what are the most suitable for this purpose? The number generated below are single precision is it possible to have them in double precision?

#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>

__global__ void cudaRand(float *d_out, unsigned long seed)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init(seed, i, 0, &state);
    d_out[i] = curand_uniform(&state);
}

int main(int argc, char** argv)
{
    size_t N = 1 << 4;
    float *v = new float[N];

    float *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(float));

    // generate random numbers
    cudaRand << < 1, N >> > (d_out, time(NULL));

    cudaMemcpy(v, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    for (size_t i = 0; i < N; i++)
    {
        printf("out: %f\n", v[i]);
    }

    cudaFree(d_out);
    delete[] v;

    return 0;
}

UPDATE

#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <ctime>

__global__ void cudaRand(double *d_out)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init((unsigned long long)clock() + i, 0, 0, &state);

    d_out[i] = curand_uniform_double(&state);
}

int main(int argc, char** argv)
{
    size_t N = 1 << 4;
    double *h_v = new double[N];

    double *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(double));

    // generate random numbers
    cudaRand << < 1, N >> > (d_out);

    cudaMemcpy(h_v, d_out, N * sizeof(double), cudaMemcpyDeviceToHost);

    for (size_t i = 0; i < N; i++)
        printf("out: %f\n", h_v[i]);

    cudaFree(d_out);
    delete[] h_v;

    return 0;
}
Nerva
  • 329
  • 4
  • 13
  • 1
    I would say it is not recommandable to call `curand_init` and `curand_uniform_double` from within the same kernel for two reasons. First, you are generating pseudorandom numbers with different seeds. From the cuRAND guide: _Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences_. Second, `curand_init` initializes the pseudorandom number generator and sets all of its parameters, so I'm afraid your approach will be somewhat slow. – Vitality Oct 30 '14 at 14:33
  • @JackOLantern thank you, so is it impossible in your opinion to spawn random numbers within a device function without preallocation? – Nerva Oct 30 '14 at 17:05
  • I believe @JackOLantern is only worried about: 1. performance because of expensive calls to `curand_init` 2. about correctness of random numbers because of using different values for `seed` parameter while uzing zeroes for both `offset` and `subsequence`. My updated answer hopefully cover this topic (it is _not_ impossible). – Michal Hosala Oct 30 '14 at 18:22

1 Answers1

6

How I was dealing with the similar situation in the past, within __device__/__global__ function:

int tId = threadIdx.x + (blockIdx.x * blockDim.x);
curandState state;
curand_init((unsigned long long)clock() + tId, 0, 0, &state);

double rand1 = curand_uniform_double(&state);
double rand2 = curand_uniform_double(&state);

So just use curand_uniform_double for generating random doubles and also I believe you don't want the same seed for all of the threads, thats what I am trying to achieve by using clock() + tId instead. This way the odds of having the same rand1/rand2 in any of the two threads are close to nil.

EDIT:

However, based on below comments, proposed approach may perhaps lead to biased result:

  • JackOLantern pointed me to this part of curand documentation:

    Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences.

  • Also there is a devtalk thread devoted to how to improve performance of curand_init in which the proposed solution to speed up the curand initialization is:

    One thing you can do is use different seeds for each thread and a fixed subsequence of 0 and offset of 0.

    But the same poster is later stating:

    The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don't know of any problems like this, and even if they do exist they will most likely be rare.

So it is basically up to you whether you want better performance (as I did) or 1000% unbiased results. If that is what you desire, then solution proposed by JackOLantern is the correct one, i.e. initialize curand as:

curand_init((unsigned long long)clock(), tId, 0, &state)

Using not 0 value for offset and subsequence parameters is, however, decreasing performance. For more info on these parameters you may review this SO thread and also curand documentation.

I see that JackOLantern stated in comment that:

I would say it is not recommandable to call curand_init and curand_uniform_double from within the same kernel from two reasons ........ Second, curand_init initializes the pseudorandom number generator and sets all of its parameters, so I'm afraid your approach will be somewhat slow.

I was dealing with this in my thesis on several pages, tried various approaches to get different random numbers in each thread and creating curandState in each of the threads turned out to be the most viable solution for me. I needed to generate ~10 random numbers in each thread and among others I tried:

  • developing my own simple random number generator (Linear Congruential Generator) whose intialization was basically for free, however, the performance suffered greatly when generating numbers, so in the end having curandState in each thread turned out to be superior,
  • pre-allocating curandStates and reusing them - this was memory heavy and when I decreased number of preallocated states then I had to use non zero values for offset/subsequence parameters of curand_uniform_double in order to get rid of bias which led to decreased performance when generating numbers.

So after making thorough analysis I decided to indeed call curand_init and curand_uniform_double in each thread. The only problem was with the amount of registry that these states were occupying so I had to be careful with the block sizes not to exceed the max number of registry available to each block.

Thats what I have to say about provided solution which I was finally able to test and it is working just fine on my machine/GPU. I run the code from UPDATE section in the above question and 16 different random numbers were displayed in the console correctly. Therefore I advise you to properly perform error checking after executing kernel to see what went wrong inside. This topic is very well covered in this SO thread.

Community
  • 1
  • 1
Michal Hosala
  • 5,570
  • 1
  • 22
  • 49
  • thank you have understood perfectly! Why I still get all zeros with my updated code in your opinion? I wanted to check the generated numbers to see if everything was fine. – Nerva Oct 30 '14 at 11:23
  • I'm not sure that your `curand_init` instruction is correct. I believe you should use `curand_init((unsigned long long)clock(), tId, 0, &state)`. In your way, all the threads are accessing the same subsequence. In the latter case, all the threads are accessing different subsequences. Please, have a look at Robert Crovella's answer to my question [offset parameter of curand_init](http://stackoverflow.com/questions/25827825/offset-parameter-of-curand-init). – Vitality Oct 30 '14 at 13:00
  • Oh yeah I forgot that! But I still get just zeros. So are you telling me that on your computer works fine? (I've updated the code) – Nerva Oct 30 '14 at 13:47
  • On the first part of your remark you are correct: I simply misinterpreted your instruction (neglected the presence of `tId` in the seed). On the second part, I do not think you are correct. As mentioned by Robert Crovella, the entire generated pseudorandom sequence is divided in a certain number of uncorrelated subsequences. The second argument can be used to access a different subsequence. So, if you use `curand_init((unsigned long long)clock(), tId, 0, &state)`, then `tId == 0` will generate numbers which are statistically uncorrelated. – Vitality Oct 30 '14 at 13:56
  • No. For each seed, the overall sequence generated by the pseudorandom number generator is divided in subsequences. For example, `1, 2, 3, 4, 5, 6, 7, ...` will be subdivided in subsequences like `1, 2, 3` (first subsequence), `4, 5, 6` (second subsequence), etc. So `tId == 0` will generate `1`, while `tId == 1` will generate `4`. – Vitality Oct 30 '14 at 14:04
  • 1
    Concerning this point, let me observe that your call to `curand_init` is "non-orthodox". From the cuRAND documentation: _Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences_. From this point of view, perhaps it is more safe to use the standard call I reported above, rather than your "non-orthodox" call. Let me underline "perhaps", because I have no evidence of that, other than the quoted clue from the documentation. – Vitality Oct 30 '14 at 14:17
  • I'm not sure that we have to follow the approach in the quoted devtalk thread. The same poster is providing the following warning: _The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don't know of any problems like this, and even if they do exist they will most likely be rare._ – Vitality Oct 30 '14 at 15:57
  • If I use N threads and N floats how can I write out of bounds? – Nerva Oct 30 '14 at 18:04
  • @user3379939 please see my updated answer. Everything is working just fine for me so there has to be an eror somewhere else in your code. I'll also delete the above comments as I summed up everything in the updated answer hopefully. – Michal Hosala Oct 30 '14 at 18:15
  • I was compiling for arch_sm=20 while I have arch_sm=13. This was the problem! – Nerva Oct 30 '14 at 22:13
  • @user3379939 I am glad you managed to track down the issue, was the linked error checking SO thread useful? – Michal Hosala Oct 30 '14 at 22:18
  • I already knew that thread but thank you. Also I recommend you to use the `checkCudaErrors` function from the following file `helper_cuda.h` that is provided by nvidia and that you can find in the following directory: `C:\Program Files\NVIDIA Corporation\Installer2\CUDASamples_6.5.{8EB9659F-A17D-475C-A1C2-BB1EA3E7E080}\common\inc\helper_cuda.h`. I think that he took his answer from that file. P.S. you have to copy `helper_string.h` too and change its inclusion as well into `helper_cuda.h`. – Nerva Oct 30 '14 at 22:35