17

I've struggled with this all day; I am trying to get a random number generator for threads in my CUDA code. I have looked through all the forums, and yes, this topic comes up a fair bit, but I've spent hours trying to unravel all sorts of code to no avail. If anyone knows of a simple method, probably a device function that can be called to return a random float between 0 and 1, or an integer that I can transform, I would be most grateful.

Again, I hope to use the random number in the kernel, just like rand() for instance.

Thanks in advance

Ashwin Nanjappa
zenna

9 Answers

16

For anyone interested, you can now do it via cuRAND.
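As a minimal sketch of the cuRAND device API (the kernel name, seed argument, and launch indexing here are placeholders, not part of cuRAND): each thread initializes its own curandState and then draws uniform floats in (0, 1].

#include <curand_kernel.h>

__global__ void draw_uniforms(float *out, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // One generator state per thread: same seed, different subsequence.
    curandState state;
    curand_init(seed, tid, 0, &state);

    // curand_uniform returns a float in (0.0, 1.0].
    out[tid] = curand_uniform(&state);
}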

Kostis
4

The best way to do this is to write your own device function; here is one:

#include <cstdio>

__device__ void RNG()
{
    unsigned int m_w = 150;
    unsigned int m_z = 40;

    for (int i = 0; i < 100; i++)
    {
        m_z = 36969 * (m_z & 65535) + (m_z >> 16);
        m_w = 18000 * (m_w & 65535) + (m_w >> 16);

        printf("%u\n", (m_z << 16) + m_w);  /* 32-bit result */
    }
}

It will print 100 pseudorandom 32-bit numbers.

If you want random numbers between 0 and 999, you can take the result % 1000, either at the point of consumption or at the point of generation:

((m_z << 16) + m_w)%1000

Changing the m_w and m_z starting values (in the example, 150 and 40) lets you get different results each time. You can use threadIdx.x as one of them, which should give each thread a different pseudorandom sequence.

I wanted to add that it runs about twice as fast as the rand() function, and it works great ;)
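As a minimal sketch (my adaptation, not the answer's original code), the same multiply-with-carry step can be wrapped in a per-thread device function that returns a float in [0, 1); the per-thread seeding from the thread and block indices below is an assumption for illustration and inherits the quality caveats noted in the comment below:

/* Multiply-with-carry step with per-thread state passed by reference. */
__device__ float mwc_uniform(unsigned int &m_w, unsigned int &m_z)
{
    m_z = 36969 * (m_z & 65535) + (m_z >> 16);
    m_w = 18000 * (m_w & 65535) + (m_w >> 16);
    unsigned int r = (m_z << 16) + m_w;      /* 32-bit result */
    return r * (1.0f / 4294967296.0f);       /* scale to [0, 1) */
}

__global__ void kernel_using_mwc(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int m_w = 150 + tid;            /* per-thread starting values */
    unsigned int m_z = 40 + blockIdx.x;
    out[tid] = mwc_uniform(m_w, m_z);
}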

qdot
Hayk Nahapetyan
    Editor's comment: It's a neat little RNG, but it's nowhere near usable for computation where good random numbers are required. It has a period of 2^16, it cannot span the entire 32bit number space, and it is fairly easily reversible due to a simple step and small period. – qdot Sep 01 '12 at 18:34
4

I think any discussion of this question needs to answer Zenna's original request, and that is for a thread-level implementation: specifically a device function that can be called from within a kernel or thread. Sorry if I overdid the "in bold" phrases, but I really think the answers so far are not quite addressing what is being sought here.

The cuRAND library is your best bet. I appreciate that people want to reinvent the wheel (it makes one appreciate and more properly use third-party libraries), but high-performance, high-quality number generators are plentiful and well tested. The best info I can recommend is the documentation for the GSL library on the different generators, here: http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html

For any serious code it is best to use one of the main algorithms that mathematicians and computer scientists have run into the ground over and over looking for systematic weaknesses. The Mersenne Twister has a period (repeat loop) on the order of 10^6000 (the MT19937 algorithm is named for its period of 2^19937 - 1) and has been specially adapted by Nvidia to use at a thread level within threads of the same warp, using thread id calls as seeds. See the paper here: http://developer.download.nvidia.com/compute/cuda/2_2/sdk/website/projects/MersenneTwister/doc/MersenneTwister.pdf. I am actually working to implement something using this library, and if I get it to work properly I will post my code. Nvidia has some examples at their documentation site for the current CUDA toolkit.

NOTE: Just for the record I do not work for Nvidia, but I will admit their documentation and abstraction design for CUDA is something I have so far been impressed with.


opetrenko
4

Depending on your application you should be wary of using LCGs without considering whether the streams (one stream per thread) will overlap. You could implement a leapfrog with LCG, but then you would need to have a sufficiently long period LCG to ensure that the sequence doesn't repeat.

An example leapfrog could be:

template <typename ValueType>
__device__ void leapfrog(unsigned long &a, unsigned long &c, int leap)
{
    unsigned long an = a;
    for (int i = 1 ; i < leap ; i++)
        an *= a;
    c = c * ((an - 1) / (a - 1));
    a = an;
}

template <typename ValueType>
__device__ ValueType quickrand(unsigned long &seed, const unsigned long a, const unsigned long c)
{
    seed = seed * a + c;
    return seed;
}

template <typename ValueType>
__global__ void mykernel(
    unsigned long *d_seeds)
{
    // Thread and block indices (used below as tid and bid)
    int tid = threadIdx.x;
    int bid = blockIdx.x;

    // RNG parameters
    unsigned long a = 1664525L;
    unsigned long c = 1013904223L;
    unsigned long ainit = a;
    unsigned long cinit = c;
    unsigned long seed;

    // Generate local seed
    seed = d_seeds[bid];
    leapfrog<ValueType>(ainit, cinit, tid);
    quickrand<ValueType>(seed, ainit, cinit);
    leapfrog<ValueType>(a, c, blockDim.x);

    ...
}

But then the period of that generator is probably insufficient in most cases.

To be honest, I'd look at using a third party library such as NAG. There are some batch generators in the SDK too, but that's probably not what you're looking for in this case.

EDIT

Since this just got up-voted, I figure it's worth updating to mention that cuRAND, as mentioned by more recent answers to this question, is available and provides a number of generators and distributions. That's definitely the easiest place to start.

Tom
4

I'm not sure I understand why you need anything special. Any traditional PRNG should port more or less directly. A linear congruential generator should work fine. Do you have some special properties you're trying to establish?
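To illustrate that suggestion, here is a minimal per-thread linear congruential sketch (not the answerer's code); the constants are the common Numerical Recipes values, the naive per-thread seeding is an assumption, and the comments below discuss why such seeding can lead to overlapping or correlated streams:

/* LCG step: x_{n+1} = a * x_n + c (mod 2^32), one state per thread. */
__device__ unsigned int lcg_next(unsigned int &state)
{
    state = 1664525u * state + 1013904223u;
    return state;
}

__global__ void kernel_with_lcg(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int state = 1234u + tid;                     /* naive per-thread seeding */
    out[tid] = lcg_next(state) * (1.0f / 4294967296.0f);  /* scale to [0, 1) */
}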

Charlie Martin
  • I think he's looking for a library he could call, not to implement it himself. Still a good answer to point him to a solution. – lothar May 08 '09 at 02:20
  • Linear congruential is very simple to implement. You can do this with CUDA by having a separate PRNG with its own state in each thread. – Jay Conrod May 08 '09 at 02:25
  • That's what got me a little confused. Each thread would, say, be seeded from its thread id, but wouldn't they soon enough start overlapping? – zenna May 08 '09 at 02:33
  • Those random algorithms calculate x_n+1 from x_n; an attempt to use them for parallel random number creation will lead to "random" numbers with a very distinct pattern. This is because x_n+1 is a function of x_n. – Danny Varod Jun 15 '09 at 19:56
  • alifeofzen: linear dependency in the seeds is bad enough, indeed (cf. http://portal.acm.org/citation.cfm?doid=1276927.1276928); maybe you should find some other way of seeding them. Danny: The easiest (for that topic, as random numbers for parallel and distributed systems are very hard to get right) might be a series of lagged Fibonacci generators. I just can't find the paper anymore that outlined this. – Joey Jul 08 '09 at 09:40
2

There's an MDGPU package (GPL) which includes an implementation of the GNU rand48() function for CUDA here.

I found it (quite easily, using Google, which I assume you tried :-) on the NVidia forums here.

paxdiablo
  • Yeah I found that too.. but struggled to get it to do what I want to.. I think I'm just having a stupid day.. I'll check it out again, thanks – zenna May 08 '09 at 02:29
  • According to the comments in the NVidia forum (including the author's) the implementation doesn't work well. – Danny Varod Jun 15 '09 at 20:04
2

I haven't found a good parallel number generator for CUDA; however, I did find a parallel random number generator based on academic research here: http://sprng.cs.fsu.edu/

Danny Varod
  • Anyone know of a CUDA version of this algorithm? – Danny Varod Sep 20 '09 at 10:38
  • What do you mean by "good"? Depending on your requirements a simple MD5 hash (see cuDPP) may be enough. For some cases, multiple Mersenne Twisters may be best since they have a really long period and good independence between streams. NAG have l'Ecuyer's MRG32k3a which works really well if you need a single stream across multiple threads/blocks. – Tom Nov 22 '09 at 12:12
  • A good start would be a repetitive pseudo-random number generator with low dependency between the cells - suitable for creating a set of random-number arrays, filling the contents of each array with multiple threads, but creating the arrays one after the other. – Danny Varod Nov 25 '09 at 23:45
0

You could try out Mersenne Twister for GPUs

It is based on the SIMD-oriented Fast Mersenne Twister (SFMT), which is a quite fast and reliable random number generator. It passes Marsaglia's DIEHARD tests for random number generators.

mrossi
  • cuRAND already [provides mersenne twister](http://docs.nvidia.com/cuda/curand/acknowledgements.html#acknowledgements) implemented for the GPU. – Robert Crovella Jul 16 '14 at 18:16
0

In case you're using cuda.jit in Numba for Python, this random number generator is useful.