First, I don't think anyone can give you a definitive answer; we can only offer different options, and you will need to run performance measurements for your particular use case yourself to find the solution that's optimal for your specific requirements.
Some suggestions:
- The Box-Muller transform is certainly a decent way to generate Gaussian-distributed values, but it requires sine, cosine, logarithm, and square root computations. You can remove the need for sine and cosine by switching to the Marsaglia polar method. This comes at the cost of generating a larger quantity of uniformly distributed values, because the polar method rejects a fraction (roughly 21%) of the candidate pairs. Depending on how your RNG performs on your GPU, this may still work to your advantage; see the sketch after this list.
- Be careful with linear congruential generators (LCGs): they exhibit patterns that sometimes interact badly with transformation algorithms such as Box-Muller. The one you linked is an MWC (multiply-with-carry) generator, a technique related to linear congruence, so it might have similar issues. I would probably try exploring other generators. I haven't had a chance to try it myself yet, but there is a Mersenne Twister variant for GPUs which I would expect to work well for many applications. One advantage of the Mersenne Twister is that it mostly uses bitwise manipulation instructions, which tend to be very fast on GPUs, unlike integer multiplication and division.
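To make the polar-method suggestion concrete, here's a rough host-side C sketch. The xorshift32 generator is just a stand-in I picked to illustrate a bitwise-only RNG (it is not the MWC generator you linked, nor the GPU Mersenne Twister), and the function names are my own. The same structure should port almost directly into an OpenCL work-item with the state held in private memory.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative bitwise-only generator (xorshift32); the state must stay non-zero. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Uniform value in (-1, 1), built from the top 24 bits of the generator output. */
float uniform_pm1(uint32_t *state)
{
    return 2.0f * ((xorshift32(state) >> 8) * (1.0f / 16777216.0f)) - 1.0f;
}

/* Marsaglia polar method: two independent standard normal samples per accepted
 * pair, using only log and sqrt (no sin/cos). Roughly 21% of candidate pairs
 * are rejected, which is why it consumes more uniforms than Box-Muller. */
void gauss_polar_pair(uint32_t *state, float *z0, float *z1)
{
    float u, v, s;
    do {
        u = uniform_pm1(state);
        v = uniform_pm1(state);
        s = u * u + v * v;
    } while (s >= 1.0f || s == 0.0f);

    float f = sqrtf(-2.0f * logf(s) / s);
    *z0 = u * f;
    *z1 = v * f;
}

int main(void)                      /* compile with e.g. cc polar.c -lm */
{
    uint32_t state = 0x12345678u;   /* would be a distinct per-work-item seed on a GPU */
    for (int i = 0; i < 4; ++i) {
        float z0, z1;
        gauss_polar_pair(&state, &z0, &z1);
        printf("%f  %f\n", z0, z1);
    }
    return 0;
}
```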
There are definitely plenty of libraries out there, but I'll point out that for best performance you'll probably want to keep the random number generation running in the same OpenCL work-item as the code that consumes the samples. Writing the samples out to a memory buffer and reading them back in a separate pass puts a strain on memory bandwidth, although if your subsequent processing code is heavily ALU/FPU bound this might not matter.
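As a sketch of that structure (the kernel name and arguments are made up for illustration, and an OpenCL C port of the gauss_polar_pair routine above is assumed to be defined in the same program), the generator state lives in private memory and every sample is consumed on the spot, so only the final per-work-item result ever touches global memory:

```c
/* Assumed to be defined earlier in the same .cl source: an OpenCL C port of
 * the gauss_polar_pair routine sketched above. */
void gauss_polar_pair(uint *state, float *z0, float *z1);

__kernel void accumulate_normals(__global const uint *seeds,
                                 __global float *results,
                                 const uint pairs_per_item)
{
    size_t gid = get_global_id(0);
    uint state = seeds[gid];        /* one independent, non-zero seed per work-item */
    float acc = 0.0f;

    for (uint i = 0; i < pairs_per_item; ++i) {
        float z0, z1;
        gauss_polar_pair(&state, &z0, &z1);
        acc += z0 + z1;             /* stand-in for the real per-sample computation */
    }
    results[gid] = acc;             /* only the final value is written to global memory */
}
```

One thing to keep in mind: the rejection loop in the polar method introduces some branch divergence between work-items, which is one more trade-off to weigh against plain Box-Muller when you measure.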
As with any random number generation, testing is key - at the very least, plot a histogram of the samples your code generates and overlay it with the theoretical distribution function you're trying to obtain and visually inspect it to make sure it looks reasonable.
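As a starting point for that check, here's a small C sketch (again using the hypothetical gauss_polar_pair routine from above) that buckets a large number of samples and prints the observed count per bin next to the count a standard normal distribution would predict:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* From the earlier sketch (hypothetical helper): assumed to be pasted above
 * or compiled into the same program. */
void gauss_polar_pair(uint32_t *state, float *z0, float *z1);

#define NBINS    40
#define NSAMPLES 1000000
#define XMIN     (-4.0f)
#define XMAX     ( 4.0f)

/* Standard normal density, used for the expected count per bin. */
double normal_pdf(double x)
{
    return exp(-0.5 * x * x) * 0.3989422804014327;  /* 1/sqrt(2*pi) */
}

int main(void)
{
    long hist[NBINS] = {0};
    double width = (XMAX - XMIN) / NBINS;
    uint32_t state = 0x12345678u;

    for (long i = 0; i < NSAMPLES; i += 2) {
        float z[2];
        gauss_polar_pair(&state, &z[0], &z[1]);
        for (int k = 0; k < 2; ++k) {
            int bin = (int)floorf((z[k] - XMIN) / (float)width);
            if (bin >= 0 && bin < NBINS)
                hist[bin]++;
        }
    }

    /* Observed vs. expected counts per bin; plot the two columns or eyeball them. */
    for (int b = 0; b < NBINS; ++b) {
        double centre   = XMIN + (b + 0.5) * width;
        double expected = (double)NSAMPLES * normal_pdf(centre) * width;
        printf("%7.3f  observed %8ld  expected %10.1f\n", centre, hist[b], expected);
    }
    return 0;
}
```

Once the plot looks sane, a chi-squared test on those per-bin counts is the obvious next step if you want something more rigorous than eyeballing.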