
THIS QUESTION IS ABOUT C CODE TARGETED AT AVX/AVX2 INSTRUCTIONS, as shipped in Intel processors since 2011 (AVX2 since 2013, and/or AVX-512 in some processors since 2016).

How do I generate one million random Gaussian unit normals fast on Intel processors with new instruction sets?

More generic versions of this question have been asked a few times before, e.g., in Generate random numbers following a normal distribution in C/C++. Yes, I know about Box-Muller, summing uniforms, and other techniques. I am tempted to build my own inverse normal distribution table, map sample points exactly to their expected quantiles (pseudo-normals, then), and then randomly shuffle their order.

But I also know I am using an Intel Core processor with the recent AVX vector and AES instruction sets. Besides, I need C (not C++ with its std library), and it needs to work on Linux and OS X with gcc.

So, is there a better processor-specific way to generate so many random numbers fast? For such large quantities of random numbers, does Intel processor hardware even offer useful instructions? Are they an option worth looking into, and if so, is there an existing standard function implementation of "rnorm"?

ivo Welch
  • How fast do you need to generate a million random normal values? Taking the basic code for Box-Muller from Wikipedia, using `erand48()` to generate the numbers (primarily so as not to interfere with any other sequences of random numbers), and running on a MacBook Pro (mid-2011; 2.3 GHz Intel Core i7; 16 GiB 1333 MHz DDR3 memory) running Yosemite 10.10.4 with GCC 5.1.0, I timed 100 million random values in 4.1-4.2 seconds. I previously tested that algorithm. Using new code for the polar form, I got 100 million random values in 3.1-3.2 seconds. I'm now reasonably confident in the result. – Jonathan Leffler Jul 04 '15 at 22:31
  • Does your CPU support only AVX, or AVX2 with support for FMA (fused multiply-add)? What problems did you encounter with Box-Muller? Is the data type for the generated numbers 'float' or 'double'? – njuffa Jul 05 '15 at 00:00
  • Please check http://www.doc.ic.ac.uk/~wl/papers/07/csur07dt.pdf. For the fastest implementation using a vector unit like AVX/AVX2, I would suggest looking into the Wallace method. – Severin Pappadeux Jul 05 '15 at 00:59
  • If mention of AVX implies a desire to vectorize the computation, I would suggest asking that explicitly to prevent closing of the question as a duplicate. Box-Muller has the advantage of being branchless and thus trivial to vectorize, provided the needed math functions are available in vectorized form. Apple's Accelerate framework provides `vvsqrtf`, `vvlogf`, `vvsinpif`, and `vvcospif`, which are precisely such functions. Auto-vectorizable source code for `sincospif()` is relatively easy to create by hand. The problem I see is finding an easily vectorizable source of uniform random numbers. – njuffa Jul 05 '15 at 06:12
  • The TLLV paper referenced is very nice (brief summary: Wallace and Ziggurat are fastest). I also need Linux, so Apple-only does not work. But most of all, I wonder if one can do an order of magnitude better with modern Intel processor hardware over generic algorithms. – ivo Welch Jul 05 '15 at 07:51
  • @ivoWelch With multiply-and-add and carefully tuned sizes of the mixing matrices (so they are set to the maximum vector size for a given vector unit), Wallace would be fastest; pretty sure about it. Whether it would be an order of magnitude faster, well, one would have to write code and check. – Severin Pappadeux Jul 05 '15 at 17:33
  • @ivoWelch Conventional wisdom says Wallace and Ziggurat work well for serial implementations, not so great for vectorization. The Intel compiler provides vector math functions, but you intend to use gcc. I wrote C99 code for `sincospif()` that auto-vectorizes well using AVX as target. Got stuck on autovectorizable `logf()` due to the need for `frexpf()` functionality. The problem seems to be the lack of integer operations in AVX; they are there in AVX2, thus my question whether your machine has AVX2. Certainly not an insurmountable obstacle, but it requires more time to resolve than I was willing to invest. – njuffa Jul 05 '15 at 18:04
  • @njuffa `Wallace and Ziggurat work well for serial implementations, not so great for vectorization.` That's where I disagree. For Wallace, fast vector and matrix operations should be a big win, up to the size of the vector unit. And if one has multiply-and-add, an orthogonal matrix transformation would be even better. – Severin Pappadeux Jul 05 '15 at 18:24
  • Using 1 million instances of a simple PRNG with only 48 bits of state, I can generate 558M single-precision uniformly distributed random numbers per second on one core of a Xeon E3 1270 v2 (IvyBridge), vectorizing the loop with the Intel compiler. If I add the Box-Muller transform, I can generate 284M single-precision normally distributed random numbers. The vectorized `sqrtf` and `logf` for this come from Intel's VML, while the `sincospif` implementation is my own code (< 1.8 ulp of error). So adding Box-Muller roughly halved the performance of the uniform PRNG. – njuffa Jul 05 '15 at 20:48
  • Sorry, misspoke. The PRNG I used earlier has 96 bits of state. Since it makes heavy use of shifts, and AVX does not have shift instructions, the Intel compiler moves the computation into SSE registers (XMM) when vectorizing. I have replaced the PRNG with Marsaglia's KISS generator, which has 128 bits of state per instance. The throughput of normally distributed random 'float' numbers on my CPU is roughly the same as before, 276M/sec. – njuffa Jul 05 '15 at 21:33
  • One not-mentioned way is to work from the pdf: calculate a table of exponentials, then repeat each x value in this table f(x) times [according to the value of the density], and then reshuffle them randomly. Some extra magic for the edges may be needed, possibly for the slope of the function. Alas, I have not found any existing vector implementations. And FWIW, the GNU GSL normal Box-Muller rand speed is terrible, because it wants to be reentrant. – ivo Welch Jul 13 '15 at 09:29
  • BTW, I don't think `rdrand` (hardware RNG) helps much. A SIMD PRNG will probably do better than getting hardware randomness into integer registers. – Peter Cordes Mar 01 '18 at 19:08
  • The obvious approach to vectorization is to run 8 separate PRNGs in parallel, in the 8 elements of a `__m256` vector. Then you can do whatever you want with pure vertical SIMD, like njuffa's comments suggest. e.g. make a SIMD implementation of any of the algorithms in [Generate random numbers following a normal distribution in C/C++](https://stackoverflow.com/questions/2325472/generate-random-numbers-following-a-normal-distribution-in-c-c). (I agree this isn't exactly a duplicate, but I think the answer is just do that in parallel with no interaction between vector elements.) – Peter Cordes Mar 02 '18 at 00:39

0 Answers