7

I have some code which performs many log, tan and cos operations on doubles. I need this to be as fast as possible. Currently I use code such as

#include <stdio.h>
#include <stdlib.h>
#include "mtwist.h"
#include <math.h>


int main(void) {
   int i;
   double x;
   mt_seed();
   double u1;
   double u2;
   double w1;
   double w2;
   x = 0;
   for(i = 0; i < 100000000; ++i) {
     u1 = mt_drand();
     u2 = mt_drand();
     w1 = M_PI*(u1-1/2.0);
     w2 = -log(u2);
     x += tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1));
   }
   printf("%f\n",x); 

   return EXIT_SUCCESS;
}

I am using gcc.

There are two obvious ways to speed this up. The first is to choose a faster RNG. The second is to speed up the transcendental functions.
To do this I would like to know

  1. How are tan and cos implemented in assembly on x86? My CPU is the AMD FX-8350 if that makes a difference. (Answer: fcos for cos and fptan for tan.)
  2. How can you use a lookup table to speed up the calculations? I only need 32 bits of accuracy. Can you use a table of size 2^16, for example, to speed up the tan and cos operations?

The Intel optimization manual says

If there is no critical need to evaluate the transcendental functions using the extended precision of 80 bits, applications should consider an alternate, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision and the size of the look-up table, and by taking advantage of the parallelism of the SSE and the SSE2 instructions.

According to this very helpful table, fcos has latency 154 and fptan has latency 166-231.
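
To make question 2 concrete, here is a minimal sketch of the kind of table-plus-interpolation scheme I have in mind (the table size, the restriction to [-pi/2, pi/2] and all the names are my own illustrative choices):

#include <math.h>

#define TBL_BITS 16
#define TBL_SIZE (1 << TBL_BITS)

static double cos_tbl[TBL_SIZE + 2]; /* +2 leaves a guard entry for x == pi/2 */
static double tbl_scale;

void cos_tbl_init(void) {
   int i;
   tbl_scale = TBL_SIZE / M_PI; /* maps [-pi/2, pi/2] onto [0, TBL_SIZE] */
   for (i = 0; i < TBL_SIZE + 2; ++i)
      cos_tbl[i] = cos(-M_PI_2 + i / tbl_scale);
}

/* Valid only for x in [-pi/2, pi/2]. */
double cos_lut(double x) {
   double t = (x + M_PI_2) * tbl_scale;
   int i = (int)t;
   double frac = t - i; /* position between adjacent entries */
   return cos_tbl[i] + frac * (cos_tbl[i + 1] - cos_tbl[i]);
}

With 2^16 entries spanning a range of pi, the linear interpolation error should be on the order of (pi/2^16)^2/8, about 3e-10, which is close to the 32 bits I need; whether something like this actually beats the hardware instructions is exactly what I am asking.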


You can compile my code using

gcc -O3 -Wall random.c mtwist-1.5/mtwist.c -lm -o random

My C code uses the Mersenne Twister RNG C code from here. You should be able to run my code to test it. If you can't, please let me know.


Update: @rhashimoto has sped up my code from 20 seconds to 6 seconds!

The RNG seems like it should be possible to speed up. However, in my tests dSFMT (http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html#dSFMT) takes exactly the same amount of time (does anyone see anything different?). If anyone can find a faster RNG (that passes all the diehard tests) I would be very grateful.

Please show real timings for any improvement you suggest as that really helps work out what does or does not work.

Simd
  • did you actually profile your code and conclude that these functions are the bottleneck, or are you just getting into premature optimisation? – Peter Varo May 23 '14 at 16:52
  • @PeterVaro The code produces random numbers from a tricky distribution. So it does very little other than pick a random number from (0,1) and apply tan and cos (and actually log). I need to optimize both these parts. I just need billions of them. – Simd May 23 '14 at 16:56
  • @user2179021 I've found that it is often more accurate and faster to do the math than do Monte Carlo to characterize such systems. – IdeaHat May 23 '14 at 17:00
  • Without checking details: a 2^16 table of base _and_ slope for each of tan, cos and log may do the trick. Might give 24+ bits of accuracy. – chux - Reinstate Monica May 23 '14 at 17:40
  • How many bits of precision are needed to represent w1? – Ken A May 23 '14 at 18:08
  • @KenA w1 is a double but I only need 32 bits of accuracy in the end. I have added more code to make this clearer. – Simd May 23 '14 at 18:57
  • But are you using extended precision? If you're using double then it's 64 bits and that comment from the Intel manual may not apply. The last time I did a performance test in an x86 processor, the math coprocessor got faster than table lookup (to my disbelief, but I double-checked) – Fabio Ceconello May 23 '14 at 19:03
  • It comes from mt_drand ultimately which is defined in mtwist. See http://fmg-www.cs.ucla.edu/geoff/tars/mtwist-1.5.tgz . "drand functions generate a double-precision number in the range [0,1) (i.e., 0 is a possible value but 1 is not). The number generated by drand has 32 bits of precision." – Simd May 23 '14 at 19:06
  • Where is the profile data that says you are spending too much time calculating the transcendental functions, as opposed to the RNG? – mctylr May 23 '14 at 20:01
  • @mctylr They both take too long. My profiling (just using gprof) says that the tan, cos and log operations take roughly the same time combined as the two calls to mt_drand. I need to speed both up. – Simd May 23 '14 at 20:03
  • Try gcc flags `-march=native -mtune=native` to optimize for your build machine. Also you don't need to compute both `cos` and `tan` from scratch - try `sincos` or `sqrt(1.0 - cosw1*cosw1)/cosw1` (the latter is faster for me). – rhashimoto May 30 '14 at 17:55
  • @rhashimoto Thank you. I added a comment to your answer about these suggestions. – Simd May 31 '14 at 19:15
  • Can you just knock a couple zeros off the end of your for loop condition? Seems like a million iterations would be sufficiently random. It would go 100x faster if you did. – phatfingers May 31 '14 at 20:23
  • x87 uses 80-bit extended precision and in most cases it's slower than a careful SSE/SSE2 design. For random number generation, SSE is also faster and can emit multiple numbers at the same time http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html https://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor – phuclv Jun 01 '14 at 02:51
  • @LưuVĩnhPhúc I tried dSFMT and it was slower! Can you try it and see if you get a speedup? You should be able to use my code easily but if you can't, please let me know. – Simd Jun 01 '14 at 06:33

8 Answers

7

You can rewrite

tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1))

as

tan(w1)*(M_PI_2-w1) + log(cos(w1)/(M_PI_2-w1)) + log(w2).

You can probably horse around with minimax polynomials for the stuff depending on w1 here. Make up 64 of them or so, each one for 1/64th of the range, and you probably only need degree 3 or 4.
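
As a concrete (untuned) sketch of the piecewise idea, the following builds 64 cubics for the w1-dependent part by interpolating at Chebyshev nodes, which is a cheap stand-in for a true minimax fit; the region near ±pi/2 is left out of the sketch because the function misbehaves at the ends of the range:

#include <math.h>
#include <stdio.h>

#define PIECES 64
#define NODES  4 /* 4 nodes per piece => cubic */

static const double LO = -1.5, HI = 1.5; /* stay clear of +-pi/2 */
static double xs[PIECES][NODES], cf[PIECES][NODES];

/* The w1-dependent part of the summand. */
static double g(double w) {
   return tan(w)*(M_PI_2 - w) + log(cos(w)/(M_PI_2 - w));
}

void pieces_init(void) {
   int p, k, j, i;
   double h = (HI - LO) / PIECES;
   for (p = 0; p < PIECES; ++p) {
      double mid = LO + (p + 0.5)*h;
      for (k = 0; k < NODES; ++k) { /* Chebyshev nodes within the piece */
         xs[p][k] = mid + 0.5*h*cos(M_PI*(2*k + 1)/(2.0*NODES));
         cf[p][k] = g(xs[p][k]);
      }
      for (j = 1; j < NODES; ++j) /* Newton divided differences */
         for (i = NODES - 1; i >= j; --i)
            cf[p][i] = (cf[p][i] - cf[p][i-1]) / (xs[p][i] - xs[p][i-j]);
   }
}

double g_approx(double w) {
   int i, p = (int)((w - LO) * (PIECES / (HI - LO)));
   double y;
   if (p >= PIECES) p = PIECES - 1; /* clamp w == HI */
   y = cf[p][NODES-1];
   for (i = NODES - 2; i >= 0; --i) /* Horner on the Newton form */
      y = y*(w - xs[p][i]) + cf[p][i];
   return y;
}

int main(void) {
   double w;
   pieces_init();
   for (w = -1.4; w < 1.5; w += 0.7)
      printf("w=%5.2f approx=%.9f exact=%.9f\n", w, g_approx(w), g(w));
   return 0;
}

Swapping the Chebyshev fits for genuine minimax coefficients buys a somewhat smaller maximum error at the same degree.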

You computed w2 as

w2 = -log(u2);

for a uniform u2 in (0,1). So you're really computing log(log(1/u2)). I bet you can use a similar trick to get piecewise polynomial approximations to log(log(1/x)) on chunks of (0,1). (The function acts scary near 0 and 1, so you might need to do something fancy there instead.)

tmyklebu
  • Thank you. I would love to hear more about this. In fact there is a bug in my code as I never want u1 or u2 to equal 0. – Simd May 23 '14 at 19:47
  • Since you can generate `u2` as `random_i2/random_period` your `-log(u2)` can be calculated as `-log(random_i2) + log(random_period)`, and `random_period` is constant. – Doug Currie May 23 '14 at 20:11
6

I like @tmyklebu's suggestion to create a minimax approximation to the overall calculation. There are some nice tools to help with this, including the Remez function approximation toolkit.

You can do a lot better than MT for speed; see for example this Dr. Dobbs article: Fast, High-Quality, Parallel Random Number Generators

Also take a look at the ACML – AMD Core Math Library to take advantage of SSE and SSE2.

Doug Currie
  • Thank you. I am told that for ultimate speed you need to create the floating point directly and not make an integer and then divide. But I don't know how to do that. – Simd May 23 '14 at 19:39
  • For conversion of integer random numbers to floats, see http://www.doornik.com/research/randomdouble.pdf – Doug Currie May 23 '14 at 19:45
  • Sollya's a thing, too. – tmyklebu May 23 '14 at 20:52
  • @tmyklebu, I was able to build Sollya on OSX pretty easily (all the dependencies except libfplll are in macports), and it looks really nice. Thanks. – Doug Currie May 27 '14 at 19:54
  • @user2179021: I haven't tried with your particular function, but applying a similar technique to `log(1 + exp(x))` and `1 / (1 + exp(-x))` (with a suitably low accuracy and smallish range) can speed up some implementations of logistic regression classifiers. Your function is a little bit nastier. – tmyklebu May 27 '14 at 21:14
5

You can try this log(x) replacement I wrote using SSE2 intrinsics:

#include <assert.h>
#include <immintrin.h>

static __m128i EXPONENT_MASK;
static __m128i EXPONENT_BIAS;
static __m128i EXPONENT_ZERO;
static __m128d FIXED_SCALE;
static __m128d LOG2ERECIP;
static const int EXPONENT_SHIFT = 52;

// Required to initialize constants.
void sselog_init() {
   EXPONENT_MASK = _mm_set1_epi64x(0x7ff0000000000000UL);
   EXPONENT_BIAS = _mm_set1_epi64x(0x00000000000003ffUL);
   EXPONENT_ZERO = _mm_set1_epi64x(0x3ff0000000000000UL);
   FIXED_SCALE = _mm_set1_pd(9.31322574615478515625e-10); // 2^-30
   LOG2ERECIP = _mm_set1_pd(0.693147180559945309417232121459); // 1/log2(e)
}

// Extract IEEE754 double exponent as integer.
static inline __m128i extractExponent(__m128d x) {
   return
      _mm_sub_epi64(
         _mm_srli_epi64(
            _mm_and_si128(_mm_castpd_si128(x), EXPONENT_MASK),
            EXPONENT_SHIFT),
         EXPONENT_BIAS);
}

// Set IEEE754 double exponent to zero.
static inline __m128d clearExponent(__m128d x) {
   return
      _mm_castsi128_pd(
         _mm_or_si128(
            _mm_andnot_si128(
               EXPONENT_MASK,
               _mm_castpd_si128(x)),
            EXPONENT_ZERO));
}

// Compute log(x) using SSE2 intrinsics to >= 30 bit precision, except denorms.
double sselog(double x) {
   assert(x >= 2.22507385850720138309023271733e-308); // no denormalized

   // Two independent logarithms could be computed by initializing
   // base with two different values, either with independent
   // arguments to _mm_set_pd() or from contiguous memory with
   // _mm_load_pd(). No other changes should be needed other than to
   // extract both results at the end of the function (or just return
   // the packed __m128d).

   __m128d base = _mm_set_pd(x, x);
   __m128i iLog = extractExponent(base);
   __m128i fLog = _mm_setzero_si128();

   base = clearExponent(base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   // fLog = _mm_slli_epi64(fLog, 10); // Not needed first time through.
   fLog = _mm_or_si128(extractExponent(base), fLog);

   base = clearExponent(base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   fLog = _mm_slli_epi64(fLog, 10);
   fLog = _mm_or_si128(extractExponent(base), fLog);

   base = clearExponent(base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   fLog = _mm_slli_epi64(fLog, 10);
   fLog = _mm_or_si128(extractExponent(base), fLog);

   // No _mm_cvtepi64_pd() exists so use _mm_cvtepi32_pd() conversion.
   iLog = _mm_shuffle_epi32(iLog, 0x8);
   fLog = _mm_shuffle_epi32(fLog, 0x8);

   __m128d result = _mm_mul_pd(_mm_cvtepi32_pd(fLog), FIXED_SCALE);
   result = _mm_add_pd(result, _mm_cvtepi32_pd(iLog));

   // Convert from base 2 logarithm and extract result.
   result = _mm_mul_pd(result, LOG2ERECIP);
   return ((double *)&result)[0]; // other output in ((double *)&result)[1]
}

The code implements an algorithm described in this Texas Instruments brief: repeatedly square the argument and concatenate the exponent bits. It will not work with denormalized inputs. It provides at least 30 bits of precision.
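
To see what the algorithm does without the intrinsics, here is a scalar sketch (mine, not from the TI brief) that extracts one bit of log2 per squaring instead of ten bits at a time:

#include <math.h>
#include <stdio.h>

/* Squaring the mantissa doubles its base-2 log, so each squaring
   pushes the next bit of log2(x) up into the integer position,
   where it can be collected. */
static double scalar_log(double x) {
   int e, i;
   long frac = 0;
   double m = frexp(x, &e); /* x = m * 2^e with m in [0.5, 1) */
   m *= 2.0; --e;           /* renormalize so m is in [1, 2) */
   for (i = 0; i < 30; ++i) {
      m *= m;               /* log2(m) doubles, now in [0, 2) */
      frac <<= 1;
      if (m >= 2.0) {       /* a whole bit of log2 appeared */
         frac |= 1;
         m *= 0.5;
      }
   }
   /* e is the integer part; frac holds 30 fractional bits of log2(x). */
   return (e + frac * 9.31322574615478515625e-10) /* 2^-30 */
          * 0.693147180559945309417232121459;     /* times ln 2 */
}

int main(void) {
   printf("scalar_log=%.10f log=%.10f\n", scalar_log(1.2345), log(1.2345));
   return 0;
}

The SIMD version above does the same thing but harvests ten bits per clearExponent/extractExponent round by letting the bits accumulate in the exponent field.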

This runs faster than log() on one of my machines and slower on the other, so your mileage may vary; I don't claim this is necessarily the best approach. This code, however, actually computes two logarithms in parallel using both halves of a 128-bit SSE2 word (although the function as-is only returns one result), so it could be adapted into one building block of a SIMD calculation of your entire function (and I think log is the difficult part as cos is pretty well-behaved). Furthermore, your processor supports AVX which can pack 4 double-precision elements into a 256-bit word, and extending this code to AVX should be straightforward.

If you choose not to go full SIMD, you could still use both logarithm slots by pipelining - i.e. compute log(w2*cos(w1)/(M_PI_2-w1)) for the current iteration with log(u2) for the next iteration.

Even if this function benchmarks slower than log in isolation, it may still be worth testing with your actual function. This code does not stress the data cache at all, so it may be friendlier with other code that does (e.g. with a cos that uses lookup tables). Also, microinstruction dispatch could be improved (or not) by surrounding code, depending on whether other code uses SSE.

My other advice (repeated from the comments) would be:

  • Try -march=native -mtune=native to get gcc to optimize for your CPU.
  • Avoid calling both tan and cos on the same argument - use sincos or a trig identity.
  • Consider using a GPU (e.g. OpenCL).

It seems like it's better to compute sin instead of cos - the reason is that you can use it for tan_w1 = sin_w1/sqrt(1.0 - sin_w1*sin_w1). Using cos, which I originally suggested, loses the proper sign when computing tan. And it looks like you can get a good speedup by using a minimax polynomial over [-pi/2, pi/2] as other answerers have said. The 7-term function at this link (make sure you get minimaxsin, not taylorsin) seems to work quite well.

So here's your program rewritten with all SSE2 transcendental approximations:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include "mtwist.h"

#if defined(__AVX__)
#define VECLEN 4
#elif defined(__SSE2__)
#define VECLEN 2
#else
#error // No SIMD available.
#endif

#if VECLEN == 4
#define VBROADCAST(K) { K, K, K, K }
typedef double vdouble __attribute__((vector_size(32)));
typedef long vlong __attribute__((vector_size(32)));
#elif VECLEN == 2
#define VBROADCAST(K) { K, K }
typedef double vdouble __attribute__((vector_size(16)));
typedef long vlong __attribute__((vector_size(16)));
#endif

static const vdouble FALLBACK_THRESHOLD = VBROADCAST(1.0 - 0.001);

vdouble sse_sin(vdouble x) {
   static const vdouble a0 = VBROADCAST(1.0);
   static const vdouble a1 = VBROADCAST(-1.666666666640169148537065260055e-1);
   static const vdouble a2 = VBROADCAST( 8.333333316490113523036717102793e-3);
   static const vdouble a3 = VBROADCAST(-1.984126600659171392655484413285e-4);
   static const vdouble a4 = VBROADCAST( 2.755690114917374804474016589137e-6);
   static const vdouble a5 = VBROADCAST(-2.502845227292692953118686710787e-8);
   static const vdouble a6 = VBROADCAST( 1.538730635926417598443354215485e-10);

   vdouble xx = x*x;
   return x*(a0 + xx*(a1 + xx*(a2 + xx*(a3 + xx*(a4 + xx*(a5 + xx*a6))))));
}

static inline vlong shiftRight(vlong x, int bits) {
#if VECLEN == 4
   __m128i lo = (__m128i)_mm256_extractf128_si256((__m256i)x, 0);
   __m128i hi = (__m128i)_mm256_extractf128_si256((__m256i)x, 1);
   return (vlong)
      _mm256_insertf128_si256(
         _mm256_castsi128_si256(_mm_srli_epi64(lo, bits)),
         _mm_srli_epi64(hi, bits),
         1);
#elif VECLEN == 2
   return (vlong)_mm_srli_epi64((__m128i)x, bits);
#endif
}

static inline vlong shiftLeft(vlong x, int bits) {
#if VECLEN == 4
   __m128i lo = (__m128i)_mm256_extractf128_si256((__m256i)x, 0);
   __m128i hi = (__m128i)_mm256_extractf128_si256((__m256i)x, 1);
   return (vlong)
      _mm256_insertf128_si256(
         _mm256_castsi128_si256(_mm_slli_epi64(lo, bits)),
         _mm_slli_epi64(hi, bits),
         1);
#elif VECLEN == 2
   return (vlong)_mm_slli_epi64((__m128i)x, bits);
#endif
}

static const vlong EXPONENT_MASK = VBROADCAST(0x7ff0000000000000L);
static const vlong EXPONENT_BIAS = VBROADCAST(0x00000000000003ffL);
static const vlong EXPONENT_ZERO = VBROADCAST(0x3ff0000000000000L);
static const vdouble FIXED_SCALE = VBROADCAST(9.31322574615478515625e-10); // 2^-30
static const vdouble LOG2ERECIP = VBROADCAST(0.6931471805599453094172);
static const int EXPONENT_SHIFT = 52;

// Extract IEEE754 double exponent as integer.
static inline vlong extractExponent(vdouble x) {
   return shiftRight((vlong)x & EXPONENT_MASK, EXPONENT_SHIFT) - EXPONENT_BIAS;
}

// Set IEEE754 double exponent to zero.
static inline vdouble clearExponent(vdouble x) {
   return (vdouble)(((vlong)x & ~EXPONENT_MASK) | EXPONENT_ZERO);
}

// Compute log(x) using SSE2 intrinsics to >= 30 bit precision, except
// denorms.
vdouble sse_log(vdouble base) {
   vlong iLog = extractExponent(base);
   vlong fLog = VBROADCAST(0);

   base = clearExponent(base);
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   fLog = shiftLeft(fLog, 10);
   fLog |= extractExponent(base);

   base = clearExponent(base);
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   fLog = shiftLeft(fLog, 10);
   fLog |= extractExponent(base);

   base = clearExponent(base);
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   fLog = shiftLeft(fLog, 10);
   fLog |= extractExponent(base);

   // No _mm_cvtepi64_pd() exists so use 32-bit conversion to double.
#if VECLEN == 4
   __m128i iLogLo = _mm256_extractf128_si256((__m256i)iLog, 0);
   __m128i iLogHi = _mm256_extractf128_si256((__m256i)iLog, 1);
   iLogLo = _mm_srli_si128(_mm_shuffle_epi32(iLogLo, 0x80), 8);
   iLogHi = _mm_slli_si128(_mm_shuffle_epi32(iLogHi, 0x08), 8);

   __m128i fLogLo = _mm256_extractf128_si256((__m256i)fLog, 0);
   __m128i fLogHi = _mm256_extractf128_si256((__m256i)fLog, 1);
   fLogLo = _mm_srli_si128(_mm_shuffle_epi32(fLogLo, 0x80), 8);
   fLogHi = _mm_slli_si128(_mm_shuffle_epi32(fLogHi, 0x08), 8);

   vdouble result = _mm256_cvtepi32_pd(iLogHi | iLogLo) +
      FIXED_SCALE*_mm256_cvtepi32_pd(fLogHi | fLogLo);
#elif VECLEN == 2
   iLog = (vlong)_mm_shuffle_epi32((__m128i)iLog, 0x8);
   fLog = (vlong)_mm_shuffle_epi32((__m128i)fLog, 0x8);

   vdouble result = _mm_cvtepi32_pd((__m128i)iLog) +
      FIXED_SCALE*_mm_cvtepi32_pd((__m128i)fLog);
#endif

   // Convert from base 2 logarithm and extract result.
   return LOG2ERECIP*result;
}

// Original computation.
double fallback(double u1, double u2) {
   double w1 = M_PI*(u1-1/2.0);
   double w2 = -log(u2);
   return tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1));
}

int main() {
   static const vdouble ZERO = VBROADCAST(0.0);
   static const vdouble ONE = VBROADCAST(1.0);
   static const vdouble ONE_HALF = VBROADCAST(0.5);
   static const vdouble PI = VBROADCAST(M_PI);
   static const vdouble PI_2 = VBROADCAST(M_PI_2);

   int i,j;
   vdouble x = ZERO;
   for(i = 0; i < 100000000; i += VECLEN) {
      vdouble u1, u2;
      for (j = 0; j < VECLEN; ++j) {
         ((double *)&u1)[j] = mt_drand();
         ((double *)&u2)[j] = mt_drand();
      }

      vdouble w1 = PI*(u1 - ONE_HALF);
      vdouble w2 = -sse_log(u2);

      vdouble sin_w1 = sse_sin(w1);
      vdouble sin2_w1 = sin_w1*sin_w1;

#if VECLEN == 4
      int nearOne = _mm256_movemask_pd(sin2_w1 >= FALLBACK_THRESHOLD);
#elif VECLEN == 2
      int nearOne = _mm_movemask_pd(sin2_w1 >= FALLBACK_THRESHOLD);
#endif
      if (!nearOne) {
#if VECLEN == 4
         vdouble cos_w1 = _mm256_sqrt_pd(ONE - sin2_w1);
#elif VECLEN == 2
         vdouble cos_w1 = _mm_sqrt_pd(ONE - sin2_w1);
#endif
         vdouble tan_w1 = sin_w1/cos_w1;

         x += tan_w1*(PI_2 - w1) + sse_log(w2*cos_w1/(PI_2 - w1));
      }
      else {
         vdouble result;
         for (j = 0; j < VECLEN; ++j)
            ((double *)&result)[j] = fallback(((double *)&u1)[j], ((double *)&u2)[j]);
         x += result;
      }
   }

   double sum = 0.0;
   for (i = 0; i < VECLEN; ++i)
      sum += ((double *)&x)[i];

   printf("%lf\n", sum);
   return 0;
}

I ran into one annoying problem - the sin approximation error near ±pi/2 can put the value slightly outside [-1, 1] and that (1) causes the computation of tan to be invalid and (2) causes outsized effects when the log argument is near 0. To avoid this the code tests if sin(w1)^2 is "close" to 1 and if it is then it falls back to the original full double precision path. The definition of "close" is in FALLBACK_THRESHOLD at the top of the program - I arbitrarily set it to 0.999 which still appears to return values in the range of the OP's original program but has little effect on performance.

I've edited the code to use gcc-specific syntax extensions for SIMD. If your compiler doesn't have these extensions then you can go back through the edit history. The code now uses AVX if enabled in the compiler to process 4 doubles at a time (instead of 2 doubles with SSE2).

The results on my machine without calling mt_seed() to get a repeatable result are:

Version   Time         Result
original  14.653 secs  -1917488837.945067
SSE        7.380 secs  -1917488837.396841
AVX        6.271 secs  -1917488837.422882

It makes sense that the SSE/AVX results are different from the original result because of the transcendental approximations. I think you should be able to tweak FALLBACK_THRESHOLD to trade off precision and speed. I'm not sure why the SSE and AVX results are slightly different from each other.

rhashimoto
  • Replacing tan(w1) by tanw1 = sqrt(1.0 - cosw1*cosw1)/cosw1; cuts 5 seconds off the time (from 20 to 15 seconds). Using -march=native -mtune=native makes almost no difference. I will try your log now. I would love a solution that used AVX but I don't think I am competent to do it on my own. – Simd May 31 '14 at 18:54
  • Would you mind adding complete code or telling me how to fix http://paste.ubuntu.com/7560751/ please? – Simd May 31 '14 at 19:04
  • I get error: initializer element is not constant static const __m128d FIXED_SCALE = _mm_set1_pd(9.31322574615478515625e-10); // 2^-30 – Simd May 31 '14 at 19:19
  • That's a C error; C++ lets you initialize static data with function calls. I modified my answer to build with C, but now you must call `sselog_init()` when you initialize your program. I also added `#include ` and successfully built with `gcc -o a.out -O3 -march=native -mtune=native -DNDEBUG -Imtwist-1.5 test.c mtwist-1.5/mtwist.c -lm`. – rhashimoto May 31 '14 at 19:29
  • Now down to 12 seconds! Curiously -march=native -mtune=native actually slows it down by 0.3 seconds. – Simd May 31 '14 at 19:34
  • I'll be in http://chat.stackoverflow.com/rooms/54851/how-to-speed-up-tricky-random-number-generation for a half hour or so. – rhashimoto May 31 '14 at 19:35
  • Edited answer to suggest using `sin` instead of `cos`, and a link to a minimax replacement for `sin()` over [-pi/2, pi/2]. – rhashimoto May 31 '14 at 22:39
  • Thank you. I am a little worried about the bugs you have mentioned now. – Simd Jun 01 '14 at 12:04
  • Final (really) edit will build with AVX if enabled. The results match the original program quite closely using the same random number stream. – rhashimoto Jun 01 '14 at 22:25
  • Down to six seconds (from 20). And this time -march=native -mtune=native did help. It cut 2 seconds off. I just need to find a fast RNG now. – Simd Jun 02 '14 at 19:15
  • How about adding http://xorshift.di.unimi.it/xorshift128plus.c just for good measure :) I got it from http://xorshift.di.unimi.it/ . – Simd Jun 02 '14 at 19:26
5

First, a little transformation. This is the original sum:

for(i = 0; i < 100000000; ++i) {
    u1 = mt_drand();
    u2 = mt_drand();
    w1 = M_PI*(u1-1/2.0);
    w2 = -log(u2);
    x += tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1));
}

This sum is mathematically equivalent:

for(i = 0; i < 100000000; ++i) {
    u1 = M_PI - mt_drand()* M_PI;
    u2 = mt_drand();
    x += u1 / tan (u1) + log (sin (u1) / u1) + log (- log (u2));
}

And since it should be equivalent to replace mt_drand() with 1.0 - mt_drand(), we can let u1 = mt_drand() * M_PI.

for(i = 0; i < 100000000; ++i) {
    u1 = mt_drand()* M_PI;
    u2 = mt_drand();
    x += u1 / tan (u1) + log (sin (u1) / u1) + log (- log (u2));
}
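
If you want to convince yourself of the equivalence, here is a quick throwaway check comparing the two summands on the same uniforms:

#include <math.h>
#include <stdio.h>

/* Summand exactly as in the original program. */
static double original(double u1, double u2) {
   double w1 = M_PI*(u1 - 1/2.0);
   double w2 = -log(u2);
   return tan(w1)*(M_PI_2 - w1) + log(w2*cos(w1)/(M_PI_2 - w1));
}

/* Summand after the substitution v = pi/2 - w1 = pi*(1 - u1). */
static double transformed(double u1, double u2) {
   double v = M_PI - u1*M_PI;
   return v/tan(v) + log(sin(v)/v) + log(-log(u2));
}

int main(void) {
   double u1, u2 = 0.7;
   for (u1 = 0.05; u1 < 1.0; u1 += 0.1)
      printf("u1=%.2f diff=%g\n", u1, original(u1, u2) - transformed(u1, u2));
   return 0;
}

The differences should all be at rounding-error level.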

So that's nicely separated now into two functions of a random variable that can be handled separately; x += f (u1) + g (u2). Both functions are nicely smooth over long ranges. f is quite smooth for say u1 > 0.03, and 1 / f is quite smooth for smaller values. g is quite smooth except for values close to 0 or 1. So we could use let's say 100 different approximations for the intervals [0 .. 0.01], [0.01 .. 0.02] and so on. Except that picking the right approximation is time consuming.

To solve this: a uniform random variable on [0 .. 1] will put a certain number of values in the interval [0 .. 0.01], another number of values in [0.01 .. 0.02], and so on. The count out of 100,000,000 random numbers that falls into [0 .. 0.01] is binomially distributed, which you can approximate with a normal distribution. Then you calculate how many of the remaining numbers fall into [0.01 .. 0.02], and so on. If you calculated that say 999,123 numbers fall into [0.00, 0.01], then you produce that number of random numbers in the interval and use the same approximation for all numbers in the interval.

To find an approximation of f (x) in the interval [0.33 .. 0.34], just as an example, you approximate f (0.335 + x / 200) for x in [-1 .. 1]. You'll get reasonably good results by taking an interpolating polynomial of degree n, interpolating at the Chebyshev nodes x_k = cos(pi * (2k - 1) / (2n)).

By the way, the performance of the old x87 trigonometric and logarithmic instructions is poor - absolutely nowhere near the speed of evaluating a low-degree polynomial. And with sufficiently small intervals, you don't need a high polynomial degree.

gnasher729
  • Thank you for this. To get 32-bit accuracy, what size table do you need for your last suggestion? – Simd Jun 01 '14 at 06:52
3

The processor likely implements tan() and cos() as native instructions (hardwired or microcoded): FPTAN (8087+) and FCOS (387+) on the x86/x87 (the 87 coming from the original math coprocessor, the Intel 8087).

So ideally your environment should generate and execute native x87 instructions, namely FCOS and FPTAN (partial tangent). You can save the generated assembly code using gcc's -S flag to explicitly produce assembly output, and search for those instructions. If they are absent, verify that you are using flags that enable code generation for the correct processor submodel (or the closest available) in gcc.
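
For example, something like:

gcc -S -O3 random.c -o random.s
grep -iE 'fptan|fcos' random.s

(On x86-64 targets gcc typically emits calls to the libm tan and cos rather than inlining the x87 instructions, so an empty grep is quite possible; in that case look for the call instructions instead.)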

I don't believe there are any SIMD instruction sets (MMX, SSE, 3DNow!, etc.) that provide functions such as log(), tan(), or cos(), so that is not a (direct) option, but the SIMD instructions are great for doing interpolation from previously computed results or from a table.

Another tack would be to try some of the math optimization options available in gcc, such as -ffast-math, which can be dangerous if you do not understand the implications. The rounding options may suffice if the speed issue is merely the difference between the x87's native 80-bit extended precision and 64-bit IEEE 754 standard double precision.

I don't expect you can easily write an approximation that is suitable and correct for 32-bit floating- or fixed-point numbers and make it faster than the native FPU instructions. It is not clear how accurately you need to follow the particular distribution curve; as with most things relating to PRNGs, the devil is in the minute details.

While ensuring you are at least using native assembly floating point instructions for the basic elementary (transcendental) math functions is a good starting point, perhaps the best performance improvement is to exploit mathematical simplifications such as suggested by tmyklebu and gnasher729 in their answers.

Next, creating a minimax approximation of the non-uniform distribution function itself using the Remez algorithm, as suggested by @tmyklebu and others in their answers, would be the best approach. That is, rather than creating approximations of individual elementary math functions (log, cos, etc.), create a single polynomial approximation of the entire distribution mapping function.

Beyond that I would recommend two books on modern floating-point methods and algorithms, Elementary Functions, Algorithms and Implementation, 2nd ed. and Handbook of Floating-Point Arithmetic, both by Jean-Michel Muller (editor of the second title). The first is more implementation oriented while the second is very comprehensive yet still easy to understand.

With either of those books you should be able to understand the precision vs. speed trade-offs for your situation and write a sufficient implementation.

Personally, I do not recommend Hart's Computer Approximations (1968, or the 1978 reprint); it is simply too dated and too removed from modern computer hardware, although it is easy to find a used or library copy. Nor do I recommend Jack Crenshaw's Math Toolkit for Real-Time Programming, which is really oriented toward non-precision embedded applications.

Jack Ganssle has two pieces introducing approximation for embedded applications, Approximations for Roots and Exponentials and A Guide to Approximations (PDF). While I absolutely do not recommend the given formulas for 32(+)-bit processors, particularly if they have an FPU, they are a gentle introduction to the basics.

mctylr
2
  1. How does C compute sin() and other math functions?
  2. Not really feasible. A table for 32 bits of accuracy (which means you want fixed-point math, not doubles, but I digress) would have to be (2^32)*4 bytes long, i.e. 16 GiB. You may be able to shrink that some if your "32 bits of accuracy" is an output, not an input (i.e. the input range of 0 to 2*pi is represented in < 32 bits, which is the only way you'd be able to represent angles outside 0 to 2*pi anyway). That is beyond the address space of a 32-bit machine and more RAM than many computers have.
IdeaHat
  • You don't need a complete table of 2^32 words. Using techniques like interpolation or even CORDIC requires much smaller tables. – markgz May 23 '14 at 17:12
  • @markgz Yes but as soon as you start introducing those techniques, the "only 32 bit accurate" question comes in (with 32 bit accuracy being far too accurate to be necessary). – IdeaHat May 23 '14 at 17:32
  • Actually, when implementing trigonometric functions you normally do range reduction, down to [0 .. pi/2] or [-pi/4 .. pi/4] – mctylr May 23 '14 at 18:35
  • I should have said 0 .. pi/4 as the single reduction target. The result can be adjusted with simple transforms based on the input's location on the unit circle (which quadrant) to find the correct answer. – mctylr May 23 '14 at 18:53
  • @MadScienceDreams: Getting 32 correct bits really isn't hard with tables. – tmyklebu May 23 '14 at 20:54
2

Like you've said, some transcendental functions like sine, cosine and tangent are available as assembly instructions in the x86 architecture. This is likely how the C library implements sin(), cos(), tan() and friends.

However, I did some fiddling with these instructions a while back, reimplementing the functions as macros and removing all error checking and validation to leave just the bare minimum. Testing against the C library, I remember my macro functions were quite a bit faster. Here is an example of my custom tangent function (forgive the Visual Studio assembly syntax):

#define machine_tan_d(result, x)\
__asm {\
    fld qword ptr [x]\
    fptan\
    fstp st(0)\
    fstp qword ptr [result]\
}
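
Since the question uses gcc, a rough GCC inline-assembly equivalent might look like the following untested sketch (the constraint pattern mirrors what old glibc used for its x87 inlines):

static inline double machine_tan_d(double x) {
   double one, result;
   /* fptan replaces st(0) with tan(x) and then pushes 1.0, so the
      tangent ends up in st(1); "=t" is st(0) and "=u" is st(1). */
   __asm__ ("fptan" : "=t" (one), "=u" (result) : "0" (x));
   (void)one; /* discard the 1.0 that fptan pushes */
   return result;
}

The same caveats apply: fptan only handles |x| < 2^63 and skips all of libm's error handling.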

So if you are willing to make some assumptions, remove error handling/validation and make your code platform specific, then you might be able to squeeze a few cycles by using a macro function like I did.

Now about the second topic, using a lookup table: I wouldn't be so sure it is faster just because you would be using integer operations. A table of integers would place extra pressure on the data cache, possibly resulting in more frequent cache misses and worse execution time than the float operations. But this, of course, can only be determined with careful profiling and benchmarking.

glampert
  • 1
    Why we should use x87 in era of sse2, and when (64-bit) doubles are used? – osgx Jun 01 '14 at 00:16
  • Agreed. I wrote that circa 2009, but yeah, it is worth more as curiosity than for any real life use. There isn't much benefit in writing assembly code today, with compiler intrinsics and whatnot available. – glampert Jun 01 '14 at 03:19
  • @osgx I think the real speedup will come from using AVX.. but first I need to find someone who knows how to code for AVX! – Simd Jun 01 '14 at 12:05
0

1) "It depends"... Depends on the compiler more than the architecture of the chip. 2) Back in the olden days it was popular to use the CORDIC methods to implement trig functions. http://en.wikipedia.org/wiki/CORDIC

Brian Knoblauch
  • The compiler is gcc in this case. – Simd May 23 '14 at 18:58
  • I don't think CORDIC is competitive anymore. The Wikipedia article cites as its main advantage that it doesn't have to do multiplications, but multiplications are very cheap with modern hardware. And you need quite a lot of iterations with CORDIC to get decent precision, if I'm not much mistaken. – cmaster - reinstate monica May 31 '14 at 19:30
  • @cmaster Right. The speed problems in this problem are cos, log and the RNG I think. cos is particularly slow it seems. – Simd Jun 01 '14 at 12:06