
I am trying to compute the function e^x (x is a single-precision float) in x87. To achieve this, I am using the Taylor series (as a reminder: e^x = 1 + (x^1)/1! + (x^2)/2! + ... + (x^n)/n!).

As I'm working with the x87, I can calculate all values in extended precision (80-bit instead of the single-precision 32-bit).

My realization so far is: I have the summand and the sum in two separate registers in the FPU (plus some other less significant things). I have a loop that keeps updating the summand and adding it to the total sum. So, in vague pseudo-code:

loop:
;update my summand
n = n + 1
summand = summand / n
summand = summand * x
;add the summand to the total sum
sum = sum + summand
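
For concreteness, here is one way that loop body might look in x87 assembly. This is a hedged sketch only: the register layout (sum in st(0), summand in st(1)) and the memory labels x and n are assumptions, not part of the question.

taylor_loop:
    inc     dword [n]           ; n = n + 1
    fxch    st(1)               ; bring the summand to the top of the stack
    fidiv   dword [n]           ; summand = summand / n
    fmul    dword [x]           ; summand = summand * x (x is a 32-bit float)
    fxch    st(1)               ; bring the sum back to the top
    fadd    st(0), st(1)        ; sum = sum + summand
    ; exit-condition test goes here
    jmp     taylor_loop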

My problem now is the exit condition of the loop. I wanted to design it so that the loop exits once adding the summand to the total sum no longer changes the value of the sum in single precision. However, I am finding out the hard way that implementing such an exit condition is very complex and uses up a lot of instructions, and therefore computing time.

My only halfway-decent idea so far is: 1. Get the exponents of the sum and the summand via FXTRACT. If (exp(sum) - exp(summand)) > 23, the summand will no longer affect the bit representation of the sum in single precision (because the mantissa in single precision has 23 bits), so exit. 2. Compare the summand with 0; if it is zero, it obviously will not affect the result anymore either. (A sketch of idea 1 follows below.)
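
To illustrate why idea 1 is so expensive, here is a hedged sketch of what the FXTRACT test could look like (assumptions: sum in st(0), summand in st(1), a constant c23: dd 23.0 in memory, and an exit_loop label; this is one possible encoding, not the only one):

fld     st(1)           ; copy the summand
fxtract                 ; st(0) = significand, st(1) = exponent of summand
fstp    st(0)           ; discard the significand
fld     st(1)           ; copy the sum (the stack has grown by one)
fxtract                 ; st(0) = significand, st(1) = exponent of sum
fstp    st(0)           ; discard the significand
fsubrp  st(1), st(0)    ; st(0) = exp(sum) - exp(summand)
fcomp   dword [c23]     ; compare the difference with 23.0 and pop
fnstsw  ax              ; copy FPU status flags to AX
sahf                    ; ...and into EFLAGS
ja      exit_loop       ; exponent gap > 23: summand can no longer matter

That is roughly a dozen instructions per iteration, including the notoriously slow fxtract, executed twice.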

My question is whether somebody has any more efficient ideas for the exit condition.

  • I know that's not your question, but why are you using the Taylor series? – Jester Jul 06 '17 at 19:17
  • Now finding out about the difficulty of the breaking condition... let's just call it a beginner's mistake; it's my first proper x87 project ;-) – Maru Jul 06 '17 at 19:33
  • If you're lazy you can implement exp with `fyl2x` and `f2xm1`; they're some of the slowest instructions, though. I'd probably do range reduction and use a Padé approximation. – harold Jul 06 '17 at 19:40
  • But if you're committed to doing the Taylor thing, you can also use as exit condition that your number has stopped changing. You *could* test that manually like you're suggesting, but if adding the number won't make any difference you can simply do the addition and then detect that it made no difference. – harold Jul 06 '17 at 19:42
  • @harold that was exactly my thought as well, but how do I detect that change? No change occurring in the 80-bit precision doesn't mean there would be no change in the 32-bit precision, so even though exiting wouldn't make a difference in the returned value, I won't exit because the condition isn't detected. – Maru Jul 06 '17 at 19:49
  • You could use [code like this](https://gist.github.com/fuzxxl/b0ceb54c42a6d4b0add1684ecc3abf80). Note that even though we use `frndint`, the actual rounding mode doesn't matter as this is just used to split the number into an integral and a fractional part. – fuz Jul 06 '17 at 19:54
  • @MaruanElMahgary you could compare the 32-bit result. That seems slow, but `fxtract` is even slower. – harold Jul 06 '17 at 19:55
  • @harold how exactly could I compare the 32-bit result? The only way I know to get the 32-bit value is to store the 80-bit value to memory, automatically converting it into 32 bits, and then reload it back onto the stack, but that seems very impractical... – Maru Jul 06 '17 at 20:12
  • @MaruanElMahgary well, that's how... it feels worse than it is. `fstp` and `fld` are 1-µop and (relatively) fast, while special instructions such as `fxtract` are microcoded and slow, so if that's the alternative, the "weird storing and reloading" starts to look pretty good (see the sketch after these comments). A simpler alternative to all of this is just not having a dynamic exit condition. – harold Jul 06 '17 at 20:22
  • @harold Okay great thanks for the help, and thanks to everyone else for the input as well! I'll consider all of this in this and future projects! :) – Maru Jul 06 '17 at 20:27
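
A hedged sketch of the store-and-compare test harold describes (the scratch dwords cur32 and prev32 and the labels are hypothetical names; sum is assumed to be in st(0)):

fld     st(0)               ; copy the sum
fstp    dword [cur32]       ; store as a 32-bit float (rounds to single precision)
mov     eax, [cur32]
cmp     eax, [prev32]       ; identical bit patterns => converged in single precision
je      exit_loop
mov     [prev32], eax       ; remember this iteration's rounded sum
jmp     taylor_loop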

1 Answer


Perhaps the most efficient way to compute e^x using the x87 floating-point unit is the following sequence of instructions, which exploits the identity e^x = 2^(x * log2(e)):

; Computes e^x via the formula 2^(x * log2(e))
fldl2e                  ; st(0) = log2(e)        <-- load log2(e)
fmul    [x]             ; st(0) = x * log2(e)
fld1                    ; st(0) = 1              <-- load 1
                        ; st(1) = x * log2(e)
fld     st(1)           ; st(0) = x * log2(e)    <-- make copy of intermediate result
                        ; st(1) = 1
                        ; st(2) = x * log2(e)
fprem                   ; st(0) = partial remainder((x * log2(e)) / 1)  <-- call this "rem"
                        ; st(1) = 1
                        ; st(2) = x * log2(e)
f2xm1                   ; st(0) = 2^(rem) - 1
                        ; st(1) = 1
                        ; st(2) = x * log2(e)
faddp   st(1), st(0)    ; st(0) = 2^(rem) - 1 + 1 = 2^(rem)
                        ; st(1) = x * log2(e)
fscale                  ; st(0) = 2^(rem) * 2^(trunc(x * log2(e)))
                        ; st(1) = x * log2(e)
fstp    st(1)

The result is left in st(0).

If you already have the input, x, on the floating-point stack, then you can modify the sequence of instructions slightly:

; st(0) == x
fldl2e                  ; push log2(e)
fmulp   st(1), st(0)    ; st(0) = x * log2(e)
fld1                    ; push 1
fld     st(1)           ; push a copy of x * log2(e)
fprem                   ; st(0) = remainder of (x * log2(e)) / 1
f2xm1                   ; st(0) = 2^(rem) - 1
faddp   st(1), st(0)    ; st(0) = 2^(rem)
fscale                  ; st(0) = 2^(rem) * 2^(trunc(x * log2(e)))
fstp    st(1)           ; leave the result in st(0)

This multiplies x by log2(e), finds the remainder of that value divided by 1 (i.e., its fractional part), raises 2 to the power of that fractional part (f2xm1 also subtracts 1, so we add 1 back in), and finally scales the result by 2 raised to the truncated integral part of x × log2(e).
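
As a quick sanity check of the decomposition, take x = 1: x × log2(e) ≈ 1.4427. fprem leaves a remainder of ≈ 0.4427 (the integral part is 1), f2xm1 produces 2^0.4427 − 1 ≈ 0.3591, adding the 1 back gives ≈ 1.3591, and fscale multiplies by 2^1, for a final result of ≈ 2.7183 ≈ e, as expected.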


An alternative implementation (essentially the one suggested by fuz, and reminiscent of the code generated by MSVC for the exp function from the C standard library) is the following:

; st(0) == x
fldl2e                       ; push log2(e)
fmulp    st(1), st(0)        ; st(0) = x * log2(e)
fld      st(0)               ; push a copy
frndint                      ; st(0) = round(x * log2(e))
fsub     st(1), st(0)        ; st(1) = fractional part (in [-0.5, +0.5] under round-to-nearest)
fxch                         ; swap; pairable with FSUB on Pentium (P5)
f2xm1                        ; st(0) = 2^(frac) - 1
fld1                         ; push 1
faddp    st(1), st(0)        ; st(0) = 2^(frac)
fscale                       ; st(0) = 2^(frac) * 2^(round(x * log2(e)))
fstp     st(1)               ; leave the result in st(0)

The primary difference is the use of frndint and fsub to get a value in the range −1.0 to +1.0, as required by f2xm1, instead of using fprem to get the remainder after a division by 1.
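
For completeness, here is a minimal, hypothetical usage sketch (NASM syntax; the labels x, res, and compute_exp are assumptions) showing how the second sequence might be wrapped to read a single-precision input from memory and store a single-precision result:

section .data
x:      dd 1.5                  ; example single-precision input
res:    dd 0.0                  ; single-precision output

section .text
compute_exp:
    fld     dword [x]           ; st(0) = x
    fldl2e
    fmulp   st(1), st(0)
    fld     st(0)
    frndint
    fsub    st(1), st(0)
    fxch
    f2xm1
    fld1
    faddp   st(1), st(0)
    fscale
    fstp    st(1)
    fstp    dword [res]         ; final store rounds to single precision
    ret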

To get an idea of the relative costs of these instructions, we'll pull data from Agner Fog's instruction tables. A "?" indicates that the corresponding data is unavailable.

Instruction      AMD K7      AMD K8      AMD K10    Bulldozer    Piledriver    Ryzen
-------------------------------------------------------------------------------------------
FLD/FLD1/FLDL2E  [all very fast, 1-cycle instructions, with a reciprocal throughput of 1]
FADD(P)/FSUB(P)  1/4/1       1/4/1       1/4/1      1/5-6/1      1/5-6/1       1/5/1
FMUL(P)          1/4/1       1/4/1       1/4/1      1/5-6/1      1/5-6/1       1/5/1
FPREM            1/7-10/8    1/7-10/8    1/?/7      1/19-62/?    1/17-60/?     2/?/12-50
FRNDINT          5/10/3      5/10/3      6/?/37     1/4/1        1/4/1         1/4/3
FSCALE           5/8/?       5/8/?       5/9/29     8/52/?       8/44/5        8/9/4
F2XM1            8/27/?      53/126/?    8/65/30?   10/64-71/?   10/60-73/?    10/50/?

The notation used above is "ops/latency/reciprocal throughput".

Instruction      8087     287      387      486      P5         P6        PM      Nehalem
-------------------------------------------------------------------------------------------
FLD              17-22    17-22    14       4        1/0        1/?       1/1      1/1
FLD1             15-21    15-21    24       4        2/0        2/?       2/?      2/?
FLDL2E           15-21    15-21    40       8        5/2        2/?       2/?      2/?
FADD(P)/FSUB(P)  70-100   70-100   23-34    8-20     3/2        1/3       1/3      1/3
FMUL(P)          90-145   90-145   29-57    16       3/2        1/5       1/5      1/5
FPREM            15-190   15-190   74-155   70-138   16-64 (2)  23/?      26/37    25/14
FRNDINT          16-50    16-50    66-80    21-30    9-20 (0)   30/?      15/19    17/22
FSCALE           32-38    32-38    67-86    30-32    20-32 (5)  56/?      28/43    24/12
F2XM1            310-630  310-630  211-476  140-279  13-57 (2)  17-48/66  ~20/?    19/58

For pre-P5 CPUs, the values are just cycle counts. For P5, the values are cycle counts, with the number after the slash or in parentheses indicating cycles that can overlap with other FP instructions. For P6 and later, the notation is µops/latency.

Clearly, f2xm1 is a slow, expensive instruction, but it is used by both implementations and is hard to avoid. As slow as it is, though, it's still far faster than implementing the computation as a loop.

(One possible way around the slow f2xm1 instruction, if you are willing to sacrifice code size for speed, would be a look-up table-based approach. You can find several papers published on this if you search, although most of them are behind paywalls. :-( Here's one publicly accessible reference. In fact, this is probably what an optimized math library written in C would do for its implementation of exp. Ben Voigt wrote similar code in C#. The basic question here is, as always: are you optimizing for size or for speed? In other words, are you calling this function frequently and need it to be as fast as possible? Or are you calling it infrequently and just need it to be precise, without consuming too much space in the binary and potentially slowing down other code on the hot path by evicting it from the cache?)

But what about fprem vs. frndint?

Well, on AMD CPUs, fprem pretty consistently decodes to fewer operations than frndint, although frndint is getting faster than fprem on modern AMD CPUs. Then again, I'd say that's irrelevant, because on those CPUs, you'd be writing SSE2/AVX code, not using the legacy x87 FPU. (These instructions also get slower on later model Intel microarchitectures, like Sandy Bridge, Haswell, etc., but the same argument applies there, so I've ignored those altogether in compiling the data.) What really matters is how this code performs on older CPUs, and on older CPUs, the version that uses fprem looks like a clear winner.

On Intel CPUs, however, the reverse appears to be true: frndint is generally faster than fprem, except on P6 (Pentium Pro, Pentium II, and Pentium III).

But we're not just talking about frndint vs. fprem; we're actually talking about frndint+fsub vs. fprem. fsub is a relatively fast instruction, but it starts to get difficult to predict just what performance this code will have without executing it and timing it. Cycle counts can only tell us so much, and the fprem sequence is shorter overall (fewer instructions and, more importantly, fewer bytes: 18 vs. 22), which can mean substantial speed improvements.

Either implementation is a good choice if you don't want to bother benchmarking, or if you want something generic that runs well on all CPUs. Neither is going to set the world on fire in terms of performance, but, as I said before, both will be significantly faster than a loop that tries to calculate a Taylor series. The overhead there is what will kill your performance.

  • I suspect that even on P5, an implementation just using simple x87 instructions like multiply and add might be fastest. I don't know about accuracy; I know the x87 trig functions aren't great, especially when range-reduction causes issues, but IDK if the internal value of `e` is better than the x87 value of `pi` which is [only accurate to 66 bits](https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/). I think good math-library implementations typically did everything using basic operations even in the x87 days before SSE. – Peter Cordes Jul 08 '17 at 06:28
  • If this sounds pretty hand-wavey, it is. I'm just assuming that there's some fast-converging method for e^x. Can we do anything with `2^int(x)`, which we can get by stuffing the integer part of `x` into the exponent of an IEEE binary float? Probably not; the reverse works for `ln(x)` and `log10(x)` because `ln(x) = log2(x) * ln(2)`, then you just need a polynomial approximation for `log2(x)` over the domain of the mantissa. But `e^x / 2^x` is `(e/2)^x`, which isn't a constant. – Peter Cordes Jul 08 '17 at 06:36
  • You may be right about multiply and add being faster, @Peter. I'm not actually sure without testing it. *But* as slow as loops and instruction fetching are on older processors, I doubt it holds true there. Maybe on P5, and probably even more so on P6, where the FPU was finally (partially) pipelined. It isn't true, though, that good math-lib implementations used to use only basic ops. Really good optimizing compilers, like Watcom, actually *did* generate inline complex x87 FPU instructions like the sequences above. In fact, my "most efficient" implementation is basically what Watcom used to... – Cody Gray - on strike Jul 08 '17 at 07:48
  • …use for the `exp` function, and it's what modern versions of MSVC continue to use when SSE instructions are disabled. I'm not sure how far back this lineage goes on MSVC, but I suspect it's quite a while, since there's been no recent development on x87 code generation! I think the internal value of `e` is accurate; I haven't heard anything to suggest otherwise, and the issue with `pi` is just a bug, plain and simple. The trig insns were new in the 287, and probably were just hastily shipped without being fully tested. Would be nice to do something to replace `f2xm1`, though. – Cody Gray - on strike Jul 08 '17 at 07:51
  • Probably if you could spare the bytes, a look-up table would be faster. My mathematics knowledge is relatively weak, but I see papers published on that. [Here's one reference](http://www.convict.lu/Jeunes/ultimate_stuff/exp_ln_2.htm). That is probably what a fast math lib written in C would do. – Cody Gray - on strike Jul 08 '17 at 07:54
  • Yeah, I wouldn't be surprised if a small LUT made a useful part of an `exp(x)` function for older CPUs where the relative costs of FP computation vs. a cache miss was a lot different than now. Or maybe even still now... I recently optimized the crap out of AVX2+FMA asinh and log10 functions (extract exponent and use a polynomial for the mantissa works great, and even better with AVX512 `vgetexp` and `vgetmant` :), but haven't looked at `exp()`. – Peter Cordes Jul 08 '17 at 07:58
  • Bit-twiddling of FP values works *much* better with SSE/AVX than it does with the legacy x87 because you don't have to round-trip values through memory. Interestingly, though, my research turned up [this page](https://www.ajjacobson.us/recombining-tree/how-slow-is-exp.html), which looks like a crappy/incomplete online mirror of [this book](https://www.amazon.com/Monte-Carlo-Methods-Finance-Jaeckel/dp/047149741X), which includes a chapter entitled "How slow is exp?". It discusses essentially the optimized x87 implementation I present here, and... – Cody Gray - on strike Jul 08 '17 at 08:17
  • ...[claims that, as of the Pentium III (and possibly earlier), a LUT-based approach would give no considerable speedup](https://www.ajjacobson.us/recombining-tree/ex-[2xlog2e-1-1142.html): *"Since the introduction of the Pentium III model, even the fastest implementation of a lookup table based exp() replacement does no longer provide any speedup when compared with the optimised system given code for this function."* I wish I could find the original text somewhere, but I don't want to order the book just out of curiosity! – Cody Gray - on strike Jul 08 '17 at 08:17
  • Related: [Fastest Implementation of Exponential Function Using AVX](//stackoverflow.com/q/48863719) shows how to implement exp(x) with bit-stuffing for the truncated-integer part of the exponent, and a polynomial for the fractional part. For 32-bit float, the polynomial doesn't have to be too high a degree. For a 64-bit double or 80-bit x87, more terms would be needed to get enough accuracy. (Or a ratio of two polynomials is often good.) VCL https://github.com/vectorclass/version2/blob/46a58414bc6128b61d77186656860ca7b0b961d8/vectormath_exp.h#L139 has `double` Taylor and Pade versions. – Peter Cordes Dec 19 '20 at 18:04