I do not want correct rounding for function exp

Question

The GCC implementation of the C mathematical library on Debian systems has apparently an (IEEE 754-2008)-compliant implementation of the function exp, implying that rounding shall always be correct:

(from Wikipedia) The IEEE floating point standard guarantees that add, subtract, multiply, divide, fused multiply–add, square root, and floating point remainder will give the correctly rounded result of the infinite precision operation. No such guarantee was given in the 1985 standard for more complex functions and they are typically only accurate to within the last bit at best. However, the 2008 standard guarantees that conforming implementations will give correctly rounded results which respect the active rounding mode; implementation of the functions, however, is optional.

It turns out that I am encountering a case where this feature is actually hindering, because the exact result of the exp function is often nearly exactly at the middle between two consecutive double values (1), and then the program carries plenty of several further computations, losing up to a factor 400 (!) in speed: this was actually the explanation to my (ill-asked :-S) Question #43530011.

(1) More precisely, this happens when the argument of exp turns out to be of the form (2 k + 1) × 2^-53 with k a rather small integer (like 242 for instance). In particular, the computations involved by pow (1. + x, 0.5) tend to call exp with such an argument when x is of the order of magnitude of 2^-44.

Since implementations of correct rounding can be so much time-consuming in certain circumstances, I guess that the developers will also have devised a way to get a slightly less precise result (say, only up to 0.6 ULP or something like this) in a time which is (roughly) bounded for every value of the argument in a given range… (2)

… But how to do this??

(2) What I mean is that I just do not want that some exceptional values of the argument like (2 k + 1) × 2^-53 would be much more time-consuming than most values of the same order of magnitude; but of course I do not mind if some exceptional values of the argument go much faster, or if large arguments (in absolute value) need a larger computation time.

Here is a minimal program showing the phenomenon:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
 {
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
   {
    a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
    c += exp (a); // Just to be sure that the compiler will actually perform the computation of exp (a).
   }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %d\n", stop - start);
  return 0;
 }

Now after gcc -std=c99 program53.c -lm -o program53:

$ ./program53
1.000000e+06
Clock time spent: 13470008
$ ./program53 
1.000000e+06
Clock time spent: 13292721
$ ./program53 
1.000000e+06
Clock time spent: 13201616

On the other hand, with program52 and program54 (got by replacing 0x20000000000000 by resp. 0x10000000000000 and 0x40000000000000):

$ ./program52
1.000000e+06
Clock time spent: 83594
$ ./program52
1.000000e+06
Clock time spent: 69095
$ ./program52
1.000000e+06
Clock time spent: 54694
$ ./program54
1.000000e+06
Clock time spent: 86151
$ ./program54
1.000000e+06
Clock time spent: 74209
$ ./program54
1.000000e+06
Clock time spent: 78612

Beware, the phenomenon is implementation-dependent! Apparently, among the common implementations, only those of the Debian systems (including Ubuntu) show this phenomenon.

P.-S.: I hope that my question is not a duplicate: I searched for a similar question thoroughly without success, but maybe I did note use the relevant keywords… :-/

I would write my own `exp_approx`, using an existing implementation of `exp` as a starting point. — aschepler, Jun 03 '17 at 16:58
Why are you doing `pow(1.0 + x, 0.5)` rather than the obvious (and fast) `sqrt(1.0 + x)`? — EOF, Jun 03 '17 at 16:59
@EOF: In my actual program `0.5` is actually a parameter input by the user, which may take any value between 0 and 1. By the way, I realized that there was something strange going on when I found that the program speed crashed down completely when I input an paramter of `0.5` rather than `0.49` or `0.51`… ;-) — Rémi Peyre, Jun 03 '17 at 17:19
@Nancy-N Could you post a [MCVE]? If the problem is specific to a power of 0.5 *exactly*, you could special-case it with `sqrt()`. — EOF, Jun 03 '17 at 17:26
Please edit question instead of adding useful info in comment. — Yunnosch, Jun 03 '17 at 18:07
When you're taking the `exp()` of numbers very close to `0`, floating-point numbers are problematic, since you get a number close to `1` while all the precision is in the *difference* to one, so you lose most significant digits. It is more precise (and *significantly* faster in your testcase) to compute `exp(x) - 1` through the C math library function `expm1(x)`. If you *really* need the `exp()`, it's still much faster to do `expm1(x) + 1`. A similar concern exists for computing `log(1 + x)`, for which there is the function `log1p()`. — EOF, Jun 03 '17 at 18:41
@EOF Sounds like a good answer to a good question. Please make a nice Q/A pair. — Yunnosch, Jun 03 '17 at 18:53
@Yunnosch I'm pretty frustrated with this question. It *feels* like it could be an interesting question, but the lack of reason behind the testcase (it has next to nothing to do with the original description in the question) makes it difficult to write a comprehensive answer. — EOF, Jun 03 '17 at 18:56
@EOF I did not know `expm1`; indeed it solves the problem in this case, as the value of `exp (x) - 1` will *not* be at the (almost) middle between two consecutive `double` numbers (though `exp (x)` is…). So thank you for this suggestion! :-) However I would still be interested by an implementation of `exp` not looking for correct rounding: first because using `exp` makes the code easier to understand, but above all because I think that it would be good anyway to be certain that no peculiar value of your argument can crash the speed of your function ;-) — Rémi Peyre, Jun 03 '17 at 21:27
@EOF I do not quite understand what you mean by “the lack of reason behind the testcase” and why you feel it difficult to write a comprehensive answer… Could you be more specific? Personally, the way I see it, there are two distinct points in my question: first, a problem of code speed for one peculiar program, to which your `expm1` suggestion answers well; and next, motivated by the first point but having its own interest, looking for an implementation of `exp` slightly less precise but with decent speed for *all* the value of its argument. — Rémi Peyre, Jun 03 '17 at 21:32
@Nancy-N Where you see a precise `exp()` wasting your time with meaningless precision (after all, `double` has plenty of bits *anyway*, right?), I see a library function *desperately* trying to protect the last *handful* of bits of precision from the ravages of your poorly chosen computation. You *can* of course stab it in the back and use a less accurate approximation, but don't be surprised if you end up with a final result that is barely in the right order of magnitude. — EOF, Jun 05 '17 at 10:33
@Nancy-N See, `log(1 + x)` is approximately `x` for `abs(x)` much smaller than one. Now, if `x` is `pow(2, -44)`, `exp(log(1 + x))` is `1 + x`, but of the significant bits of `x` you have *at best* `53 - 44 == 9` bits of precision left. That's only `512` different *possible results* for your inputs in that range. How many more bits can you spare for a faster approximation? — EOF, Jun 05 '17 at 10:39
@EOF I see your point; however (which you could not guess) it is not relevant in my actual context. In the program where I encountered the issue, what I do is convoluting some function with another function defined in terms of `exp`, which leads to making the sum of a handful of terms using `exp` with quite different arguments. Long story short, in this context I do not mind having a crappy computation of `expm1 (x)`, provided my computation of `exp (x)` is fair. — Rémi Peyre, Jun 06 '17 at 12:11
… And of course, however clever it may be to use expm1, my main question still remains: have the developers of the mathematical library devised a way to disregard exact rounding; and if yes, what is this way…? — Rémi Peyre, Jun 06 '17 at 12:21
@Nancy-N: And when I asked for a testcase, I'd have *liked* to see the *actual usecase*, or something *a bit* simplified. Something where the part that I "could not have guessed" is represented. It's the "complete" part of [MCVE]. — EOF, Jun 06 '17 at 15:14
@Nancy-N: To answer the specific question you insist on: Floating-point is hard, and often times counterintuitive. Not every programmer has read [what they should have](http://download.oracle.com/docs/cd/E19957-01/806-3568/ncg_goldberg.html). When libraries *used to* allow some slightly inaccurate rounding, people complained about the precision of the library function when their inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people can't shift the blame to them. — EOF, Jun 06 '17 at 15:30
`exp` is not provided by GCC, but by your [C standard library](https://en.wikipedia.org/wiki/C_standard_library) i.e. `libc.so` and `libm.so`. However, read http://floating-point-gui.de/ — Basile Starynkevitch, Jun 07 '17 at 04:41
@M.M No it does not help :-/ I guess this is because the request for correct rounding comes from the mathematical library, not from the compiler itself. — Rémi Peyre, Jun 07 '17 at 19:44
@EOF Thank you for this explanation about the reasons behind the choice of the developers of the mathematical library :-) —or rather of the devisers of the IEEE 754-2008 standard, in this context. Maybe you could combine this comment with the latest comment of NominalAnimal under your answer (the comment saying “Take a look at ACML and Intel MKL…”) into a standard answer, which I would then validate? ;-) — Rémi Peyre, Jun 07 '17 at 20:05
@EOF Concerning the actual usecase, honestly there would no point in showing it: - First, because (an excerpt of) the actual program would not be enlightening at all: I had written it to perform numerical investigation on some particular random object which I needed to understand better for some purely mathematical research… (I am a mathematician ;-) ). - Second, because my problem could actually be bypassed by some classical code optimization: as the line of code in which I used the exponential would only call ~100 different arguments, I just pre-computed all the results in a table! ;-) — Rémi Peyre, Jun 07 '17 at 20:36
@Nancy-N: Ah, memoization is something I didn't think of (at least somewhat because it's a bit situational. There are *easily* enough `double` values in the range around `pow(2, -44)` to make memoization inapplicable in general). Either way, it *feels* like there might be a lot of fun optimizations that could have been discussed here if more of the context was given. I like discussing this kind of thing ;-). — EOF, Jun 07 '17 at 20:42
@EOF (continued) … So, it was mainly by curiosity that I eagerly wanted to understand where the problem came from (which I managed to do mainly by myself: it was this requirement for exact rounding). Once I found the cause of the phenomenon, I wondered whether there was a way to be sure that I would never encounter such a problem again, by re`#define`ing the `exp` function as something with only 1 ULP precision, but never very slow… — Rémi Peyre, Jun 07 '17 at 20:44
@EOF And just for fun, here is the original line of code which caused me that trouble: `noise1 = (1 - pow (1 - exp (x), 2 * H)) / (2 * H);` where `x` was a variable signed number of order of magnitude 16, and `H` was a “Hurst parameter”, remaining the same during one program run, but likely to take any value between 0 and 1 (excluded). I was running my program for different values of `H`, and was struck to observe that the case `H = 0.25` was running some 10 times slower than all the other values… — Rémi Peyre, Jun 07 '17 at 20:55
@Nancy-N I believe `(-expm1(2 * H * log1p(-exp(x)))) / (2 * H)` will give very fast (and much more accurate) results for the line you posted ;-) — EOF, Jun 07 '17 at 21:55

score 11 · Accepted Answer · edited Jun 20 '20 at 09:12

11

To answer the general question on why the library functions are required to give correctly rounded results:

Floating-point is hard, and often times counterintuitive. Not every programmer has read what they should have. When libraries used to allow some slightly inaccurate rounding, people complained about the precision of the library function when their inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people cannot shift the blame to them.

In many cases, specific knowledge about floating point algorithms can produce considerable improvements to accuracy and/or performance, like in the testcase:

Taking the exp() of numbers very close to 0 in floating-point numbers is problematic, since the result is a number close to 1 while all the precision is in the difference to one, so most significant digits are lost. It is more precise (and significantly faster in this testcase) to compute exp(x) - 1 through the C math library function expm1(x). If the exp() itself is really needed, it is still much faster to do expm1(x) + 1.

A similar concern exists for computing log(1 + x), for which there is the function log1p(x).

A quick fix that speeds up the provided testcase:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
{
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
    {
      a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
      c += expm1 (a) + 1; // replace exp() with expm1() + 1
    }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %d\n", stop - start);
  return 0;
}

For this case, the timings on my machine are thus:

Original code

1.000000e+06

Clock time spent: 21543338

Modified code

1.000000e+06

Clock time spent: 55076

Programmers with advanced knowledge about the accompanying trade-offs may sometimes consider using approximate results where the precision is not critical

For an experienced programmer it may be possible to write an approximative implementation of a slow function using methods like Newton-Raphson, Taylor or Maclaurin polynomials, specifically inexactly rounded specialty functions from libraries like Intel's MKL, AMD's AMCL, relaxing the floating-point standard compliance of the compiler, reducing precision to ieee754 binary32 (float), or a combination of these.

Note that a better description of the problem would enable a better answer.

edited Jun 20 '20 at 09:12

Community

1
1

answered Jun 03 '17 at 19:01

EOF

6,273
2
26
50

5

Excellent suggestion, EOF. I examined a few billion uniformly pseudo-random -0.125 < x < 0.125, and 0 < y < 1. Calculating (1+x)**y using `pow(x+1, y)` takes median 2.6 times as long as `1.0 + expm1(y * log1p(x))`, with at most one unit in the least significant place difference between the two. Interestingly enough, 87.200% of the results (of the two functions) match, and only 12.800% differ. – Nominal Animal Jun 03 '17 at 20:50
Ditto, "excellent suggestion". I've been programming in C for a long time, and was never aware of those functions, which I could have used once or twice (though not all that often). Good to Know! Thanks. – John Forkosh Jun 03 '17 at 23:38
But I'm a little confused... won't expm1(small_number)+1.0 still lose the left-hand term, just like I'd suppose, e.g., (1.e-99) + 1.0 would presumably lose it? I'd think expm1() would mainly be useful when you don't need the +1 term to begin with (e.g., when exp()-1 is what you want). – John Forkosh Jun 03 '17 at 23:51
@JohnForkosh: I tried explaining that in the first couple of sentences. If you compute `exp(x)` for `abs(x)` much smaller than one, you get a number close to one where all the relevant bits are the *difference* to one. You *can* try continuing your computation with `exp(x)`, but your results will have terrible precision. It's much better to rewrite your expressions to use the mathematical value of `exp(x)-1`, and compute it using `expm1(x)`. – EOF Jun 05 '17 at 10:29
Thanks, @EOF , and I agree with your comment. But note the line in your code **c += expm1(a)+1;** So that will first evaluate **expm1(a)** as, let's say, **0.000...0001 = 1.e-whatever** (maybe expressed in some "excess" notation by typical fpu's). But then you **immediately** add **1.0** to that, which would seem to lose the "trailing" **.000...0001**, whenever it's sufficiently small, in exactly the way discussed. So by the time you get to your **c+=** part, the **expm1()** advantage is lost. Please see next comment... – John Forkosh Jun 05 '17 at 11:18
...continued. What I'd think you'd have to do is as follows. **Inside** the loop say **only** the **c+=expm1(a)** part. Then **after** the loop, after all those little numbers have been accumulated, you can maybe say c+=1.e6 (the number of passes through the loop), so that the total of all the little numbers isn't lost. But if that total's still too small, maybe **printf()** the integer part (number of passes) followed by the small **c** total. – John Forkosh Jun 05 '17 at 11:23
@JohnForkosh You are right. As I wrote, this answer is only a quick and dirty fix for the performance problem, rather than the (badly needed) fix for the abysmal precision of the OP's code. Unfortunately, the testcase doesn't really offer much to work with. – EOF Jun 05 '17 at 11:26
@NominalAnimal I find all of that quite interesting; however, I keep being interested in my original question: have the developers of the mathematical library devised a way to disregard exact rounding; and if yes, what is this way…? – Rémi Peyre Jun 06 '17 at 12:24
2

@Nancy-N: There are a slew of compiler options like GCC's `-freciprocal-math` that do trade precision for speed, but the `` library functions definitely strive for precision and not speed. Intel MKL does have `vmdExp()` that takes a mode parameter ("high accuracy", "low accuracy", "enhanced performance accuracy"). AMD AMCL has `fastexp()` (1 ULP precision, 75 clocks for most arguments). So, my answer to that is *"Well, no; but there are other libraries, and you can even write your own code, to provide less-precise more performant versions. Take a look at ACML and Intel MKL."* – Nominal Animal Jun 06 '17 at 14:36
@NominalAnimal Thank you, this is exactly the kind of answer I was looking for! :-D On my suggestion, EOF has combined your comment with one of his to make the answer that I have validated ;-) – Rémi Peyre Jun 07 '17 at 20:46

John Forkosh · Answer 2 · 2017-06-07T22:34:00.670

Regarding your comment to @EOF 's answer, the "write your own" remark from @NominalAnimal seems simple enough here, even trivial, as follows.

Your original code above seems to have a max possible argument for exp() of a=(1+2*0x400)/0x2000...=4.55e-13 (that should really be 2*0x3FF, and I'm counting 13 zeroes after your 0x2000... which makes it 2x16^13). So that 4.55e-13 max argument is very, very small.

And then the trivial taylor expansion is exp(a)=1+a+(a^2)/2+(a^3)/6+... which already gives you all double's precision for such small arguments. Now, you'll have to discard the 1 part, as explained above, and then that just reduces to expm1(a)=a*(1.+a*(1.+a/3.)/2.) And that should go pretty darn quick! Just make sure a stays small. If it gets a little bigger, just add the next term, a^4/24 (you see how to do that?).

>>EDIT<<

I modified the OP's test program as follows to test a little more stuff (discussion follows code)

/* https://stackoverflow.com/questions/44346371/
   i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 16               /*denominator will be (multiplier)xBASE^EXPON*/
#define EXPON 13
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/

int main (int argc, char *argv[]) {
  int N          = (argc>1?atoi(argv[1]):1e6),
      multiplier = (argc>2?atoi(argv[2]):2),
      isexp      = (argc>3?atoi(argv[3]):1); /* flags to turn on/off exp() */
  int isexpm1    = 1;                        /* and expm1() for timing tests*/
  int i, n=0;
  double denom = ((double)multiplier)*pow((double)BASE,(double)EXPON);
  double a, c=0.0, cm1=0.0, tm1=0.0;
  clock_t start = clock();
  n=0;  c=cm1=tm1=0.0;
  /* --- to smooth random fluctuations, do the same type of computation
         a large number of (N) times with different values --- */
  for (i=0; i<N; i++) {
    n++;
    a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
                                 significant digits, and its last non-zero
                                 digit is at (fixed-point) position 53. */
    if ( isexp ) c += exp(a); /* turn this off to time expm1() alone */
    if ( isexpm1 ) {          /* you can turn this off to time exp() alone, */
      cm1 += expm1(a);        /* but difference is negligible */
      tm1 += taylorm1(a); }
    } /* --- end-of-for(i) --- */
  int nticks = (int)(clock()-start);
  printf ("N=%d, denom=%dx%d^%d, Clock time: %d (%.2f secs)\n",
         n, multiplier,BASE,EXPON,
         nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
  printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
           c,c-(double)n,cm1,tm1);
  return 0;
  } /* --- end-of-function main() --- */

Compile and run it as test to reproduce OP's 0x2000... scenario, or run it with (up to three) optional args test #trials multiplier timeexp where #trials defaults to the OP's 1000000, and multipler defaults to 2 for the OP's 2x16^13 (change it to 4, etc, for her other tests). For the last arg, timeexp, enter a 0 to do only the expm1() (and my unnecessary taylor-like) calculation. The point of that is to show that the bad-timing-cases displayed by the OP disappear with expm1(), which takes "no time at all" regardless of multiplier.

So default runs, test and test 1000000 4, produce (okay, I called the program rounding)...

bash-4.3$ ./rounding 
N=1000000, denom=2x16^13, Clock time: 11155070 (11.16 secs)
         c=1.00000000000000023283e+06,
         c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 4
N=1000000, denom=4x16^13, Clock time: 200211 (0.20 secs)
         c=1.00000000000000011642e+06,
         c-n=1.164153e-10, cm1=5.680083e-08, tm1=5.680083e-08

So the first thing you'll note is that the OP's c-n using exp() differs substantially from both cm1==tm1 using expm1() and my taylor approx. If you reduce N they come into agreement, as follows...

N=10, denom=2x16^13, Clock time: 941 (0.00 secs)
         c=1.00000000000007140954e+01,
         c-n=7.140954e-13, cm1=7.127632e-13, tm1=7.127632e-13
bash-4.3$ ./rounding 100
N=100, denom=2x16^13, Clock time: 5506 (0.01 secs)
         c=1.00000000000010103918e+02,
         c-n=1.010392e-11, cm1=1.008393e-11, tm1=1.008393e-11
bash-4.3$ ./rounding 1000
N=1000, denom=2x16^13, Clock time: 44196 (0.04 secs)
         c=1.00000000000011345946e+03,
         c-n=1.134595e-10, cm1=1.140730e-10, tm1=1.140730e-10
bash-4.3$ ./rounding 10000
N=10000, denom=2x16^13, Clock time: 227215 (0.23 secs)
         c=1.00000000000002328306e+04,
         c-n=2.328306e-10, cm1=1.131288e-09, tm1=1.131288e-09
bash-4.3$ ./rounding 100000
N=100000, denom=2x16^13, Clock time: 1206348 (1.21 secs)
         c=1.00000000000000232831e+05,
         c-n=2.328306e-10, cm1=1.133611e-08, tm1=1.133611e-08

And as far as timing of exp() versus expm1() is concerned, see for yourself...

bash-4.3$ ./rounding 1000000 2  
N=1000000, denom=2x16^13, Clock time: 11168388 (11.17 secs)
         c=1.00000000000000023283e+06,
         c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 2 0
N=1000000, denom=2x16^13, Clock time: 24064 (0.02 secs)
         c=0.00000000000000000000e+00,
         c-n=-1.000000e+06, cm1=1.136017e-07, tm1=1.136017e-07

Question: you'll note that once the exp() calculation reaches N=10000 trials, its sum remains constant regardless of larger N. Not sure why that would be happening.

>>__SECOND EDIT__<<

Okay, @EOF , "you made me look" with your "heirarchical accumulation" comment. And that indeed works to bring the exp() sum closer (much closer) to the (presumably correct) expm1() sum. The modified code's immediately below followed by a discussion. But one discussion note here: recall multiplier from above. That's gone, and in its same place is expon so that denominator is now 2^expon where the default is 53, matching OP's default (and I believe better matching how she was thinking about it). Okay, and here's the code...

/* https://stackoverflow.com/questions/44346371/
   i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 2                /*denominator=2^EXPON, 2^53=2x16^13 default */
#define EXPON 53
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/

int main (int argc, char *argv[]) {
  int N          = (argc>1?atoi(argv[1]):1e6),
      expon      = (argc>2?atoi(argv[2]):EXPON),
      isexp      = (argc>3?atoi(argv[3]):1), /* flags to turn on/off exp() */
      ncparts    = (argc>4?atoi(argv[4]):1), /* #partial sums for c */
      binsize    = (argc>5?atoi(argv[5]):10);/* #doubles to sum in each bin */
  int isexpm1    = 1;                        /* and expm1() for timing tests*/
  int i, n=0;
  double denom = pow((double)BASE,(double)expon);
  double a, c=0.0, cm1=0.0, tm1=0.0;
  double csums[10], cbins[10][65537]; /* c partial sums and heirarchy */
  int nbins[10], ibin=0;      /* start at lowest level */
  clock_t start = clock();
  n=0;  c=cm1=tm1=0.0;
  if ( ncparts > 65536 ) ncparts=65536;  /* array size check */
  if ( ncparts > 1 ) for(i=0;i<ncparts;i++) cbins[0][i]=0.0; /*init bin#0*/
  /* --- to smooth random fluctuations, do the same type of computation
         a large number of (N) times with different values --- */
  for (i=0; i<N; i++) {
    n++;
    a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
                                 significant digits, and its last non-zero
                                 digit is at (fixed-point) position 53. */
    if ( isexp ) {            /* turn this off to time expm1() alone */
      double expa = exp(a);   /* exp(a) */
      c += expa;              /* just accumulate in a single "bin" */
      if ( ncparts > 1 ) cbins[0][n%ncparts] += expa; } /* accum in ncparts */
    if ( isexpm1 ) {          /* you can turn this off to time exp() alone, */
      cm1 += expm1(a);        /* but difference is negligible */
      tm1 += taylorm1(a); }
    } /* --- end-of-for(i) --- */
  int nticks = (int)(clock()-start);
  if ( ncparts > 1 ) {        /* need to sum the partial-sum bins */
    nbins[ibin=0] = ncparts;  /* lowest-level has everything */
    while ( nbins[ibin] > binsize ) { /* need another heirarchy level */
      if ( ibin >= 9 ) break; /* no more bins */
      ibin++;                 /* next available heirarchy bin level */
      nbins[ibin] = (nbins[ibin-1]+(binsize-1))/binsize; /*#bins this level*/
      for(i=0;i<nbins[ibin];i++) cbins[ibin][i]=0.0; /* init bins */
      for(i=0;i<nbins[ibin-1];i++) {
        cbins[ibin][(i+1)%nbins[ibin]] += cbins[ibin-1][i]; /*accum in nbins*/
        csums[ibin-1] += cbins[ibin-1][i]; } /* accumulate in "one bin" */
      } /* --- end-of-while(nprevbins>binsize) --- */
    for(i=0;i<nbins[ibin];i++) csums[ibin] += cbins[ibin][i]; /*highest level*/
    } /* --- end-of-if(ncparts>1) --- */
  printf ("N=%d, denom=%d^%d, Clock time: %d (%.2f secs)\n", n, BASE,expon,
         nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
  printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
           c,c-(double)n,cm1,tm1);
  if ( ncparts > 1 ) { printf("\t binsize=%d...\n",binsize);
    for (i=0;i<=ibin;i++) /* display heirarchy */
      printf("\t level#%d: #bins=%5d, c-n=%e\n",
      i,nbins[i],csums[i]-(double)n); }
  return 0;
  } /* --- end-of-function main() --- */

Okay, and now you can notice two additional command-line args following the old timeexp. They are ncparts for the initial number of bins into which the entire #trials will be distributed. So at the lowest level of the heirarchy, each bin should (modulo bugs:) have the sum of #trials/ncparts doubles. The argument after that is binsize, which will be the number of doubles summed in each bin at every successive level, until the last level has fewer (or equal) #bins as binsize. So here's an example dividing 1000000 trials into 50000 bins, meaning 20doubles/bin at the lowest level, and 5doubles/bin thereafter...

bash-4.3$ ./rounding 1000000 53 1 50000 5 
N=1000000, denom=2^53, Clock time: 11129803 (11.13 secs)
         c=1.00000000000000465661e+06,
         c-n=4.656613e-09, cm1=1.136017e-07, tm1=1.136017e-07
         binsize=5...
         level#0: #bins=50000, c-n=4.656613e-09
         level#1: #bins=10002, c-n=1.734588e-08
         level#2: #bins= 2002, c-n=7.974450e-08
         level#3: #bins=  402, c-n=1.059379e-07
         level#4: #bins=   82, c-n=1.133885e-07
         level#5: #bins=   18, c-n=1.136214e-07
         level#6: #bins=    5, c-n=1.138542e-07

Note how the c-n for exp() converges pretty nicely towards the expm1() value. But note how it's best at level#5, and isn't converging uniformly at all. And note if you break the #trials into only 5000 initial bins, you get just as good a result,

bash-4.3$ ./rounding 1000000 53 1 5000 5
N=1000000, denom=2^53, Clock time: 11165924 (11.17 secs)
         c=1.00000000000003527384e+06,
         c-n=3.527384e-08, cm1=1.136017e-07, tm1=1.136017e-07
         binsize=5...
         level#0: #bins= 5000, c-n=3.527384e-08
         level#1: #bins= 1002, c-n=1.164153e-07
         level#2: #bins=  202, c-n=1.158332e-07
         level#3: #bins=   42, c-n=1.136214e-07
         level#4: #bins=   10, c-n=1.137378e-07
         level#5: #bins=    4, c-n=1.136214e-07

In fact, playing with ncparts and binsize doesn't seem to show much sensitivity, and it's not always "more is better" (i.e., less for binsize) either. So I'm not sure exactly what's going on. Could be a bug (or two), or could be yet another question for @EOF ...???

>>EDIT -- example showing pair addition "binary tree" heirarchy<<

Example below added as per @EOF 's comment (Note: re-copy preceding code. I had to edit nbins[ibin] calculation for each next level to nbins[ibin]=(nbins[ibin-1]+(binsize-1))/binsize; from nbins[ibin]=(nbins[ibin-1]+2*binsize)/binsize; which was "too conservative" to create ...16,8,4,2 sequence)

bash-4.3$ ./rounding 1024 53 1 512 2
N=1024, denom=2^53, Clock time: 36750 (0.04 secs)
         c=1.02400000000011573320e+03,
         c-n=1.157332e-10, cm1=1.164226e-10, tm1=1.164226e-10
         binsize=2...
         level#0: #bins=  512, c-n=1.159606e-10
         level#1: #bins=  256, c-n=1.166427e-10
         level#2: #bins=  128, c-n=1.166427e-10
         level#3: #bins=   64, c-n=1.161879e-10
         level#4: #bins=   32, c-n=1.166427e-10
         level#5: #bins=   16, c-n=1.166427e-10
         level#6: #bins=    8, c-n=1.166427e-10
         level#7: #bins=    4, c-n=1.166427e-10
         level#8: #bins=    2, c-n=1.164153e-10

>>EDIT -- to show @EOF's elegant solution in comment below<<

"Pair addition" can be elegantly accomplished recursively, as per @EOF's comment below, which I'm reproducing here. (Note case 0/1 at end-of-recursion to handle n even/odd.)

  /* Quoting from EOF's comment...
   What I (EOF) proposed is effectively a binary tree of additions:
   a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
   Like this: Add adjacent pairs of elements, this produces
   a new sequence of n/2 elements.
   Recurse until only one element is left.
   (Note that this will require n/2 elements of storage,
   rather than a fixed number of bins like your implementation) */
  double trecu(double *vals, double sum, int n) {
      int midn = n/2;
      switch (n) {
        case  0: break;
        case  1: sum += *vals; break;
        default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
      return(sum);
      }

You might want to name Horner's rule if you propose it. Also, I'm pretty sure `expm1()` already uses something like this, so implementing it manually is probably not going to be of much use, especially since the OP appears to want a more generally applicable function than `expm1()`. Taylor or Maclaurin polynomials are a possibility in general, but you'd have to be careful with the radius of convergence. — EOF, Jun 06 '17 at 20:13
Agreed. Often it is looking at the underlying formula, rather than any single operation, that yields best rewards. For example, in molecular dynamics, classical potential models tend to look pretty much horrific, so they tend to be approximated using many easy cubic splines instead. Another example I often see is first computing some value of type `r = hard(x,y,...) / alsohard(x,y,...);` and then proceed to test `r >= 0.0 && r <= 1.0`. At minimum, you can trade the divison for a conditional sign check (of `alsohard(x,y,...)`). — Nominal Animal, Jun 06 '17 at 21:24
@NominalAnimal Okay, right, I was (incorrectly, it seems) focusing on your particular comment to Nancy-N above, "strive for precision and not speed". And I mistakenly concluded she was just trying to get a quick-and-accurate-enough approximation for her small args. Re-reading her original post, I see that doesn't seem to be her primary intent. (So I'd "delete" this answer, but wouldn't want to lose both your and EOF's interesting followup comments.) — John Forkosh, Jun 07 '17 at 01:34
@EOF Thanks for the "Horner's rule" reference. It rang a (somewhat distant) bell when I read it, so I've probably come across it in school or on-some-job afterwards, but I wasn't aware that's what I was doing until I googled it. Anyway, this particular case is so trivial that it barely (if at all) merits "rule". And, yeah, as mentioned to NominalAnimal above, I don't seem to have addressed her principal concern. P.S. You sure expm1() uses something like that? I haven't tested it, but assume it works for larger args, where you couldn't just truncate the series after a few terms. — John Forkosh, Jun 07 '17 at 01:45
`(double)multiplier)*pow((double)BASE,(double)EXPON)` all the casts here are redundant. `multiplier*pow(BASE, EXPON)` is much more readable — phuclv, Jun 07 '17 at 05:07
The answer to the question at the end of your answer is a *direct consequence* of floating-point arithmetic: You accumulate over a sequence of `1.0 + x`, where `abs(x)` is much smaller than `1`. Now, of you *multiplied* instead, you'd keep most of the significant bits of x in `(1.0 + x)*n - n =approx= n*x`. But since you accumulate linearly (`nextpartialsum = previouspartialsum + (1+xi)`, at some point `previouspartialsum` is so much larger than `1+xi` that you lose *all of the bits* of `xi`. Eventually, you'd lose the significant bits of `1` too btw., but that takes *a lot* more summing. — EOF, Jun 07 '17 at 05:36
Summing over floating-point sequences is non-trivial. You can instead sum hierarchically (sum over pairs, then pairs of pairs, ...), do Kahan summation or use a few other schemes. — EOF, Jun 07 '17 at 05:42
@EOF Thanks for the explanation, none of which (including Kahan summation) I was previously aware of, but maybe should have been. I haven't come across much **1+epsilon**-type stuff in this particular kind of situation, but https://en.wikipedia.org/wiki/Kahan_summation_algorithm is something I might have used on several occasions if I'd known about it. (P.S. looks like you were also right about expm1(), judging from its negligible timing.) — John Forkosh, Jun 07 '17 at 06:21
@LưuVĩnhPhúc Don't worry, doesn't have to be readable -- nobody's ever going to be reading it:). But you might wanna be careful about that **pow(BASE,EXPON)** part. Passing int arguments to a function expecting doubles might get you into trouble. Not sure if maybe the compiler knows all about its mlib functions, and will clean up your mess for you, but am pretty sure it's not a good habit to get into. — John Forkosh, Jun 07 '17 at 06:28
@JohnForkosh there's no problem whatsoever because int is always promoted to double in that case — phuclv, Jun 07 '17 at 07:09
@EOF I've added a second edit demonstrating code to exercise your "sum heirarchically" comment. Seems to work pretty well, as noted in the edit. As also noted there, I didn't exhaustively debug it to QA standards, i.e., as per gpl, no "warranty of merchantability":) — John Forkosh, Jun 07 '17 at 10:02
@JohnForkosh: Your code is now completely unreadable. Also, you're not doing what I proposed. You are effectively computing n partial sums and adding them in the end. This is better than making a single linear sum, but not much. What I proposed is effectively a binary tree of additions: `a+b+c+d+e+f+g+h` as `((a+b)+(c+d))+((e+f)+(g+h))`. Like this: Add adjacent pairs of elements, this produces a new sequence of n/2 elements. Recurse until only one element is left. Note that this will require n/2 elements of storage, rather than a fixed number of bins like your implementation. — EOF, Jun 07 '17 at 13:19
But if you're careful about scheduling your additions, you might get away with `O(log(n))` extra storage. — EOF, Jun 07 '17 at 13:21
@EOF Sorry about "unreadable". But it can do what you proposed, and I added an edit illustrating that. Problem is that #trials has to be very small to accommodate available static buffers. Of course I could do what you suggested, or devise a better scheme along my lines (e.g., really only need two buffer levels that can be "swapped"). But the whole thing's just a throw-away test/demo, which I don't want to turn into a project. Just wanted to get a more concrete handle on the effect(s) of your very interesting suggestions. — John Forkosh, Jun 07 '17 at 21:45
@JohnForkosh Why not simply `static double trecu(double const *vals, double accu, size_t n) { size_t half = n/2; switch (n) { case 0: return accu; case 1: return accu + *vals; default: return trecu(vals + half, trecu(vals, accu, half), n - half); } }`, to be called initially with `accu = 0`? Of course you can replace the `double *vals` accesses with function calls, but it's a manual change due to the lack of rages/iterators in C. — EOF, Jun 07 '17 at 21:52
@EOF Yeah, that's **terrific!!!** I added **yet another** edit to cut-and-paste your method. I guess I should've picked up on that from your earlier comment, comprising its "detailed design". — John Forkosh, Jun 07 '17 at 22:24
@EOF Uh-oh!!!:) I built your trecu() into the test program, and it does the sum, but it's not doing the "binary tree" type of summation. Just a "plain sum". I wrote a whole separate answer immediately below this one with a shorter (and hopefully less "unreadable":) test program to explicitly exercise trecu() and illustrate its behavior. And that indeed reproduces what I saw with this test program based on the OP's code. Please take a look when you have a chance, and let me ("us" if anybody else is still following:) know what you think. Thanks. — John Forkosh, Jun 08 '17 at 07:17

John Forkosh · Answer 3 · 2017-06-08T07:08:03.627

This is an "answer"/followup to EOF's preceding comments re his trecu() algorithm and code for his "binary tree summation" suggestion. "Prerequisites" before reading this are reading that discussion. It would be nice to collect all that in one organized place, but I haven't done that yet...

...What I did do was build EOF's trecu() into the test program from the preceding answer that I'd written by modifying the OP's original test program. But then I found that trecu() generated exactly (and I mean exactly) the same answer as the "plain sum" c using exp(), not the sum cm1 using expm1() that we'd expected from a more accurate binary tree summation.

But that test program's a bit (maybe two bits:) "convoluted" (or, as EOF said, "unreadable"), so I wrote a separate smaller test program, given below (with example runs and discussion below that), to separately test/exercise trecu(). Moreover, I also wrote function bintreesum() into the code below, which abstracts/encapsulates the iterative code for binary tree summation that I'd embedded into the preceding test program. In that preceding case, my iterative code indeed came close to the cm1 answer, which is why I'd expected EOF's recursive trecu() to do the same. Long-and-short of it is that, below, same thing happens -- bintreesum() remains close to correct answer, while trecu() gets further away, exactly reproducing the "plain sum".

What we're summing below is just sum(i),i=1...n, which is just the well-known n(n+1)/2. But that's not quite right -- to reproduce OP's problem, summand is not sum(i) alone but rather sum(1+i*10^(-e)), where e can be given on the command-line. So for, say, n=5, you don't get 15 but rather 5.000...00015, or for n=6 you get 6.000...00021, etc. And to avoid a long, long format, I printf() sum-n to remove that integer part. Okay??? So here's the code...

/* Quoting from EOF's comment...
   What I (EOF) proposed is effectively a binary tree of additions:
   a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
   Like this: Add adjacent pairs of elements, this produces
   a new sequence of n/2 elements.
   Recurse until only one element is left. */
#include <stdio.h>
#include <stdlib.h>

double trecu(double *vals, double sum, int n) {
  int midn = n/2;
  switch (n) {
    case  0: break;
    case  1: sum += *vals; break;
    default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
  return(sum);
  } /* --- end-of-function trecu() --- */

double bintreesum(double *vals, int n, int binsize) {
  double binsum = 0.0;
  int nbin0 = (n+(binsize-1))/binsize,
      nbin1 = (nbin0+(binsize-1))/binsize,
      nbins[2] = { nbin0, nbin1 };
  double *vbins[2] = {
            (double *)malloc(nbin0*sizeof(double)),
            (double *)malloc(nbin1*sizeof(double)) },
         *vbin0=vbins[0], *vbin1=vbins[1];
  int ibin=0, i;
  for ( i=0; i<nbin0; i++ ) vbin0[i] = 0.0;
  for ( i=0; i<n; i++ ) vbin0[i%nbin0] += vals[i];
  while ( nbins[ibin] > 1 ) {
    int jbin = 1-ibin;        /* other bin, 0<-->1 */
    nbins[jbin] = (nbins[ibin]+(binsize-1))/binsize;
    for ( i=0; i<nbins[jbin]; i++ ) vbins[jbin][i] = 0.0;
    for ( i=0; i<nbins[ibin]; i++ )
      vbins[jbin][i%nbins[jbin]] += vbins[ibin][i];
    ibin = jbin;              /* swap bins for next pass */
    } /* --- end-of-while(nbins[ibin]>0) --- */
  binsum = vbins[ibin][0];
  free((void *)vbins[0]);  free((void *)vbins[1]);
  return ( binsum );
  } /* --- end-of-function bintreesum() --- */

#if defined(TESTTRECU)
#include <math.h>
#define MAXN (2000000)
int main(int argc, char *argv[]) {
  int N       = (argc>1? atoi(argv[1]) : 1000000 ),
      e       = (argc>2? atoi(argv[2]) : -10 ),
      binsize = (argc>3? atoi(argv[3]) : 2 );
  double tens = pow(10.0,(double)e);
  double *vals = (double *)malloc(sizeof(double)*MAXN),
         sum = 0.0;
  double trecu(), bintreesum();
  int i;
  if ( N > MAXN ) N=MAXN;
  for ( i=0; i<N; i++ ) vals[i] = 1.0 + tens*(double)(i+1);
  for ( i=0; i<N; i++ ) sum += vals[i];
  printf(" N=%d, Sum_i=1^N {1.0 + i*%.1e} - N  =  %.8e,\n"
         "\t plain_sum-N  = %.8e,\n"
         "\t trecu-N      = %.8e,\n"
         "\t bintreesum-N = %.8e \n",
         N, tens, tens*((double)N)*((double)(N+1))/2.0,
          sum-(double)N,
         trecu(vals,0.0,N)-(double)N,
         bintreesum(vals,N,binsize)-(double)N );
  } /* --- end-of-function main() --- */
#endif

So if you save that as trecu.c, then compile it as cc DTESTTRECU trecu.c lm o trecu And then run with zero to three optional command-line args as trecu #trials e binsize Defaults are #trials=1000000 (like OP's program), e=10, and binsize=2 (for my bintreesum() function to do a binary-tree sum rather than larger-size bins).

And here are some test results illustrating the problem described above,

bash-4.3$ ./trecu              
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-10} - N  =  5.00000500e+01,
         plain_sum-N  = 5.00000500e+01,
         trecu-N      = 5.00000500e+01,
         bintreesum-N = 5.00000500e+01 
bash-4.3$ ./trecu 1000000 -15
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-15} - N  =  5.00000500e-04,
         plain_sum-N  = 5.01087168e-04,
         trecu-N      = 5.01087168e-04,
         bintreesum-N = 5.00000548e-04 
bash-4.3$ 
bash-4.3$ ./trecu 1000000 -16
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-16} - N  =  5.00000500e-05,
         plain_sum-N  = 6.67552231e-05,
         trecu-N      = 6.67552231e-05,
         bintreesum-N = 5.00001479e-05 
bash-4.3$ 
bash-4.3$ ./trecu 1000000 -17
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-17} - N  =  5.00000500e-06,
         plain_sum-N  = 0.00000000e+00,
         trecu-N      = 0.00000000e+00,
         bintreesum-N = 4.99992166e-06

So you can see that for the default run, e=10, everybody's doing everything right. That is, the top line that says "Sum" just does the n(n+1)/2 thing, so presumably displays the right answer. And everybody below that agrees for the default e=10 test case. But for the e=15 and e=16 cases below that, trecu() exactly agrees with the plain_sum, while bintreesum stays pretty close to the right answer. And finally, for e=17, plain_sum and trecu() have "disappeared", while bintreesum()'s still hanging in there pretty well.

So trecu()'s correctly doing the sum all right, but its recursion's apparently not doing that "binary tree" type of thing that my more straightforward iterative bintreesum()'s apparently doing correctly. And that indeed demonstrates that EOF's suggestion for "binary tree summation" realizes quite an improvement over the plain_sum for these 1+epsilon kind of cases. So we'd really like to see his trecu() recursion work!!! When I originally looked at it, I thought it did work. But that double-recursion (is there a special name for that?) in his default: case is apparently more confusing (at least to me:) than I thought. Like I said, it is doing the sum, but not the "binary tree" thing.

Okay, so who'd like to take on the challenge and explain what's going on in that trecu() recursion? And, maybe more importantly, fix it so it does what's intended. Thanks.

Yes, I was trying to be too clever. I mistakenly thought that the function had to be tail-recursive to ensure `O(log(n))` stack space to avoid stack overflow. Actually, the even simpler function `double foo(double const *vals, size_t n) { if (n < 2) return n ? *vals : 0; size_t half = n/2; return foo(vals, half) + foo(vals + half, n - half); }` does the pairwise sum correctly, while using reasonable stack space (I believe it is `O(log(n))` ). Anyway, we're hijacking this question, so I'll stop here. — EOF, Jun 08 '17 at 18:20
@EOF Thanks, that indeed works as advertised. And, as you say, it's easy to read, too. — John Forkosh, Jun 08 '17 at 19:22

I do *not* want correct rounding for function exp

3 Answers3

To answer the general question on why the library functions are required to give correctly rounded results:

In many cases, specific knowledge about floating point algorithms can produce considerable improvements to accuracy and/or performance, like in the testcase:

Programmers with advanced knowledge about the accompanying trade-offs may sometimes consider using approximate results where the precision is not critical

I do not want correct rounding for function exp