Why is atan first call much slower than the next ones?

Question

The following code demonstrates that atan computation time can vary a lot:

#include <cstdio>
#include <cstdlib>
#include <cmath>

#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>

double get_time()
{
    struct timeval t;
    struct timezone tzp;
    gettimeofday(&t, &tzp);
    return t.tv_sec + t.tv_usec*1e-6;
}

int main() {
    double worst_time = 0.0;
    double best_time = 1e6;

    volatile double x0 = -M_PI/2.0;
    volatile double foo = atan(x0); // SLOW CALL HERE
    volatile double sum = 0.0; // volatile to avoid having tan() call optimized away
    for (double x = x0; x < M_PI/3.0; x += 0.1) {
        volatile double y = x;
        const double start = get_time();
        asm volatile ("":::"memory"); // avoid reordering in -O3
        const double value = atan(y);
        asm volatile ("":::"memory"); // avoid reordering
        const double end = get_time();
        sum += value;

        const double delta = end - start;
        if (delta > worst_time) {
            worst_time = delta;
        }
        if (delta < best_time) {
            best_time = delta;
        }
        printf("* %f (value: %f)\n", delta, y);
    }

    printf("%f / %f\n", worst_time, best_time);

    printf("%f\n", foo);
}

From my machine worst time is around 15us whereas the best time is 0 (too small to be measured).

The average time (not displayed here) on my machine is around 1 or 2 us.

I tried different compilation flags (-O3, linking statically to libm, etc.) but I cannot find what causes the worst time to be much slower. Any idea?

edit: I am using Ubuntu 14.04 - gcc 4.8.4

edit2: replace atan2 by atan. I am not interested by the fact that atan2 is defined piece-wise and different branches may take different times. I am interested in eliminating the outliers which can appear even if atan is called instead of atan2.

edit3:

* 0.000015 (value: -1.570796)
* 0.000000 (value: -1.470796)
* 0.000001 (value: -1.370796)
* 0.000001 (value: -1.270796)
* 0.000000 (value: -1.170796)
* 0.000002 (value: -1.070796)
* 0.000000 (value: -0.970796)
* 0.000001 (value: -0.870796)
* 0.000000 (value: -0.770796)
* 0.000000 (value: -0.670796)
* 0.000001 (value: -0.570796)
* 0.000000 (value: -0.470796)
* 0.000003 (value: -0.370796)
* 0.000001 (value: -0.270796)
* 0.000000 (value: -0.170796)
* 0.000000 (value: -0.070796)
* 0.000001 (value: 0.029204)
* 0.000000 (value: 0.129204)
* 0.000002 (value: 0.229204)
* 0.000001 (value: 0.329204)
* 0.000000 (value: 0.429204)
* 0.000001 (value: 0.529204)
* 0.000001 (value: 0.629204)
* 0.000001 (value: 0.729204)
* 0.000001 (value: 0.829204)
* 0.000001 (value: 0.929204)
* 0.000000 (value: 1.029204)
0.000015 / 0.000000 / 0.000001

edit4:

It appears that the first call is the culprit! The call outside the loop was optimized away by the compiler, if we force atan to be evaluated outside of the loop for x0, all the calls are reasonably fast...

* 0.000000 (value: -1.570796)
* 0.000001 (value: -1.470796)
* 0.000000 (value: -1.370796)
* 0.000002 (value: -1.270796)
* 0.000001 (value: -1.170796)
* 0.000001 (value: -1.070796)
* 0.000000 (value: -0.970796)
* 0.000000 (value: -0.870796)
* 0.000000 (value: -0.770796)
* 0.000001 (value: -0.670796)
* 0.000000 (value: -0.570796)
* 0.000000 (value: -0.470796)
* 0.000006 (value: -0.370796)
* 0.000001 (value: -0.270796)
* 0.000002 (value: -0.170796)
* 0.000001 (value: -0.070796)
* 0.000000 (value: 0.029204)
* 0.000001 (value: 0.129204)
* 0.000003 (value: 0.229204)
* 0.000000 (value: 0.329204)
* 0.000000 (value: 0.429204)
* 0.000000 (value: 0.529204)
* 0.000001 (value: 0.629204)
* 0.000000 (value: 0.729204)
* 0.000000 (value: 0.829204)
* 0.000000 (value: 0.929204)
* 0.000000 (value: 1.029204)
0.000006 / 0.000000

https://ideone.com/vtUuE6

Run the atan in a loop, measuring the total time per value, to offset time accuracy limitations. Also look at a graph of the time/value to see if the 'min/max' is not particularly misleading. — user2864740, Jul 17 '15 at 03:26
Interesting reading wrt. measuring such: http://stackoverflow.com/questions/88/is-gettimeofday-guaranteed-to-be-of-microsecond-resolution — user2864740, Jul 17 '15 at 03:35
FYI, on Arch with glibc 2.21, `-O3` with gcc 5.1.0 leads to no difference (same with `-O1` actually), but I do see a difference with clang 3.6.1 after lowering the x step to `0.01`. — BenC, Jul 17 '15 at 04:08
You can't measure performance that way. Replace `sum += atan2(x, y);` with `;` and you'll see much the same. — David Hammen, Jul 17 '15 at 05:07
@DavidHammen My bad, I did not realize the bench was without -O3, the new version exhibing the same behavior is with -O3. I just had to mark the variable volatile. — Thomas Moulard, Jul 17 '15 at 05:13
Answering you own post is good in SO and even accepting it, but the answer should be in an answer and not an edited question. By accepting an answer, even your own if the best way to show this post is answered. — chux - Reinstate Monica, Jul 23 '15 at 03:19

score 1 · Accepted Answer · answered Jul 24 '15 at 02:07

1

The timing difference is actually caused by pages faults (!). The first time the function is called, the page containing atan2 code is accessed and a page fault occurs. Using mlockall() should improve the situation.

answered Jul 24 '15 at 02:07

Thomas Moulard

5,151
2
21
28

score 0 · Answer 2 · answered Jul 17 '15 at 03:14

0

atan2 is a piecewise function, ie for certain values/ranges of values it performs different operations, some of which are just returning a constant value, which is quite fast, but others involve actual trigonometric calculations, which can take quite some time. If you want the particulars they are available at https://en.wikipedia.org/wiki/Atan2

answered Jul 17 '15 at 03:14

joshbooks

489
3
8

Sorry I edited my question as I am not interested in this direction. This problem happens with atan two which is not piecewise so this is cannot only be due to this issue. Also if you modify the x range to match an interval of atan2 where atan2 is always defined in the same way the problem still exists. – Thomas Moulard Jul 17 '15 at 03:23
@ThomasMoulard Fundamental floating point ops can differ greatly in speed according to the magnitudes of the operands. This of course depends on properties of the hardware, but it's true for all I've tested. Since atan2 is implemented with atan, and atan requires many floating point ops, it isn't a surprise that run times vary. – Gene Jul 17 '15 at 03:40
@Gene Apparently it is not only that. I understand that floating point operation processing time can vary but here a pattern clearly appears (see updated post). And considering than I increase x monotonically and atan is monotonic too, one would have the impression that atan run time should vary smoothly... Here the noise due to the floating point operations taking different time seems to be around ~2us max so... – Thomas Moulard Jul 17 '15 at 03:47
@Gene apparently this pattern only appear with optimization disabled... I am not sure what is different when I compile with -O3 but it seems to make the problem disappear... – Thomas Moulard Jul 17 '15 at 04:18
@ThomasMoulard I don't suppose you feel interested enough to output assembly and see what the difference is, do you? – joshbooks Jul 17 '15 at 04:21
@redball: FYI, https://gcc.godbolt.org/ makes it easier to compare assembly output between different compilers. – BenC Jul 17 '15 at 04:30
@BenC: thanks, I hadn't heard about that, it looks like an interesting tool – joshbooks Jul 17 '15 at 05:14
I saw your edit after running a buch of trials, and yeah, it looks like all we're seeing is the first call cost of a dynamically linked library D: I feel silly now – joshbooks Jul 18 '15 at 07:34

Why is atan first call much slower than the next ones?

2 Answers2