
I was reading through Agner Fog's instruction tables for x86 and x87 and noticed that there is no instruction for arcsin or arccos, only for arctan. So I googled it, and all the results implement `acos` and `asin` in terms of `atan` and `sqrt`. That would suggest `acos` and `asin` should be significantly slower than `atan`, because they need an additional `sqrt`. But I wrote a simple test program in C++, and `acos` and `asin` are both faster than `atan`:

#include <chrono>
#include <cmath>
#include <iostream>

class timer {
    private:
        decltype(std::chrono::high_resolution_clock::now()) begin, end;

    public:
        void
        start() {
            begin = std::chrono::high_resolution_clock::now();
        }

        void
        stop() {
            end = std::chrono::high_resolution_clock::now();
        }

        template<typename T>
        auto
        duration() const {
            return std::chrono::duration_cast<T>(end - begin).count();
        }

        auto
        nanoseconds() const {
            return duration<std::chrono::nanoseconds>();
        }

        void
        printNS(char const* str) const {
            std::cout << str << ": " << nanoseconds() << std::endl;
        }
};

int
main() {
    timer timer;

    double p1 = 0 + 0.000000001;
    double acc1{1};

    timer.start();
    //less than 8 seconds
    for(int i{0}; 200000000 > i; ++i) {
        acc1 += std::acos(i * p1);
    }
    timer.stop();
    timer.printNS("acos");

    timer.start();
    //less than 8 seconds
    for(int i{0}; 200000000 > i; ++i) {
        acc1 += std::asin(i * p1);
    }
    timer.stop();
    timer.printNS("asin");

    timer.start();
    //more than 12 seconds
    for(int i{0}; 200000000 > i; ++i) {
        acc1 += std::atan(i * p1);
    }
    timer.stop();
    timer.printNS("atan");

    timer.start();
    //almost 20 seconds
    for(int i{0}; 200000000 > i; ++i) {
        acc1 += std::atan2(i * p1, i * p1);
    }
    timer.stop();
    timer.printNS("atan2");

    std::cout << acc1 << '\n';
}

I've tried looking at the assembly on Godbolt, but it doesn't inline `acos` or `asin`.

So how are they actually implemented? And if they really are built on `atan`, how can they be faster than `atan` itself?

    Modern math libraries do not use the x87 instructions as they are slower than direct implementations. That said, an `fsqrt` is quite fast with only 4 cycles on modern processors, so it doesn't matter too much. – fuz Aug 06 '19 at 01:02
  • @fuz: `fsqrt` is not a "complex" microcoded x87 instruction. It's one of the "basic" operations along with div/mul/add/sub that even SSE/AVX implement (`sqrtsd`), and that are required to have <= 0.5ulp error (i.e. correctly rounded) unlike trig / exp / log. It's also a single uop, unlike ~100 uops for x87 instructions like `fsin`. https://agner.org/optimize/. – Peter Cordes Aug 06 '19 at 01:08
  • @harold: yes, best-case `fsqrt` throughput is one per 4 to 7 cycles on Skylake and Ryzen. vs. SSE `sqrtsd` of 4-6 cycles throughput on Skylake or `sqrtss` at 3 cycles. Much worse on older CPUs. – Peter Cordes Aug 06 '19 at 01:12
  • maybe a duplicate of [How does C compute sin() and other math functions?](//stackoverflow.com/q/2284860). If you want to see the asm you actually microbenched, single-step *into* one of those function calls with a debugger, obviously. You haven't told us what OS, compiler, or CPU microarchitecture you're using. Different OSes have different math-library implementations of complex functions. – Peter Cordes Aug 06 '19 at 01:16
  • @PeterCordes Where did I say that fsqrt is complex or microcoded? – fuz Aug 06 '19 at 01:19
  • I'm voting to close this question as off-topic because it really needs OS, exact CPU, compiler, libraries used, etc. – Joshua Aug 06 '19 at 01:19
  • Sorry, I was testing it on an AMD Athlon 64 X2, admittedly quite old, but I was assuming that the basic instruction sets hadn't changed much within the last years. Running on Ubuntu 18.04 compiled with g++. I will try stepping in with a debugger now, maybe I will understand it. – Michael Mahn Aug 06 '19 at 01:34
  • The instruction set *hasn't* changed much since your CPU was new. SSE2 is baseline for x86-64 so math libraries can always assume it. In fact SSE2 is the standard way for doing scalar math on x86-64. The fact that x87 instructions still exist doesn't mean they're still used. – Peter Cordes Aug 06 '19 at 01:47
  • This question might be a duplicate, but it's certainly not a duplicate of any of the *three* questions linked above, none of which deal with inverse trig functions. Also nearly all of the answers to the linked questions are incorrect -- no actual production C++ implementation does sine and cosine with raw Taylor series, for instance, despite what most answers to the second question claim. – Daniel McLaury Aug 06 '19 at 01:54
  • http://www.netlib.org/fdlibm/k_sin.c -- Note how the coefficients are not quite equal to 1/3!, 1/5!, 1/7!, etc. This says that they are not Taylor series coefficients. – Rick James Aug 14 '19 at 20:42

0 Answers