
Test code:

#include <cmath>
#include <cstdio>

const int N = 4096;
const float PI = 3.1415926535897932384626;

float cosine[N][N];
float sine[N][N];

int main() {
    printf("a\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            cosine[i][j] = cos(i*j*2*PI/N);
            sine[i][j] = sin(-i*j*2*PI/N);
        }
    }
    printf("b\n");
}

Here is the time:

$ g++ main.cc -o main
$ time ./main
a
b

real    0m1.406s
user    0m1.370s
sys     0m0.030s

After adding `using namespace std;`, the time is:

$ g++ main.cc -o main
$ time ./main
a
b

real    0m8.743s
user    0m8.680s
sys     0m0.030s

Compiler:

$ g++ --version
g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2

Assembly:

Dump of assembler code for function sin@plt:                                    
0x0000000000400500 <+0>:     jmpq   *0x200b12(%rip)        # 0x601018 <_GLOBAL_OFFSET_TABLE_+48>
0x0000000000400506 <+6>:     pushq  $0x3                                     
0x000000000040050b <+11>:    jmpq   0x4004c0                                 
End of assembler dump.

Dump of assembler code for function std::sin(float):                            
0x0000000000400702 <+0>:     push   %rbp                                     
0x0000000000400703 <+1>:     mov    %rsp,%rbp                                
0x0000000000400706 <+4>:     sub    $0x10,%rsp                               
0x000000000040070a <+8>:     movss  %xmm0,-0x4(%rbp)                         
0x000000000040070f <+13>:    movss  -0x4(%rbp),%xmm0                         
0x0000000000400714 <+18>:    callq  0x400500 <sinf@plt>                      
0x0000000000400719 <+23>:    leaveq                                          
0x000000000040071a <+24>:    retq                                            
End of assembler dump.

Dump of assembler code for function sinf@plt:                                   
0x0000000000400500 <+0>:     jmpq   *0x200b12(%rip)        # 0x601018 <_GLOBAL_OFFSET_TABLE_+48>
0x0000000000400506 <+6>:     pushq  $0x3                                     
0x000000000040050b <+11>:    jmpq   0x4004c0                                 
End of assembler dump.
ornerylawn
  • @Nawaz: It might. It's an implementation detail whether `<cmath>` provides `double sin(double)` and `double cos(double)` in the global namespace. Ditto for `<cstdio>` and `printf`. – Ben Voigt Aug 07 '11 at 23:24
  • @Nawaz It does compile. It is my real coding. – ornerylawn Aug 07 '11 at 23:35
  • The easiest way to answer issues like this is by comparing the assembly output of the compiler. – Kerrek SB Aug 07 '11 at 23:41
  • @David: That had better not be the total definition of `std::cos`. See 26.8/8 and 26.8/9. (Also I believe that 26.8/4 can be interpreted to mean that these overloads must not be provided in the global namespace.) Or does D.5 require that they ARE available globally? It is a little confusing. – Ben Voigt Aug 07 '11 at 23:42
  • Funny thing: I tested here, and compiling without optimization is way faster than with -O3... Just changing his PI to M_PI made a lot of difference; I don't know why. – fbafelipe Aug 07 '11 at 23:59
  • @Ben Voight: I misread the question, I read `PI` definition as `double PI`... and just provided the overload that matched the `double` argument. In gcc `std::cos( float )` is defined as a call to `__builtin_cosf`, so the overload (and the implementation) will differ. – David Rodríguez - dribeas Aug 08 '11 at 07:38

4 Answers


You're using a different overload. Try:

        double angle = i*j*2*PI/N;
        cosine[i][j] = cos(angle);
        sine[i][j] = sin(angle);

It should perform the same with or without `using namespace std;`.

Ben Voigt
  • Your code works, but it runs fast with or without the namespace change. Why does the code I provided run much slower? – ornerylawn Aug 07 '11 at 23:39
  • @Ryan: Because my code always calls `double sin(double)`. Your original code calls either `double sin(double)` from the global scope, or `float sin(float)` from `namespace std`. Modern FPUs are optimized for operations on doubles. – Ben Voigt Aug 07 '11 at 23:40
  • Added some assembly, does your conclusion still hold? (I'm no assembly ninja) – ornerylawn Aug 08 '11 at 00:03
  • @Ryan: What would be more interesting is the assembly listing for your code (especially the part inside the loop) – Ben Voigt Aug 08 '11 at 00:06
  • The only difference is a conversion from float to double. But is it really the use of float that hurts? Could it be the extra function call? – ornerylawn Aug 08 '11 at 00:33
  • @BenVoigt Do you have measurements supporting the "Modern FPUs are optimized for operations on doubles." argument? On my laptop it's quite the opposite: std::sin is consistently 2.5x slower on double than on float. – Olivier Sohn Jul 05 '18 at 16:53
  • @OlivierSohn: You almost certainly are compiling floating-point math to SSE (if Intel-arch) or NEON (if ARM) instructions and not using your FPU. And if you benchmarked on a GPU (say using CUDA compiler) you'd probably find an 8:1 advantage for `float`. – Ben Voigt Aug 01 '22 at 21:20
  • @BenVoigt That makes no sense to me. All code compiled for x64 uses sse instructions. What's the fpu that doesn't handle sse instructions on floating point numbers and why would it be faster on doubles than on floats? – Bas Dec 17 '22 at 15:31
  • @Bas: It is certainly not true that all x64 instructions are SSE. The x87 (once upon a time a co-processor, now integrated) instructions are optimized for long-double precision, while SSE instructions are optimized for single-precision vectors. See for example https://stackoverflow.com/a/58481776/103167 Most compilers require you to specify which set of floating-point instructions to use, because mixing them comes with a tremendous performance penalty. – Ben Voigt Dec 18 '22 at 23:54
  • @BenVoigt ah cool thanks. I didn't know you could use x87 instructions in x64 mode. – Bas Dec 20 '22 at 00:57

I guess the difference is that there are overloads of std::sin() for float and for double, while sin() only takes double. Inside std::sin() for float, there may be a conversion to double, then a call to std::sin() for double, and then a conversion of the result back to float, making it slower.

Rudy Velthuis
  • The conversions between `float` and `double` do not account for it. I ran some tests today with g++ and found that when using `-O2` the `float` code was much slower. However, when I tested with manual conversions, like this: `(float)sin((double)input)` I found that the optimized `float` code ran _faster_ than the optimized `double` code, even though I was forcing the `float` code to use the `double` `sin` function. – Kyle A Jul 07 '17 at 01:40
  • @KyleA: That was 2011. Now is 2017. The runtime code may have changed. – Rudy Velthuis Jul 07 '17 at 06:06
  • @RudyVelthuis see my answer – Olivier Sohn Jul 05 '18 at 17:24
  • @OlivierSohn: And that is a reply to a comment almost exactly one year ago. Is this trying to be the slowest conversation on SO? – Rudy Velthuis Jul 05 '18 at 18:03

I did some measurements using clang with -O3 optimization, running on an Intel Core i7. I found that:

  • std::sin on float has the same cost as sinf
  • std::sin on double has the same cost as sin
  • The sin functions on double are 2.5x slower than on float.

Here is the full code to reproduce it:

#include <chrono>
#include <cmath>
#include <iostream>

template<typename Clock>
struct Timer
{
    using rep = typename Clock::rep;
    using time_point = typename Clock::time_point;
    using resolution = typename Clock::duration;

    Timer(rep& duration) :
    duration(&duration) {
        startTime = Clock::now();
    }
    ~Timer() {
        using namespace std::chrono;
        *duration = duration_cast<resolution>(Clock::now() - startTime).count();
    }
private:

    time_point startTime;
    rep* duration;
};

template<typename T, typename F>
void testSin(F sin_func) {
  using namespace std;
  using namespace std::chrono;
  high_resolution_clock::rep duration = 0;
  T sum {};
  {
    Timer<high_resolution_clock> t(duration);
    for(int i=0; i<100000000; ++i) {
      sum += sin_func(static_cast<T>(i));
    }
  }
  cout << duration << endl;
  cout << "  " << sum << endl;
}

int main() {
  testSin<float> ([] (float  v) { return std::sin(v); });
  testSin<float> ([] (float  v) { return sinf(v); });
  testSin<double>([] (double v) { return std::sin(v); });
  testSin<double>([] (double v) { return sin(v); });
  return 0;
}

I'd be interested if people could report their results on their architectures in the comments, especially regarding float vs. double timings.

Olivier Sohn
  • On Linux "ARMv8 Processor rev 0 (v8l)" I get these time ticks (in billions) with -O0: 63G, 61G, 7.2G, 7.1G. With -O3 it is 59.6G, 59.1G, and 2x 6.6G. So **float is ~8.7x/~9x slower** than double, while std::xxx vs xxx is probably irrelevant. On "Intel Xeon W-2123 CPU @ 3.60GHz", with -O0 I get 2.2G, 1.98G, and 2x 3.5G; with -O3 it is 2x 1.6G and 2x 3.0G, i.e. here **float is ~1.7x (O0) and ~1.8x (O3) faster** than double. My uneducated guess is that this is due to the differing results: the float result is -1.68, while the double variant returns 0.78. The flag -march=native on the Xeon makes float as slow as double. – FelEnd Dec 04 '19 at 08:57
  • @FelEnd I'm very surprised by the results on "ARMv8 Processor rev 0 (v8l)" : do you know why it's so much faster using doubles? – Olivier Sohn Dec 09 '19 at 08:18
  • No, I don't. The result holds for multiple variations I tried with your code. I tried going down the rabbit hole that is glibc's libm source code, but couldn't really pinpoint what is executed. In the assembly of your code, the difference is "bl sinf" vs "bl sin" (bl = branch with link). My blind guess is that the arm's fpu is designed for double and offers some intrinsic instruction, but one would need to inspect the assembly of libm for that. – FelEnd Dec 12 '19 at 09:32
  • After the compiler instantiates the templates, it can easily see these functions being called in a loop, at which point a good compiler will auto-vectorize and use the SIMD unit. Twice as many `float` fit in the SIMD register at once, as well as the fact the SIMD unit is optimized differently than the SISD FPU. Compilers for x86 and x86_64 architectures tend to have a switch to force choice between x87 FPU vs SSE SIMD unit, and compilers for ARM tend to have a switch to force choice between FPU and NEON SIMD unit. – Ben Voigt Aug 01 '22 at 21:19

Use the -S flag on the compiler command line and compare the assembler output of the two builds. Maybe using namespace std; pulls a lot of unused stuff into the executable.
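A sketch of that workflow, reduced to the hot call from the question (file names here are illustrative, and the exact symbols will vary by toolchain):

```shell
# Two minimal files that differ only in the using-directive:
cat > without_ns.cc <<'EOF'
#include <cmath>
float f(float x) { return sin(x); }
EOF
cat > with_ns.cc <<'EOF'
#include <cmath>
using namespace std;
float f(float x) { return sin(x); }
EOF

# -S stops after compilation and writes assembler output instead of linking:
g++ -S without_ns.cc -o without_ns.s
g++ -S with_ns.cc -o with_ns.s

# The interesting difference is which library routine each version calls:
grep 'call' without_ns.s   # expect a call to sin  (the double routine)
grep 'call' with_ns.s      # expect a call into sinf (via float std::sin)
```

This shows the same `sin` vs `sinf` switch as the disassembly above; the difference that matters here is which math routine runs inside the loop, not the size of the executable.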

George Gaál
  • That's why I had the print statements, so that if you ran the code you could see that most of the time is spent in the loop, not initialization. – ornerylawn Aug 08 '11 at 00:00