5

Hi, could anyone help me understand why it is more efficient to call math library functions than to write inline assembly code that performs the same operation? I wrote this simple test:

#include <stdio.h>
#define __USE_GNU
#include <math.h>

int main( void ){
    float ang;
    int i;

    for( i = 0; i < 1000000; i++ ){
        ang = M_PI_2 * i/2000000;
        /* Inline assembly version (commented out): fld pushes ang onto the
           x87 stack, fptan leaves tan(ang) in ST(1) and pushes 1.0, fxch
           brings the tangent to the top, and fstp stores it back into ang
           (the pushed 1.0 is left on the x87 stack). */
        /*__asm__ ( "fld %0;"
                  "fptan;"
                  "fxch;"
                  "fstp %0;" : "=m" (ang) : "m" (ang)
        ) ;*/
        /* Math library version: */
        ang = tanf(ang);
    }
    printf("Tan(ang): %f\n", ang);
    return 0;
}

That code computes the tangent of an angle in two different ways: one calls the tanf function from the dynamically linked math library (libm), and the other uses inline assembly code. Note that I comment out one of the two alternatives at a time. The code performs the operation a million times in order to get meaningful results with the time command in a Linux terminal.

The version that uses the math library takes around 0.040s. The version that uses the assembly code takes around 0.440s; ten times more.

These are the results of the disassembly. Both versions have been compiled with the -O3 option.

LIBM

  4005ad:   b8 db 0f c9 3f          mov    $0x3fc90fdb,%eax
  4005b2:   89 45 f8                mov    %eax,-0x8(%rbp)
  4005b5:   f3 0f 10 45 f8          movss  -0x8(%rbp),%xmm0
  4005ba:   e8 e1 fe ff ff          callq  4004a0 <tanf@plt>
  4005bf:   f3 0f 11 45 f8          movss  %xmm0,-0x8(%rbp)
  4005c4:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  4005c8:   83 7d fc 00             cmpl   $0x0,-0x4(%rbp)
  4005cc:   7e df                   jle    4005ad <main+0x19>

ASM

  40050d:   b8 db 0f c9 3f          mov    $0x3fc90fdb,%eax
  400512:   89 45 f8                mov    %eax,-0x8(%rbp)
  400515:   d9 45 f8                flds   -0x8(%rbp)
  400518:   d9 f2                   fptan  
  40051a:   d9 c9                   fxch   %st(1)
  40051c:   d9 5d f8                fstps  -0x8(%rbp)
  40051f:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  400523:   83 7d fc 00             cmpl   $0x0,-0x4(%rbp)
  400527:   7e e4                   jle    40050d <main+0x19>

Any idea? Thanks.

I think I have an idea. Browsing the glibc source I found out that the tanf function is implemented with a polynomial approximation using the SSE extensions. I guess that turns out to be faster than the microcode behind the fptan instruction.
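To illustrate what such an approximation looks like, here is a minimal sketch of my own (not glibc's actual code; the coefficients are just the truncated Taylor series of tan around 0, so it is only reasonable for small angles):

#include <math.h>
#include <stdio.h>

/* Illustrative only: a truncated Taylor series of tan(x) around 0.
 * Real libm implementations use argument reduction and carefully chosen
 * minimax coefficients, but the structure is similar: a handful of
 * multiplications and additions that map directly onto SSE instructions. */
static float my_tanf(float x)
{
    float x2 = x * x;
    /* tan(x) ~ x + x^3/3 + 2x^5/15 + 17x^7/315, for small |x| */
    return x * (1.0f + x2 * (1.0f/3.0f
               + x2 * (2.0f/15.0f
               + x2 * (17.0f/315.0f))));
}

int main(void)
{
    float a = 0.5f;
    printf("my_tanf: %f, tanf: %f\n", my_tanf(a), tanf(a));
    return 0;
}

The point is that the whole evaluation is just a short dependency chain of mulss/addss instructions, with no microcoded x87 operation involved.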

pcremades
  • did you disassemble it and see if there was some optimization that happened? – old_timer Mar 08 '14 at 17:59
  • Maybe it can't optimise mixed assembly and C, but it can optimise the C by itself and ignore all but the last loop iteration. – QuentinUK Mar 08 '14 at 18:05
  • An optimizing compiler is not required to calculate results you never use, so there is no guarantee that the operation is being performed other than the last time. But speaking more to the intended question, the authors of the math library are probably better at writing efficient assembly code than you are. – Chris Stratton Mar 08 '14 at 18:09
  • I compiled the two versions using full optimization (-O3). I disassembled the two object files and I see the loop is present in both. I attached part of the disassembly listing to the original question. – pcremades Mar 08 '14 at 20:10
  • Please help me understand the assembly code: where exactly is the current value of ang, -0x8(%rbp), computed from i, -0x4(%rbp)? Or is it that the optimizing compiler detected that i/2000000 is, as the result of an integer division, always zero, so that ang=0 is loaded as a precomputed constant? Better use M_PI/2000000*i. – Lutz Lehmann Mar 08 '14 at 21:23
  • @LutzL the `M_PI_2 * i` part of the expression should be automatically promoted to a double first, and then the division too. – Alnitak Mar 08 '14 at 21:49
  • LutzL, you are right, the division was not computed. My mistake. Anyway, the tangent is computed one million times in both cases. And yes, it is in -0x8(%rbp). – pcremades Mar 08 '14 at 21:55
  • @QuentinUK, GCC certainly can optimize (around) assembly language. You tell it what inputs and outputs are, and what data the assembly snippet might change, and the compiler rearranges what happens around it. – vonbrand Mar 08 '14 at 22:06
  • @Alnitak: Yes, it should, but the assembler code says that there is no operation at all happening for this line. It is a move, the call to tanf, a move, then increase i, compare to 0 and jump. It is not shown, but apparently in the assembly i is initialized to -1000000. – Lutz Lehmann Mar 08 '14 at 22:42
  • Yes, you are right. But again, both versions perform the mathematical operation, no matter the operand. So the question remains: how is it possible that calling a math library function is faster than using the fptan instruction? – pcremades Mar 09 '14 at 13:33

2 Answers

4

There's a major difference in the implementation of those functions.

fptan is a legacy 8087 instruction that uses the floating-point stack. Even originally, 8087 instructions were microcoded: invoking the fptan instruction caused a predefined program to run inside the 8087, which used the fundamental capabilities of the processor, such as floating-point addition or even multiplication. Microcoding bypasses some stages of the "natural" pipeline, e.g. prefetch and decode, which speeds up the process.

The algorithm selected for the trigonometric functions in the 8087 was CORDIC.

Even though microcoding made fptan faster than issuing each elementary instruction explicitly, this was not the end of floating-point processor development; we could rather say that 8087 development was over. In later processors fptan most probably has to be implemented as-is, as an IP block that behaves identically to the original instruction, with some glue logic so that it produces bit-by-bit the same output as the original.

Later processors first recycled the FP stack for "MMX". Then a completely new set of registers (XMM) was introduced, along with an instruction set (SSE) capable of executing basic floating-point operations in parallel. First of all, support for extended-precision (80-bit) floating point was dropped. Then again, 20+ years of Moore's law allowed a much higher transistor count to be spent on, e.g., 64x64-bit parallel multipliers that speed up multiplication throughput.

Other instructions have suffered too: loop was once faster than the sub ecx, 1; jnz combination, and aam is probably slower today than conditionally adding 10 to some nibble of eax. These 20+ years of Moore's law have also allowed millions of transistors to speed up the prefetch stage: in the 8086, every single byte in the instruction encoding counted as one more cycle; today several instructions execute within a single cycle, because the instructions have already been fetched from memory.

That being said, you can also test whether a single instruction such as aam is actually faster than implementing its effect with an equivalent set of simpler, optimized instructions. This is the benefit of a library: it can use the fptan instruction, but it doesn't have to if the processor architecture supports a faster set of instructions, more parallelism, a faster algorithm, or all of those.
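To make that comparison concrete, here is a minimal benchmark sketch of my own (an illustration, not from the answer; it assumes x86-64 GCC/glibc and POSIX clock_gettime, which may need -lrt on older glibc). It times tanf against an fptan wrapper on the same inputs, using a volatile sink so -O3 cannot delete either loop:

#define _GNU_SOURCE
#include <math.h>
#include <stdio.h>
#include <time.h>

/* fptan via inline assembly. The operand goes through a memory slot, as in
 * the question; the extra fstp pops the 1.0 that fptan pushes, so the x87
 * stack stays balanced across iterations. */
static float tan_fptan(float x)
{
    __asm__ ("flds %0\n\t"
             "fptan\n\t"
             "fstp %%st(0)\n\t"
             "fstps %0"
             : "+m" (x));
    return x;
}

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    enum { N = 1000000 };
    volatile float sink = 0.0f;  /* prevents -O3 from deleting the loops */
    double t0, t1, t2;
    int i;

    t0 = seconds();
    for (i = 0; i < N; i++)
        sink = tanf((float)(M_PI_2 * i / 2000000.0));
    t1 = seconds();
    for (i = 0; i < N; i++)
        sink = tan_fptan((float)(M_PI_2 * i / 2000000.0));
    t2 = seconds();

    printf("tanf:  %.3f s\nfptan: %.3f s\nlast value: %f\n",
           t1 - t0, t2 - t1, sink);
    return 0;
}

The volatile sink and the per-iteration angle computation cost the same in both loops, so the difference between the two timings is mostly the tanf-versus-fptan cost.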

Aki Suihkonen
  • But according to the Intel 64/IA-32 architecture manual, the xmm registers are physically mapped to the FPU registers. – pcremades Mar 08 '14 at 21:28
  • @user3396631: No, the **MMX** registers are mapped to the x87-stack. MMX is just as obsolete as the x87-FPU with the availability of the XMM/YMM registers of SSE-SSE4/AVX/whatever-SIMD. – EOF Mar 08 '14 at 23:04
  • The FPU registers and the stack are the same thing. On the other hand, the math library seems to use MMX, because the operand is stored in the xmm0 register before calling tanf. – pcremades Mar 09 '14 at 13:25
  • MMX was the first SIMD register set, which mapped over ST(0)..(7) to be compliant with existing OS's, which had to save/restore FPU at task switch. XMM is a completely separate set of registers available originally through "SSE" instruction set. **MMX != XMM**. – Aki Suihkonen Mar 09 '14 at 15:31
  • FYI, MMX registers are named _mm0_ .. _mm7_. – Aki Suihkonen Mar 09 '14 at 15:58
0

Here (Fedora 20, gcc-4.8.2-7.fc20.x86_64, Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz, compiled with -O2) I see a user time of 0.161s (asm) vs 0.076s (libm).

And while the compiler would be allowed to get rid of the loop in the library version (it knows tanf(3m) is a pure function), the assembly shows that the loop is there. And the function isn't inlined; it is a real function call here. Still faster. Strange.

OK, it looks like the difference is due to shuffling the argument to the asm() snippet around (it is placed into a local variable and used from there). I'm no expert on x86_64, and my GCC asm constraint-fu has rusted...
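For reference, here is a sketch of how the constraints might be written so the value stays on the x87 stack instead of going through a stack slot (my own guess, not tested against the question's setup; "t" is GCC's constraint for the top of the x87 stack, %st(0)):

/* Assumed alternative to the question's "m" constraints: the compiler
 * loads ang into %st(0) for us ("0" ties the input to output 0) and
 * expects the result back in %st(0) ("=t"), so no explicit fld/fstp
 * through memory is needed. fptan pushes an extra 1.0, which the
 * fstp discards. */
__asm__ ("fptan\n\t"
         "fstp %%st(0)"
         : "=t" (ang)
         : "0" (ang));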

(In any case, you would have to subtract the cost of the whole for loop and of computing the angle. For a simple operation like this, that could be a very significant fraction of the total.)

vonbrand