13

I wanted to try making my own absolute value function. I figured that the fastest way to calculate absolute value would be to simply mask out the sign bit (the most significant bit in IEEE 754). I wanted to compare its speed to the standard abs function. Here is my implementation:

// Union used for type punning
union float_uint_u
{
    float f_val;
    unsigned int ui_val;
};

// 'MASK' has all bits == 1 except the sign (most significant) bit
constexpr unsigned int MASK = ~(1 << (sizeof(int) * 8 - 1));

float abs_bitwise(float value)
{
    float_uint_u ret;
    ret.f_val = value;
    ret.ui_val &= MASK;
       
    return ret.f_val;
}

For the record, I know that this sort of type punning is not standard C++. However, this is just for educational purposes, and according to the docs, this is supported in GCC.
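For comparison, the standard-compliant way to do this punning is `std::memcpy` (or `std::bit_cast` since C++20); GCC and Clang typically optimize the copies away, so a sketch like this compiles to the same kind of code as the union version:

```cpp
#include <cstdint>
#include <cstring>

// Standard-compliant type punning via memcpy; the copies are
// optimized away entirely by GCC and Clang at -O2.
float abs_memcpy(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // read the float's bit pattern
    bits &= 0x7FFFFFFFu;                       // clear the sign bit
    std::memcpy(&value, &bits, sizeof value);  // write the bits back
    return value;
}
```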

I figured this should be the fastest way to calculate absolute value, so it should at the very least be as fast as the standard implementation. However, timing 100000000 iterations of random values, I got the following results:

Bitwise time: 5.47385 | STL time: 5.15662
Ratio: 1.06152

My abs function is about 6% slower.
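The timing harness itself isn't shown here; as a rough sketch of how such a loop might be timed (names are illustrative; inputs are pre-generated so the RNG cost stays out of the measured region):

```cpp
#include <chrono>
#include <cmath>
#include <random>
#include <vector>

// Time one std::fabs call per pre-generated input; returns elapsed seconds.
double time_fabs(const std::vector<float>& values)
{
    volatile float sink = 0.0f;  // keeps the loop from being optimized away
    auto start = std::chrono::steady_clock::now();
    for (float v : values)
        sink = std::fabs(v);
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Example input set (smaller than the question's 100000000 for brevity):
std::vector<float> make_inputs(std::size_t n)
{
    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(-1000.0f, 1000.0f);
    std::vector<float> values(n);
    for (float& v : values) v = dist(gen);
    return values;
}
```

The `volatile` sink prevents the compiler from deleting the loop entirely; also note that calling a non-inlined function adds calling-convention overhead that can dwarf a one-instruction difference.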

Assembly output

I compiled with -O2 optimization and the -S option (assembly output) to help determine what was going on. I have extracted the relevant portions:

; Standard abs: 16(%rsp) is a value obtained from standard input
movss   16(%rsp), %xmm0
andps   .LC5(%rip), %xmm0 ; .LC5 == 2147483647
movq    %rbp, %rdi
cvtss2sd    %xmm0, %xmm0

; My abs_bitwise
movl    16(%rsp), %eax
movq    %rbp, %rdi
andl    $2147483647, %eax
movd    %eax, %xmm0
cvtss2sd    %xmm0, %xmm0

Observations

I'm not great at assembly, but the main thing I noticed is that the standard function operates directly on the xmm0 register, whereas mine first moves the value to eax (for some reason), performs the and there, and then moves the result into xmm0. I'm assuming the extra mov is where the slowdown happens. I also noticed that the standard version loads its bit mask from memory (.LC5) rather than using an immediate; I'm guessing that's not significant, however. The two versions also use different load instructions (movss vs. movl).

System info

This was compiled with g++ on Debian Linux (unstable branch). g++ --version output:

g++ (Debian 10.2.1-6) 10.2.1 20210110

If these two versions of the code both calculate absolute value the same way (via an and), why doesn't the optimizer generate the same code? Specifically, why does it feel the need to include an extra mov when it optimizes my implementation?

Lysol
  • 1,547
  • 13
  • 27
  • Did you look at the g++ source code to see how that compiler implements `abs`? Also, `abs` is not "built-in" as you suggest. Yes, `abs` is a standard function, but it is a function that is no different than any other function. You can see the source code to `abs`. – PaulMcKenzie Feb 03 '21 at 08:07
  • Seems you have several `mov`s; have you tried to just cast and apply the mask in a row? – Adrian Maire Feb 03 '21 at 08:18
  • 1
    Not sure why but in MSVC, your implementation is way faster in debug but equal in the release. – D-RAJ Feb 03 '21 at 08:20
  • 4
    @D-RAJ doing performance comparisons in debug is meaningless. They might have some checkings active in debug mode that are not present in the release build. – t.niese Feb 03 '21 at 08:23
  • 1
    Have you tried inlining the function? – Xoozee Feb 03 '21 at 08:25
  • 3
    @WolfgangLorenz if you mean the `inline` keyword, then you should do some research about what `inline` means nowadays. `inline` is important if you define a function in the header, and use that header in multiple compilations units. But for modern compilers the `inline` keyword alone does not change the probability whether it will inline a function or not. – t.niese Feb 03 '21 at 08:30
  • 2
    What OS are you compiling for? The ABI is strange. – Marc Glisse Feb 03 '21 at 08:41
  • @t.niese I think the point is that benchmarking the function in isolation may be less meaningful that benchmarking it in an application where it is inlined in some callers. Indeed, most of the instructions are about moving things around to match the calling convention and would disappear after inlining. – Marc Glisse Feb 03 '21 at 08:53
  • @t.niese I just read about inline and you are right. I had somehow totally missed that one. Thank you. – Xoozee Feb 03 '21 at 08:54
  • 1
    What is your platform and compilation flags? `float` argument is typically passed to a function call in `xmm0` so I don't understand why anything is loaded from stack in your case. I got quite different machine code: https://godbolt.org/z/PTa9dr. I guess the problem is that you perform the AND operation with an integer data type, that's why the compiler does AND on integer registers instead of FP ones. – Daniel Langr Feb 03 '21 at 09:06
  • @DanielLangr Compiling on Debian Linux with GCC 10.2.1. That makes sense about it using integer registers. I guess it's not smart enough to realize it's actually float data. – Lysol Feb 03 '21 at 09:27
  • @PaulMcKenzie I get you. Would it be more accurate then to say "standard"? – Lysol Feb 03 '21 at 09:30
  • https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98962 for not using andps (but that's not exactly what you are asking in your question) – Marc Glisse Feb 03 '21 at 21:09
  • 1
    `1` should be `1u` in the mask initialization, otherwise it is implementation-defined behaviour (left-shifting a 1 into the sign bit) – M.M Feb 04 '21 at 06:17

2 Answers

7

I got slightly different assembly. According to the x86_64 Linux ABI, a float argument is passed via xmm0. With standard fabs, the bitwise AND operation is performed directly on this register (Intel syntax):

andps xmm0, XMMWORD PTR .LC0[rip] # .LC0 contains 0x7FFFFFFF
ret

However, in your case, the bitwise AND is performed on objects of type unsigned int. Therefore, GCC does the same, which requires moving xmm0 to eax first:

movd eax, xmm0
and  eax, 2147483647
movd xmm0, eax
ret

Live demo: https://godbolt.org/z/xj8MMo

I haven't found any way to force the GCC optimizer to perform the AND directly on xmm0 using only pure C/C++ source code. It seems that efficient implementations need to be built on assembler code or Intel intrinsics.
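The intrinsics route (not pure ISO C++, but portable across x86 compilers) keeps the AND in an XMM register, like the library fabs; a sketch, assuming an x86 target:

```cpp
#include <immintrin.h>

// Perform the sign-bit AND directly on an XMM register via SSE intrinsics,
// mirroring the andps that the library fabs compiles to.
float abs_sse(float x)
{
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    return _mm_cvtss_f32(_mm_and_ps(_mm_set_ss(x), mask));
}
```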

Relevant question: How to perform a bitwise operation on floating point numbers. All the proposed solutions basically result in the same outcome.

I also tried to use the copysign function, but the result was even worse. The generated machine code then contained x87 instructions.


Anyway, it is quite interesting that the Clang optimizer was clever enough to make the assembly in all 3 cases equivalent: https://godbolt.org/z/b6Khv5.

Daniel Langr
  • 22,196
  • 3
  • 50
  • 93
  • 1
    How did you manage to get x87 instructions with copysign? Gcc recognizes `copysignf(x,1.)` as `fabsf(x)`. – Marc Glisse Feb 03 '21 at 21:14
  • @MarcGlisse No, it does not, at least in my case: https://godbolt.org/z/KcsGx8. You need to distinguish between `fabs` function and x87 `fabs` instruction. – Daniel Langr Feb 04 '21 at 04:25
  • It helps if you use copysignf (or std::copysign) instead of plain copysign. Using gcc trunk also helps. Still, weird that there can be x87 instructions with copysign... And I cannot reproduce it locally, maybe godbolt has a strange setup. – Marc Glisse Feb 04 '21 at 08:13
  • @MarcGlisse This question is not about `copysign` and the discussion about it is off-topic here. However, it might be a good topic for a separate question instead. – Daniel Langr Feb 04 '21 at 08:42
3

Why is the standard “abs” function faster than mine?

Because with most optimizing compilers (in particular GCC or Clang), it will use a specialized machine instruction known to the compiler.

The GCC compiler even has a builtin for abs.
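As a sketch, the single-precision builtin can be called directly (`__builtin_fabsf`; both GCC and Clang provide it):

```cpp
// __builtin_fabsf is the GCC/Clang builtin for single-precision fabs;
// at -O2 on x86-64 it typically compiles to an andps with a constant mask.
float abs_builtin(float x)
{
    return __builtin_fabsf(x);
}
```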

Be sure to compile with gcc -O3 and perhaps -ffast-math.

You could study the assembler code: compile your example.c as gcc -Wall -O3 -ffast-math -fverbose-asm example.c and look inside the emitted example.s assembler file.

On Linux systems (e.g. Debian), you could study the source code of GNU libc and look inside the math.h standard header (and use g++ -O3 -C -E to get the preprocessed form).

Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547