
I was trying a question on arrays on InterviewBit. In it, I wrote an inline function returning the absolute value of an integer, but on submitting it I was told that my algorithm was not efficient. When I changed to abs() from the C++ library, it got a correct-answer verdict.

Here is my function that got the "inefficient" verdict:

inline int abs(int x){return x>0 ? x : -x;}

int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
    int l = X.size();
    int i = 0;
    int ans = 0;
    while (i<l-1){
        ans = ans + max(abs(X[i]-X[i+1]), abs(Y[i]-Y[i+1]));
        i++;
    }
    return ans;
}

Here's the one that got the correct-answer verdict:

int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
    int l = X.size();
    int i = 0;
    int ans = 0;
    while (i<l-1){
        ans = ans + max(abs(X[i]-X[i+1]), abs(Y[i]-Y[i+1]));
        i++;
    }
    return ans;
}

Why did this happen? I thought inline functions were fastest, as no call is made. Or does the site have an error? And if the site is correct, what does the C++ abs() use that is faster than an inline abs()?
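For reference, here is a hypothetical, more idiomatic sketch of the same loop (a `for` loop, `+=`, and `const` references; the logic, summing the Chebyshev distance between consecutive points, is unchanged):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Hypothetical rewrite of coverPoints: same result, but with a for loop,
// +=, const references, and the library std::abs(). For each consecutive
// pair of points it adds max(|dx|, |dy|), the Chebyshev distance.
int coverPoints(const std::vector<int> &X, const std::vector<int> &Y) {
    int ans = 0;
    for (std::size_t i = 0; i + 1 < X.size(); ++i) {
        ans += std::max(std::abs(X[i] - X[i + 1]),
                        std::abs(Y[i] - Y[i + 1]));
    }
    return ans;
}
```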

Peter Mortensen
monster
    You might want to email that InterviewBit site (hello@interviewbit.com) and report a bug. Send them a link to this question so they can improve their site, or alternatively (if this is not a bug), send us some debugging data (e.g. test case where the code fails). – anatolyg Jul 09 '17 at 09:54
  • Are you allergic to `for` loops? Why would you write that `while` when you could have written the more idiomatic `for (int i=0; i < l-1; i++) {`? (You might also hoist the `l-1` calculation out of the loop condition.) Also, `ans += max(...)` would be good style. – Peter Cordes Jul 09 '17 at 14:50
  • Yeah, I get rashes when I use for loops, how did you know? :p – monster Jul 09 '17 at 15:17
  • Besides the `for` loop and `+=`, you should get in the habit of using `++i`, not postfix (though it doesn't matter here), and note that Stroustrup teaches that you put the pointer modifier with the *type*, not the *variable*: `char* p`. This is literally the first thing in his original book introducing C++ to the public. Oh, and don't name something `l`. Even when it's used in the language as a suffix, the capital is chosen to avoid an ambiguous-looking letter. – JDługosz Jul 10 '17 at 02:12
  • Your X and Y arguments are not `const`. Did you mean to say that you’re allowed to destroy them while working? – JDługosz Jul 10 '17 at 02:18

3 Answers


I don't agree with their verdict. They are clearly wrong.

On current optimizing compilers, both solutions produce exactly the same output. And even if they didn't, they would produce code as efficient as the library version. (It could be a little surprising that everything matches: the algorithm, even the registers used. Maybe because the actual library implementation is the same as the OP's?)

No sane optimizing compiler will create a branch in your abs() code (if it can be done without a branch), as the other answer suggests. And if the compiler is not optimizing, then it may not inline the library abs(), so that won't be fast either.

Optimizing abs() is one of the easiest things for a compiler to do (just add an entry for it in the peephole optimizer, and you're done).

Furthermore, I've seen library implementations in the past where abs() was implemented as a non-inline library function (that was a long time ago, though).

Proof that both implementations are the same:

GCC:

myabs:
    mov     edx, edi    ; argument passed in EDI by System V AMD64 calling convention
    mov     eax, edi
    sar     edx, 31
    xor     eax, edx
    sub     eax, edx
    ret

libabs:
    mov     edx, edi    ; argument passed in EDI by System V AMD64 calling convention
    mov     eax, edi
    sar     edx, 31
    xor     eax, edx
    sub     eax, edx
    ret

Clang:

myabs:
    mov     eax, edi    ; argument passed in EDI by System V AMD64 calling convention
    neg     eax
    cmovl   eax, edi
    ret

libabs:
    mov     eax, edi    ; argument passed in EDI by System V AMD64 calling convention
    neg     eax
    cmovl   eax, edi
    ret

Visual Studio (MSVC):

libabs:
    mov      eax, ecx    ; argument passed in ECX by Windows 64-bit calling convention 
    cdq
    xor      eax, edx
    sub      eax, edx
    ret      0

myabs:
    mov      eax, ecx    ; argument passed in ECX by Windows 64-bit calling convention 
    cdq
    xor      eax, edx
    sub      eax, edx
    ret      0

ICC:

myabs:
    mov       eax, edi    ; argument passed in EDI by System V AMD64 calling convention 
    cdq
    xor       edi, edx
    sub       edi, edx
    mov       eax, edi
    ret      

libabs:
    mov       eax, edi    ; argument passed in EDI by System V AMD64 calling convention 
    cdq
    xor       edi, edx
    sub       edi, edx
    mov       eax, edi
    ret      

See for yourself on Godbolt Compiler Explorer, where you can inspect the machine code generated by various compilers. (Link kindly provided by Peter Cordes.)
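For reference, a minimal sketch of source that could be pasted into Godbolt to reproduce such listings (assuming `-O2`; the exact register allocation varies by compiler and version):

```cpp
#include <cstdlib>  // integer overload of std::abs

// Hand-written version, as in the question.
int myabs(int x) { return x > 0 ? x : -x; }

// Library version, for side-by-side comparison of the generated code.
int libabs(int x) { return std::abs(x); }
```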

Cody Gray - on strike
geza
  • Even though you've carefully picked some cases when compilers emit the same code for built-in `abs()` and custom `abs` it does not anyhow make my statement wrong. I can easily provide some examples of the opposite: [gcc 4.8.5 and clang 4.0 emit different code](https://godbolt.org/g/opZmpp). If you try to write custom implementation of more sophisticated library functions then making compiler emit the same code would be a rather non-trivial task. Also I'm the one who is still using tcc, and it works fine as an embedded compiler. – user7860670 Jul 09 '17 at 11:27
  • @VTT: carefully picked? The question was about `abs()`: "what does c++ abs() use that is faster than inline abs()?" – geza Jul 09 '17 at 11:36
  • @VTT: what you linked is simply wrong. Here, you use the library the wrong way. Actually, using this way the library is **much** slower than the hand written way. Hint: there are int->float->int conversions... use `abs()` from stdlib.h instead. – geza Jul 09 '17 at 11:40
  • @geza Yes, it seems like you carefully picked cases when compilers produce the same code for C++ abs() and inline abs() (from the first post). Also, you didn't even post any godbolt links to verify it. And I'm using the library correctly. There is an ::std::abs overload taking integer types, though it does deal with floating-point numbers internally (at least on gcc it just calls `__builtin_fabs`). – user7860670 Jul 09 '17 at 11:48
  • @VTT: I've edited my answer a little bit, so it cannot be misunderstood – geza Jul 09 '17 at 11:51
  • @VTT: No, you use the library incorrectly. As I said, use `abs()` from stdlib.h (or cstdlib). Using the floating point version is simply bad. For two reasons: it is much **slower**, and it can be imprecise (for 64-bit numbers). – geza Jul 09 '17 at 12:03
  • @VTT: here is the [ICC](https://godbolt.org/g/M4cHDt) result. For the others, I don't have links, because I did tests on my computer (besides godbolt doesn't support for VC). – geza Jul 09 '17 at 12:09
  • Alright, it actually turns out that g++ and clang, for whatever reason, don't include the right version of `abs` in cmath even if -std=c++1z is used to compile. Another thing is that the built-in abs is implemented as `return __x >= 0 ? __x : -__x;` (at least on gcc), so I guess it was rather pointless to list all that assembly... But that still does not justify writing a custom `abs` implementation. – user7860670 Jul 09 '17 at 13:23
  • @geza thanks for your answer. Sorry, I can mark only one answer as correct, so I couldn't mark yours. Actually, I had also read somewhere that `return x>=0 ?` type statements don't produce memory leaks (or don't produce branching statements), but now I realise that only some compilers do so – monster Jul 09 '17 at 13:36
  • geza: Matt Godbolt added MSVC's CL19 x86 and x86-64 a few months ago :) See all 4 compilers, with the `#include <cstdlib>` that @VTT was missing for the integer overload of `std::abs`, at https://godbolt.org/g/uwtXTK. All using Intel syntax; IDK why you didn't use `-masm=intel` on your desktop for generating consistent asm for your answer. – Peter Cordes Jul 09 '17 at 14:37
  • @PeterCordes: thanks, I hadn't noticed that! Usually I don't bother with Intel syntax. First I checked gcc and clang, and I was too lazy to rerun the disassembly. But you are absolutely right, it would have been better if all the disassembly had used the same syntax. Thanks for noticing this! – geza Jul 09 '17 at 15:19
  • @monster: no problem, there can be only one accepted answer. What do you mean by memory leaks? I don't think you can find any current optimizing compiler which puts branching statements for your `abs()` function, when `abs()` can be done without branching (there are some old CPUs, where you cannot implement `abs()` without branching, but this is a whole different case) – geza Jul 09 '17 at 15:27
  • @geza I had a similar case (for min and max) and found that the 32-bit version was *not* optimized (by Microsoft) while the other sizes generated `cmov` as expected. http://www.dlugosz.com/zeta/?p=407 way at the bottom. – JDługosz Jul 09 '17 at 19:57
  • @JDługosz: there are no branches in that code (I see `cmovge`). And you seem to misunderstand the asm instruction `lea`. `lea` doesn't do any memory accesses. There are 4 memory operations in that code. 2 reads, 2 writes. Or maybe you referred to some other code? I've checked the last one on that page. – geza Jul 09 '17 at 20:12
  • @geza sorry my memory was inaccurate in detail. What I recalled is that a simple ternary expression became a **simple** `cmov` except for one of the primitive types which generated more complex code. That wasn’t *branching* though, meerly a level of indirection and additional registers. Point is, sometimes the compiler does other than expected for *specific cases* even if it appears to do one particular thing. – JDługosz Jul 10 '17 at 02:01
  • Would be nice if all the assembly code presented used intel syntax for consistency. – code_dredd Jul 10 '17 at 03:04
  • For what it's worth, it's only been very recently that MSVC has learned to take advantage of CMOV instructions for conditional operations, so functions like `min`/`max` used to be quite inefficient. (To be fair, though, this included the library functions; they were not optimized in any special way over what you would write yourself, so that wouldn't really explain much in this case.) However, `abs` has long been a special case, generating the `CDQ`+`XOR`+`SUB` instructions that you show here. Even Visual C++ 4.2 recognized the absolute value idiom. :-) And probably further back, as well. – Cody Gray - on strike Jul 10 '17 at 03:45
  • FWIW the CMOV is only faster when the branch is not easily predictable. IIRC when the number is positive maybe 10% of the time, then the conditional branch gets mostly correctly predicted and is faster than CMOV. FWIW Java hotspot JIT gathers statistics and decides accordingly (with some bias towards CMOV). – maaartinus Jul 10 '17 at 05:03
  • @maaartinus: I don't have that much experience with CMOV. Do you have some code snippet, where CMOV actually slower? I would be happy to play with that code :) Thanks! – geza Jul 10 '17 at 09:09
  • I've found [this answer](https://stackoverflow.com/q/21055946/581205), linked from [this answer of mine](https://stackoverflow.com/a/22755044/581205). IIRC the reason is that the correctly predicted branch gets executed in parallel with other instructions without occupying an ALU unit. `+++` Above I was wrong, Java favors branching, which leads to [Strange branching performance](https://stackoverflow.com/q/19689214/581205). – maaartinus Jul 10 '17 at 09:23
  • CMOV is *definitely* slower than a well-predicted branch on Intel processors. CMOV has a high latency on Intel, and takes a *minimum* of 2 cycles to execute. That isn't the case on AMD, though; CMOV is more efficient there, and may very well be comparable to a well-predicted branch. See also [this answer](https://stackoverflow.com/a/44759962/366904). Sorry, I don't have sample code on hand to play with, but it would be easy to write something up. But, if you're uncertain how well the branch will be predicted, CMOV is still your best/safest bet — *far* better worst-case performance. – Cody Gray - on strike Jul 10 '17 at 10:56
  • @CodyGray: From Broadwell, CMOV has a 1 cycle latency according to Agner's table. So maybe, on these processors, CMOV is never slower than a branch? – geza Jul 10 '17 at 11:16
  • Maybe. I'm not sure about that. The newest chip I've had access to is a Haswell. I kind of doubt it, though. Even a 1-cycle latency is slower than a correctly-predicted branch (which is free). CMOV also lengthens the dependency chain, whereas a branch resets it. But it *might* average out to be faster than a 95% predicted branch, or something arbitrary threshold like that. – Cody Gray - on strike Jul 10 '17 at 11:34

Your abs performs branching based on a condition, while the built-in variant just removes the sign from the integer without a branch, most likely using just a couple of instructions. A possible assembly example (taken from here):

cdq
xor eax, edx
sub eax, edx

The cdq instruction copies the sign of the register eax to register edx. For example, if it is a positive number, edx will be zero; otherwise, edx will be 0xFFFFFFFF, which denotes -1. The xor operation with the original number changes nothing if it is a positive number (any number xor 0 is unchanged). However, when eax is negative, eax xor 0xFFFFFFFF yields (not eax). The final step is to subtract edx from eax. Again, if eax is positive, edx is zero and the final value is still the same. For negative values, (~eax) - (-1) = (~eax) + 1 = -eax, which is the value wanted.

As you can see this approach uses only three simple arithmetic instructions and no conditional branching at all.
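The same branchless sequence can be sketched in C++ (assuming 32-bit `int` and an arithmetic right shift on negative values, which mainstream compilers provide but which the standard left implementation-defined before C++20):

```cpp
// Branchless absolute value, mirroring the cdq/xor/sub sequence above.
// Assumes 32-bit int and an arithmetic right shift (sign-replicating).
int branchless_abs(int x) {
    int mask = x >> 31;        // like cdq: 0 for x >= 0, -1 for x < 0
    return (x ^ mask) - mask;  // like xor + sub
}
```

As with the assembly version, the behavior for INT_MIN is undefined, just as it is for the library abs().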

Edit: After some research it turned out that many built-in implementations of abs use the same approach, return __x >= 0 ? __x : -__x;, and such a pattern is an obvious target for compiler optimization to avoid unnecessary branching.

However, that does not justify the use of a custom abs implementation, as it violates the DRY principle, and no one can guarantee that your implementation will be just as good in more sophisticated scenarios and/or on unusual platforms. Typically, one should think about rewriting a library function only when there is a definite performance problem or some other defect detected in the existing implementation.

Edit2: Just switching from int to float shows considerable performance degradation:

float libfoo(float x)
{
    return ::std::fabs(x);
}

andps   xmm0, xmmword ptr [rip + .LCPI0_0]

And a custom version:

inline float my_fabs(float x)
{
    return x>0.0f?x:-x;
}

float myfoo(float x)
{
    return my_fabs(x);
}

movaps  xmm1, xmmword ptr [rip + .LCPI1_0] # xmm1 = [-0.000000e+00,-0.000000e+00,-0.000000e+00,-0.000000e+00]
xorps   xmm1, xmm0
xorps   xmm2, xmm2
cmpltss xmm2, xmm0
andps   xmm0, xmm2
andnps  xmm2, xmm1
orps    xmm0, xmm2

online compiler
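The `andps` in the library version simply clears the sign bit. A portable sketch of the same idea (assuming 32-bit IEEE-754 `float`; `memcpy` is used for well-defined type punning):

```cpp
#include <cstdint>
#include <cstring>

// Clear the sign bit of a float, mirroring what andps does:
// copy the bits to an integer, mask off bit 31, copy back.
float fabs_bits(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0x7FFFFFFFu;  // drop the sign bit
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```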

user7860670
  • Most importantly, the built-in will not have branch prediction failure on random data, thus resulting in instruction-cache-friendlier code. – StoryTeller - Unslander Monica Jul 09 '17 at 09:21
  • @StoryTeller: There is no such thing as branch prediction failure on conditional moves. – Damon Jul 09 '17 at 09:35
  • @Damon - Does the online judge where the OP submitted the code use a conditional move? I do believe we cannot say. – StoryTeller - Unslander Monica Jul 09 '17 at 09:38
  • @StoryTeller If the online judge cares about efficiency yet does not use a decent optimising compiler, what we can say is that's the fault of the online judge, not of the OP. –  Jul 09 '17 at 09:43
  • No optimizing compiler will emit a branch for the OP's abs(). – geza Jul 09 '17 at 09:44
  • @geza Your statement is rather bold. The chance that a compiler provides nice built-ins is much higher than the chance that a compiler will optimize hand-written code just as well. – user7860670 Jul 09 '17 at 09:53
  • @VTT: do you know any compiler, for which your statement is true? – geza Jul 09 '17 at 09:56
  • @VTT: Reinventing the wheel may (in such a trivial case) be a valid option, though. Note that I'm not advocating to rewrite the standard library for no reason, but it is not _always_ a mistake, and turning down a solution just because of that is bad. If a trivial one-expression function saves you from including a header which you wouldn't otherwise need in a critical spot, that may -- on a large project -- save minutes or hours of dev time per month not spent compiling. Developer time is money, you know. Why are people using nasty stuff like PIMPL? Well, for the same reason. Time is money. – Damon Jul 09 '17 at 10:00
  • @geza gcc, tcc (though this one can hardly be called optimizing), clang, Intel c++, MS VS – user7860670 Jul 09 '17 at 10:11
  • @Damon First I admit that I've removed the line about reinventing the wheel from my comment, but it happened before you posted yours. What you are suggesting is just a direct violation of the DRY principle. When people start copy-pasting just to avoid including one header to save compilation time, it is a direct sign of poor development quality. – user7860670 Jul 09 '17 at 10:17
  • @VTT: care to see my answer with proofs that you're wrong about this? tcc is an ancient (possibly non-optimizing) compiler, no one cares about it. I don't have access to icc, but I am absolutely sure that it will generate optimal code for abs(). For the other compilers, there are proofs in my answer. – geza Jul 09 '17 at 10:21
  • @geza ICC can be tested online on https://gcc.godbolt.org/ if you care. –  Jul 09 '17 at 10:40
  • @VTT - Slightly off track, but I'll add a comment either way... I'm a big fan of DRY, but when we consider rules which cannot be broken or bent, we often end up in an equally poor development-quality situation. DRY does couple code; it is a balance we need to consider. – Jamie Jul 09 '17 at 14:08
  • About the float version: it is because the compiler uses strict floating-point arithmetic, so it cannot optimize that code away. If you add `-ffast-math`, you'll get an equally good solution. – geza Jul 09 '17 at 15:31
  • The float example is invalid because the functions you are comparing don't do the same thing. – Baum mit Augen Jul 09 '17 at 20:13
  • @BaummitAugen Well, writing more sophisticated fabs implementation handling special values most likely would lead to even worse code generated. – user7860670 Jul 09 '17 at 20:36

Your solution might arguably be "cleaner" by the textbook if you used the standard library version, but I think the evaluation is wrong. There is no truly good, justifiable reason for your code being rejected.

This is one of those cases where someone is formally correct (by the textbook), but insists on knowing the only correct solution in a sheer stupid way rather than accepting an alternative solution and saying "...but this here would be best practice, you know".

Technically, it's a correct, practical approach to say "use the standard library, that's what it is for, and it's likely optimized as much as can be". Even though the "optimized as much as can be" part can, in some situations, very well be wrong due to some constraints that the standard puts onto certain algorithms and/or containers.

Now, opinions, best practice, and religion aside. Factually, if you compare the two approaches...

int main(int argc, char**)
{
  40f360:       53                      push   %rbx
  40f361:       48 83 ec 20             sub    $0x20,%rsp
  40f365:       89 cb                   mov    %ecx,%ebx
  40f367:       e8 a4 be ff ff          callq  40b210 <__main>
return std::abs(argc);
  40f36c:       89 da                   mov    %ebx,%edx
  40f36e:       89 d8                   mov    %ebx,%eax
  40f370:       c1 fa 1f                sar    $0x1f,%edx
  40f373:       31 d0                   xor    %edx,%eax
  40f375:       29 d0                   sub    %edx,%eax
//}

int main(int argc, char**)
{
  40f360:       53                      push   %rbx
  40f361:       48 83 ec 20             sub    $0x20,%rsp
  40f365:       89 cb                   mov    %ecx,%ebx
  40f367:       e8 a4 be ff ff          callq  40b210 <__main>
return (argc > 0) ? argc : -argc;
  40f36c:       89 da                   mov    %ebx,%edx
  40f36e:       89 d8                   mov    %ebx,%eax
  40f370:       c1 fa 1f                sar    $0x1f,%edx
  40f373:       31 d0                   xor    %edx,%eax
  40f375:       29 d0                   sub    %edx,%eax
//}

... they result in exactly the same, identical instructions.

But even if the compiler did use a compare followed by a conditional move (which it may do in more complicated "branching assignments", and which it will do, for example, in the case of min/max), that's maybe one CPU cycle or so slower than the bit hacks, so unless you do this several million times, the "not efficient" verdict is rather doubtful anyway.
One cache miss, and you pay a hundred times the penalty of a conditional move.

There are valid arguments for and against either approach, which I won't discuss in length. My point is, turning down the OP's solution as "totally wrong" because of such a petty, unimportant detail is rather narrow-minded.

EDIT:

(Fun trivia)

I just tried, for fun and no profit, on my Linux Mint box which uses a somewhat older version of GCC (5.4 as compared to 7.1 above).

Due to me including <cmath> without much of a thought (hey, a function like abs very clearly belongs to math, doesn't it!) rather than <cstdlib> which hosts the integer overload, the result was, well... surprising. Calling the library function was much inferior to the single-expression wrapper.

Now, in defense of the standard library, if you include <cstdlib>, then, again, the produced output is exactly identical in either case.

For reference, the test code looked like:

#ifdef DRY
  #include <cmath>
  int main(int argc, char**)
  {
     return std::abs(argc);
  }
#else
  int abs(int v) noexcept { return (v >= 0) ? v : -v; }
  int main(int argc, char**)
  {
     return abs(argc);
  }
#endif

...resulting in

4004f0: 89 fa                   mov    %edi,%edx
4004f2: 89 f8                   mov    %edi,%eax
4004f4: c1 fa 1f                sar    $0x1f,%edx
4004f7: 31 d0                   xor    %edx,%eax
4004f9: 29 d0                   sub    %edx,%eax
4004fb: c3                      retq 

Now, it is apparently quite easy to fall into the trap of unwittingly using the wrong standard library function (I demonstrated how easy it is myself!). And all that without the slightest warning from the compiler, such as "hey, you know, you're using a double overload on an integer value" (well, obviously there's no warning; it's a valid conversion).

With that in mind, there may be yet another "justification" why the OP providing his own one-liner wasn't all that terribly bad and wrong. After all, he could have made that same mistake.
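As a hypothetical safeguard against this overload trap, one can assert the overload's return type at compile time (assuming `<cstdlib>` is included, which provides the integer overload of `std::abs`):

```cpp
#include <cstdlib>      // integer overloads of std::abs
#include <type_traits>

// If only <cmath> were included (on an affected library version),
// std::abs on an int could silently pick a floating-point overload.
// This static_assert documents that the integer overload is in play.
static_assert(std::is_same<decltype(std::abs(123)), int>::value,
              "with <cstdlib>, std::abs(int) returns int");
```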

Damon
  • So the library implementation you use has a naive `abs`. So what? You're not *guaranteed* to always get a super-optimized version, but you *can*. And that makes a huge difference, like the OP's code. – StoryTeller - Unslander Monica Jul 09 '17 at 09:26
  • And the problem is not with the extra cycle of the branch. The problem is with the extra cycle of the branch + extra cycles of branch prediction failure + the instruction cache getting dirty. That can impose some severe penalties, depending on the distribution of data. A "bit-hack" will not. – StoryTeller - Unslander Monica Jul 09 '17 at 09:29
  • @StoryTeller It's the other way around. The custom naive `abs` is detected by the compiler as equivalent to `abs` and replaced by the same instructions that get generated for the library `abs`. –  Jul 09 '17 at 09:40
  • @StoryTeller: As above, do note that there is no branch prediction failure. Please familiarize yourself with the facts prior to making such statements. CMOV does not branch, nor predict. You have a dependency chain on the compare, and that's it. But you **also** have a dependency chain on the 4 instructions generated by my compiler, and on the 3 instructions in VTT's super-optimized assembly snippet. The practical difference will be zero or one cycle. If there is a difference, it is only measurable when the result is needed immediately after. Otherwise pipelining eliminates it. – Damon Jul 09 '17 at 09:40
  • @hvd - That's more plausible. But that just gets us back to the point of not knowing what the online judge actually does when building the code, to generate such a performance difference (which empirically exists). – StoryTeller - Unslander Monica Jul 09 '17 at 09:43
  • You've stepped into the same trap as VTT. Use cstdlib instead of cmath. – geza Jul 09 '17 at 12:12
  • @Damon thanks for your answer. Sorry, I can't mark multiple answers correct :) – monster Jul 09 '17 at 13:37
  • The scalar `double` is a constant with all bits except the sign bit set. [bitwise-AND to clear the sign bit is how you implement absolute-value for a float/double in an SSE register](https://stackoverflow.com/questions/32408665/fastest-way-to-compute-absolute-value-using-sse). The `00` on a line by itself is line-wrapping because `movsd` is long. Use `objdump -dw` to disable wrapping long instructions. (I use `alias disas='objdump -drwC -Mintel'`). And as geza said, this is only happening because you forgot to `#include <cstdlib>` for the integer overloads of `std::abs()`. Silly C++. – Peter Cordes Jul 09 '17 at 14:43
  • I can't really bring myself to upvote this answer until you remove the last edit, which has already been pointed out is irrelevant/invalid due to a failure to include the appropriate header file. I considered removing it myself, but it's a pretty big chunk, so I wasn't comfortable exercising that much discretion. (It *might* still be worth mentioning this, though, as it's highly possible that others will make the same mistakes regarding header file inclusion.) – Cody Gray - on strike Jul 10 '17 at 03:57
  • @CodyGray: I'm not very much in favour of removing/hiding stuff -- _quod scripsi, scripsi_. But I've edited the edit to reflect that the claim was wrong, after verifying with the correct include file. The fact that I picked the wrong overload (after ~26 years of C++) is kinda telling of the dangers ahead :-) – Damon Jul 10 '17 at 10:48