
Here is what I'm trying to achieve. It's simple enough:

unsigned int foo1(bool cond, unsigned int num)
{
    return cond ? num : 0;
}

Assembly:

    test    dil, dil
    mov     eax, 0
    cmovne  eax, esi
    ret

My question is: is there a faster way to do it? Here are some ways I thought of:

Using multiplication:

unsigned int foo2(bool cond, unsigned int num)
{
    return cond * num;
}

Assembly:

    movzx   eax, dil
    imul    eax, esi
    ret

Using memory access:

unsigned int foo3(bool cond, unsigned int num)
{
    static const unsigned int masks[2] = { 0x0, 0xFFFFFFFF };
    return masks[cond] & num;
}

Assembly:

    movzx   edi, dil
    mov     eax, DWORD PTR foo3(bool, unsigned int)::masks[0+rdi*4]
    and     eax, esi
    ret

Using some bit tricks:

unsigned int foo4(bool cond, unsigned int num) 
{
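    /* 0 - cond is 0x00000000 when cond is false and 0xFFFFFFFF when it is true */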
    return (0 - (unsigned)cond) & num;
}

Assembly:

    movzx   eax, dil
    neg     eax
    and     eax, esi
    ret

Now, multiplication yields the fewest instructions, so I think it's the best choice, but I'm not sure about the `imul`. Any suggestions?

Thanks in advance,

Elad Weiss
  • The `imul` instruction may take quite some time (I'm not sure though). Memory access is probably the worst. Can you benchmark it? Is this really critical enough to be worth optimizing? – Jabberwocky Nov 18 '16 at 08:58
  • `fewer instructions != more performance`. It really depends on how many cycles each instruction takes (plus perhaps how well instruction-level parallelism works). The `imul` instruction could very well take more cycles to execute than a simple test. Your best bet AFAIK is to use `__builtin_expect` to help the branch predictor, if that helps at all (a minimal sketch follows these comments). – Arunmu Nov 18 '16 at 08:58
  • This totally ignores instruction latency and caching mechanisms; I recommend profiling. – Marco A. Nov 18 '16 at 08:59
  • My two-pence worth: the conditional approach has very poor performance if the calling pattern (i.e. the value of `cond` across multiple invocations of `foo1`) is not very predictable (read: the branch predictor does poorly). If that is the case, any of the approaches that avoid the branch will be significantly better. As pointed out, this is best established by profiling. – Smeeheey Nov 18 '16 at 09:06
  • Since C and C++ don't specify such things, I'm assuming this question is trying to be about specific systems and machine code. Please specify which and update the tags accordingly. – 2501 Nov 18 '16 at 09:38
  • As you can see in the assembly output for all your tests, using `bool` instead of `int` is counterproductive. You should try defining `cond` as `int`, with a boolean value (`0` or `1`). – chqrlie Nov 21 '16 at 23:38
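
A minimal sketch of the `__builtin_expect` hint mentioned in the comments above (a GCC/Clang extension; the function name `foo1_hint` and the assumption that `cond` is usually true are mine, not part of the question):

    unsigned int foo1_hint(bool cond, unsigned int num)
    {
        /* Hint that cond is expected to be true so the compiler can favour
           that path in its code layout; the result is unchanged, and whether
           this helps at all depends on the real call pattern. */
        if (__builtin_expect(cond, 1))
            return num;
        return 0;
    }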

3 Answers


Multiplications and memory accesses frequently take more time than a simple if statement. If you want to optimize this code, the best way would be to use only "and"/"or" instructions (and mark the function `inline` to avoid a call, by the way).

Here is an 'optimized' example of your function using masks instead of booleans:

inline unsigned int foo1(unsigned int mask, unsigned int num)
{
  return mask & num;
}

Your call would look like this:

foo1(0, 10);     /* Returns 0  */
foo1(~0, 10);    /* Returns 10 */
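
If the caller only has a `bool`, one way to build the mask at the call site (a sketch of mine, not part of the original answer) is the same negation trick as `foo4` in the question:

    unsigned int call_with_bool(bool cond, unsigned int num)
    {
        /* false -> 0x00000000, true -> 0xFFFFFFFF */
        unsigned int mask = 0u - (unsigned int)cond;
        return foo1(mask, num);   /* the mask-based foo1 above */
    }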
K.Hacene
  • You can also keep the original signature with `bool cond` if you write `-cond & num`. Using Clang, this actually compiles to the same code as the first example, with `cmovne`. – Jon Purdy Nov 18 '16 at 10:21
  • There's no telling if the condition is at all related to the variable value itself though. I don't see how bit masking would make sense... seems like obfuscation to me. – Lundin Nov 18 '16 at 10:52
  • Actually I don't like this answer. The question was specifically about a `bool`. If I already have a mask, then there is no question. – Elad Weiss Nov 18 '16 at 12:07

Optimizing code isn't always as easy as counting assembler instructions and CPU ticks.

The multiplication method is likely the fastest on most systems, since it removes a branch. The multiplication instruction should be reasonably fast on most CPU cores.

What you could consider, though, is whether you really need to use such large integer types. On small 8- or 16-bit CPUs, the following code would be significantly faster:

uint_fast16_t foo2(bool cond, uint_fast16_t num)
{
    return (uint_fast16_t)cond * num;
}

On the other hand, such CPUs rarely come with branch prediction or instruction cache.

You shouldn't need to worry about manual inlining; most compilers will inline a function this small automatically.

Lundin
  • *"The multiplication method is likely the fastest on most systems, since it removes a branch."* A conditional move is *not* a branch as far as the branch predictor is concerned, [see here](http://stackoverflow.com/a/14131292/3764814). Don't guess what's faster; just benchmark, it's the only way to be sure (a minimal timing sketch follows these comments). – Lucas Trzesniewski Nov 18 '16 at 12:33
  • @LucasTrzesniewski Of course it doesn't need to translate into a branch. My point here is that even if it doesn't, this method is likely the fastest anyway, since MUL instructions are fast. – Lundin Nov 18 '16 at 12:37
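
A minimal timing sketch along the lines suggested in the comments (entirely my own; it assumes the question's `foo1`, `foo2` and `foo4` are in scope, and the helper names `time_variant`/`make_conds` are hypothetical). Results depend heavily on the compiler, the hardware and how predictable the condition pattern is:

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    /* Time one variant over a pseudo-random pattern of conditions.
       The volatile sink keeps the compiler from discarding the loop. */
    template <typename F>
    long long time_variant(F f, const std::vector<char>& conds)
    {
        volatile unsigned int sink = 0;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < conds.size(); ++i)
            sink = sink + f(conds[i] != 0, static_cast<unsigned int>(i));
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    }

    std::vector<char> make_conds(std::size_t n)
    {
        std::mt19937 gen(42);                   /* fixed seed: reproducible pattern */
        std::bernoulli_distribution coin(0.5);  /* unpredictable 50/50 mix */
        std::vector<char> v(n);
        for (auto& c : v) c = coin(gen);
        return v;
    }

    int main()
    {
        const auto conds = make_conds(10000000);
        std::printf("foo1: %lld us\n", time_variant(foo1, conds));
        std::printf("foo2: %lld us\n", time_variant(foo2, conds));
        std::printf("foo4: %lld us\n", time_variant(foo4, conds));
    }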

After viewing all the enlightening answers and comments,

I believe this is the correct answer:

When getting to such levels of micro-optimization, there is no single 'best' choice, as it may vary depending on the platform, the OS and the software being written.

So it seems to me the correct approach, software-wise, would be to create more than one implementation and encapsulate them behind some abstraction, so they can be easily switched.

When benchmarking, switch between them to see which one yields the best results for the situation at hand.

Of course we can rule out solutions which are obviously worse than others.
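
A minimal sketch of such a switchable abstraction (my own; the macro names `BRANCHLESS_IMPL` and `COND_SELECT` are hypothetical), using `#define` so that switching implementations adds no extra function call:

    /* Select an implementation at compile time, e.g. with -DBRANCHLESS_IMPL=2. */
    #if   BRANCHLESS_IMPL == 1
      #define COND_SELECT(cond, num) ((cond) ? (num) : 0u)                  /* foo1 */
    #elif BRANCHLESS_IMPL == 2
      #define COND_SELECT(cond, num) ((unsigned int)(cond) * (num))         /* foo2 */
    #else
      #define COND_SELECT(cond, num) ((0u - (unsigned int)(cond)) & (num))  /* foo4 */
    #endif

    /* Call sites stay the same regardless of which variant is benchmarked:
       unsigned int r = COND_SELECT(cond, num); */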

Elad Weiss
  • This makes sense, just make sure your abstraction doesn't add another function call. It should be based on #define etc. – noamtm Nov 20 '16 at 16:17