23

I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"

int a, b, low;
if (a < b) low = a;
else low = b;

is more or less efficient than this:

int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));

(or the variation of putting `int delta = a - b` at the top and replacing instances of `a - b` with that).

I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and about the efficiency of if-else statements versus alternatives in general.

  • I know the title isn't very good, but I can't currently think of a better one. –  Jun 09 '10 at 20:32
  • 11
    This will depend greatly on your compiler and target CPU. I doubt that there's a generally true answer. Did you try benchmarking? – Eric J. Jun 09 '10 at 20:33
  • 8
    Any speed difference is negligible in this case. The efficiency in maintenance seems obvious. – Marc Jun 09 '10 at 20:35
  • 1
    You're most likely wasting your time - you shouldn't try to optimize fragment like this unless you CLEARLY identified it (using profiler) as a single major bottleneck (50+% of execution time) in your project. – SigTerm Jun 09 '10 at 20:39
  • 30
FFS people, he didn't ask your opinion on when to optimize, just some technical details about two separate approaches. – Justicle Jun 09 '10 at 20:46
  • Very similar question: http://stackoverflow.com/questions/315306/is-if-expensive – Adam Rosenfield Jun 09 '10 at 21:01
  • 1
    both GCC and LLVM do a surprisingly good job of making branch-less versions of things like your first answer. So I would just go with the more clear code. – Evan Teran Jun 09 '10 at 21:05
  • 1
@Justicle: The only answer to his actual question is "it depends" and asking it suggests that he may be well served by some observations on when and why you should even consider this kind of question. – dmckee --- ex-moderator kitten Jun 09 '10 at 21:41
  • 10
    With a decent compiler, `min(a,b)` should give you the optimal code - possibly faster than either, if it can use machine instructions that aren't directly available from C. Also, the second version isn't as portable, since right-shifting a negative value gives an implementation-defined result. – Mike Seymour Jun 09 '10 at 22:57
  • 2
    @SigTerm: that's an absurdly high standard. There isn't any one thing in my app's entire main loop that costs more than 3% of execution time, and yet I need to somehow speed the whole thing up by 15% in order to meet its latency requirement. – Crashworks Jun 09 '10 at 23:53
@Crashworks: This isn't an absurdly high standard. It is realistic. It is easy to get distracted and start "optimizing" code without profilers, get zero performance increase or even make it slower. If you make your 3% routine 20% faster (which isn't bad), that will save you only 0.5% of execution time. This is almost pointless and hard to notice. If there is nothing left to optimize, you need another algorithm or better hardware. If there is no piece of code that takes at least 20% of time (or more), it will be hard to optimize it further. – SigTerm Jun 10 '10 at 00:21
  • 4
    Or, you need to optimize a lot of things by a little bit apiece. That's the reality of achieving performance on fixed hardware. – Crashworks Jun 10 '10 at 00:30
  • 4
@Crashworks: I recently had the incredible opportunity to optimize something ([Hough transform](http://en.wikipedia.org/wiki/Hough_transform) running on ARM Android) from 140 msecs down to 8 msecs, a time reduction of 94%. And a conditional branch depending on a floating-point comparison was the culprit. Things like this happen rarely, yet for library writers (like OpenCV) and compiler writers (like GCC), it is a wake-up call. Your mathematical analysis is correct, though, and I recommend it. A (3% routine, 20% faster) is not going to justify the effort. A (30% routine, 94% faster) will. – rwong Sep 27 '14 at 04:53

16 Answers

32

(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)

One advantage of eliminating an if statement is that you avoid branch prediction penalties.

Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like

a = random() % 10;
if (a < 5)
  printf("Less\n");
else
  printf("Greater\n");

then this branch is not easily predicted, and will frequently incur the misprediction penalty associated with flushing the pipeline and rolling back instructions that were speculatively executed on the wrong side of the branch.

One way to avoid these kinds of penalties is to use the ternary (?:) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.

So

int a, b, low;
if (a < b) low = a;
else low = b;

becomes

int a, b, low;
low = (a < b) ? a : b;

and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.

Of course, this is a micro-optimization which is unlikely to have significant impact on your code.

danben
  • 26
    Finally, an answer that isn't bleating about premature optimization. Thank you. – Justicle Jun 09 '10 at 20:48
  • 11
    @Justicle - the problem with not bleating about premature optimization is that you end up with an implied suggestion (particularly to people who are just learning) that one should write code like `low = b + ((a - b) & ((a - b) >> 31))` everywhere for no good reason because someone said "it's faster". When, in fact, it's the wrong thing to do the vast majority of times. – Michael Burr Jun 09 '10 at 20:56
  • 13
    At `-O1` and higher, gcc produces identical code for the if statement and the ternary operator for the min() function, using a cmovg instruction in both cases. At `-O0`, it uses branches and labels for the if statement and cmovle for the ternary operator. – Adam Rosenfield Jun 09 '10 at 21:08
  • 1
    I agree that this is more readable, but it will certainly not be faster. See my answer. – BlueRaja - Danny Pflughoeft Jun 09 '10 at 21:21
  • "and in the second case a branching instruction is not necessary." +1 for binning in the branching overhead into the picture. Otherwise one has to keep in mind that modern CPUs (Intel ones included) have "conditional" operations which remove need for branching in trivial cases like that. – Dummy00001 Jun 10 '10 at 15:39
  • 1
    "However after running experiments on a wide range of compilers I have concluded that with the optimizer turned on, you are better off with a simple if-else statement." [Efficient C Tips #6 – Don’t use the ternary operator](http://embeddedgurus.com/stack-overflow/2009/02/efficient-c-tips-6-dont-use-the-ternary-operator/) – endolith Apr 02 '15 at 04:11
  • This assumes that `if(){}else{}` compiles to a branch. It's not always obvious for compilers whether to make branchy or branchless code, especially without profile-guided optimization. See [gcc optimization flag -O3 makes code slower then -O2](https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-then-o2) for a case where gcc chooses wrong and `-O3` is slower, but `-O3 -fprofile-generate` + run it + `-O3 -fprofile-use` is fast again (using a branch to keep the loop-carried dep chain short). – Peter Cordes Oct 22 '17 at 16:28
10

Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.

Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.

Bill the Lizard
  • 1
    @nategoose: Which processors? – Bill the Lizard Jun 09 '10 at 21:41
  • 2
    @Bill: many processors have a long instruction pipeline which must be flushed whenever there's a mispredicted branch, taking perhaps 10 or 20 cycles. In this case, the branch is likely to be mispredicted half the time, so the conditional version might take an average of 5 or 10 cycles, while the squiggly version takes 4 or 5. (Of course, other processors have conditional instructions, short pipelines and other ways to avoid misprediction, and then the conditional version will be faster). – Mike Seymour Jun 09 '10 at 23:09
  • And on the processor I mostly use, the first version takes 2 cycles, and the second takes 3. – Mike Seymour Jun 09 '10 at 23:33
  • 2
    On the in-order PowerPC processor used in many game consoles, an unpredicted branch is a 20 cycle bubble, and a *correctly* predicted branch is a 5 cycle bubble. x + ((y - x) & (a >> 31)) is 3 cycles due to dual-dispatch. The situation is even more extreme for floating point numbers, where the conditional-move has a throughput of 1/1 cycle, whereas branch on float comparison can be a *40* cycle bubble. – Crashworks Jun 09 '10 at 23:39
  • @nategoose, @Mike, @Crashworks: Well, that will teach me to make sweeping generalizations based on benchmarks from one machine. I stand corrected. – Bill the Lizard Jun 10 '10 at 01:31
  • @Bill the Lizard: The most extreme example would be most GPUs, I believe. On the other hand if memory serves there are some (but not all) conditional branch instructions on the AVR microcontrollers that take 3 cycles, which would take the same number of cycles as an 8 bit add, bitwise and, and shift by 1. – nategoose Jun 10 '10 at 03:49
  • Shameless plug: See [here](http://www.blueraja.com/blog/285/branchless-conditionals-compiler-optimization-technique) for more information :) – BlueRaja - Danny Pflughoeft May 14 '11 at 18:51
  • Obviously if the hand-coded branchless min was better than what gcc does with `cmov`, you'd wrap it in a `branchless_min()` function instead of writing out that obscure mess everywhere. (It's not, so you wouldn't, but if you had a compiler that was branch-happy, you can could solve the readability problem pretty trivially with a wrapper function.) – Peter Cordes Oct 22 '17 at 16:08
9

Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:

int foo1(int a, int b)
{
    int low;
    if (a < b) low = a;
    else low = b;
    return low;
}

int foo2(int a, int b)
{
    int low;
    low = b + ((a - b) & ((a - b) >> 31));
    return low;
}

I get:

foo1:
    cmpl    %edi, %esi
    cmovle  %esi, %edi
    movl    %edi, %eax
    ret

foo2:
    subl    %esi, %edi
    movl    %edi, %eax
    sarl    $31,  %eax
    andl    %edi, %eax
    addl    %esi, %eax
    ret

...which I'm pretty sure won't incur branch-prediction penalties, since the code doesn't jump. Also, the non-if-statement version is 2 instructions longer. I think I will continue coding and let the compiler do its job.

Thanatos
  • 1
    You're correct, `cmovcc` is a data dependency, not a branch-predicted control dependency. This can be good, but can also be bad if a branch would have predicted well and broken a loop-carried dependency chain. Use profile-guided optimization to help compilers choose between branchy and branchless. – Peter Cordes Oct 22 '17 at 15:12
7

The biggest problem is that your second example won't work on 64-bit machines.

However, even neglecting that, modern compilers are smart enough to consider a branchless version wherever possible and compare the estimated speeds. So your second example will most likely actually be slower.

There will be no difference between the if statement and the ternary operator, as even the dumbest compilers are smart enough to recognize this special case.


[Edit] Because I think this is such an interesting topic, I've written a blog post on it.

BlueRaja - Danny Pflughoeft
  • 3
    I've looked at the assembly output of MSVC and GCC, and neither of them seem smart enough to emit branchless conditional moves half the time I want them to. – Crashworks Jun 09 '10 at 23:40
  • @Crashworks: That means the compiler decided the branchless conditional is actually slower (branchless conditionals require more clocks, but don't have the possibility of clearing the instruction pipeline) – BlueRaja - Danny Pflughoeft Jun 10 '10 at 00:00
  • The second example won't even work on all 32 bit machines, since the behaviour of the `>>` operator on negative values is implementation-defined (it's allowable for the compiler to implement it with a logical shift rather than arithmetic shift). – caf Jun 10 '10 at 00:01
  • 6
    Yes, but the compiler was wrong when it decided that. I've timed both pathways. My job consists of cramming more work into 16.6 milliseconds than the competing product can. In general, I have seen compilers emit many suboptimal code sequences. They are not perfect. – Crashworks Jun 10 '10 at 00:03
  • @Crashworks: In that case you should take your most critical portions and rewrite them in highly-optimized assembly – BlueRaja - Danny Pflughoeft Jun 10 '10 at 01:44
  • 5
    I sometimes do, but it's often easier to meet the compiler halfway and write code in such a way that it results in the code sequence I want; intrinsics in particular are an example of this. That's much easier to intermingle with other C++ code than inline assembly. It's a common practice in the embedded world; part of the job is learning what the compiler will emit for particular inputs. – Crashworks Jun 10 '10 at 01:52
  • @Crashworks: the scenario you're describing seems like it's one that's actually worthy of worrying about the optimization. The other 99+% of the cases there's no value in it, so the code should be written to be and correct and readable (which may well still involve the conditional operator). – Michael Burr Jun 11 '10 at 20:32
  • 3
    In practice I wrote an `isel(a,b,c)` function which has the same effect as `return a >= 0 ? b : c` . We just use that. (It was named by analogue to the `fsel` intrinsic, which is the hardware's native floating point conditional-move.) It would be better if the compiler were just smart enough to emit the right code for `?:`, but we haven't got a smart compiler, just GCC. – Crashworks Jun 11 '10 at 20:52
7

Like with any low-level optimization, test it on the target CPU/board setup.

On my compiler (gcc 4.5.1 on x86_64), the first example becomes

cmpl    %ebx, %eax
cmovle  %eax, %esi

The second example becomes

subl    %eax, %ebx
movl    %ebx, %edx
sarl    $31, %edx
andl    %ebx, %edx
leal    (%rdx,%rax), %esi

Not sure if the first one is faster in all cases, but I would bet it is.

Cubbi
4

Either way, the assembly will only be a few instructions, and either way it'll take nanoseconds for those instructions to execute.

I would profile the application and concentrate your optimization efforts on something more worthwhile.

Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.

For simple statements like this, I find the ternary operator very intuitive:

low = (a < b) ? a : b;

Clear and concise.

Ben S
4

For something as simple as this, why not just experiment and try it out?

Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.

I wrote a simple program that compares both techniques, passing in random numbers (so that we don't see perfect branch prediction), with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iterations? Less than 50 ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if into a cmovl instruction, avoiding a branch altogether.

Michael
2

One thing to be wary of when you get into really bit-fiddly kinds of hacks is how they may interact with compiler optimizations that take place after inlining. For example, the readable procedure

int foo (int a, int b) {
   return ((a < b) ? a : b);
}

is likely to be compiled into something very efficient in any case, but in some cases it may be even better. Suppose, for example, that someone writes

int bar = foo (x, x+3);

After inlining, the compiler will recognize that 3 is positive, and may then make use of the fact that signed overflow is undefined to eliminate the test altogether, to get

int bar = x;

It's much less clear how the compiler should optimize your second implementation in this context. This is a rather contrived example, of course, but similar optimizations actually are important in practice. Of course you shouldn't accept bad compiler output when performance is critical, but it's likely wise to see if you can find clear code that produces good output before you resort to code that the next, amazingly improved, version of the compiler won't be able to optimize to death.

dfeuer
Kinda obvious that (x+3 > x), so of course it should optimize it away. – andersfylling Aug 28 '17 at 08:07
  • 1
    @andersfylling: Hardly. With `unsigned x`, where overflow is defined to wrap around, `x+3 > x` isn't true for all possible inputs, so the optimization isn't safe [and you get `lea / cmp / cmov` from gcc and clang for x86-64](https://godbolt.org/g/VXZDje). Hmm, compilers could shorten the critical path by comparing `x` against constant (`UINT_MAX - 3`) so it could run in parallel with the `lea`. – Peter Cordes Oct 22 '17 at 15:41
1

One thing I haven't seen mentioned is that an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or worse yet, pairs of numbers scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.

Torlack
  • This is a comment at best, not an answer. A branch mispredict can reduce the throughput of other slow stuff; OOO execution can't hide the latency of a branch miss if the cache-miss load doesn't even start until after the branch is correctly resolved. – Peter Cordes Oct 22 '17 at 15:22
1

I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.

Desktop/server CPUs are optimized for pipelining. The second is theoretically faster because the CPU doesn't have to branch and can utilize multiple ALUs to evaluate parts of the expression in parallel. More non-branching code with intermixed independent operations is best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions, which allow the first version to be made branch-less too.)

On embedded CPUs, branching is often less expensive (relative to everything else), and they don't have many spare ALUs to evaluate operations out of order (that's if they support out-of-order execution at all). Less code/data is better, since caches are small too. (I have even seen bubble sort used in embedded applications: the algorithm uses the least memory/code and is fast enough for small amounts of data.)

Important: do not forget about compiler optimizations. Using many tricks, compilers can sometimes remove the branching themselves: inlining, constant propagation, refactoring, etc.

But in the end I would say that yes, the difference is too minuscule to be relevant. In the long term, readable code wins.

The way things go on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL capable.

Dummy00001
0

Why `low = a;` in the if and `low = b;` in the else? And why 31? If 31 has anything to do with CPU word size, what if the code is to be run on a CPU of a different word size?

The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.

vpit3833
  • If the non-portable implementation was actually useful, you'd obviously wrap it in a `branchless_min()` function instead of manually inlining it everywhere. And yes it assumes 32-bit 2's complement signed integer + arithmetic right shifts. Of course it's not actually useful because compilers generate better branchless code using cmov, but this still doesn't answer the question. – Peter Cordes Oct 22 '17 at 15:29
0

profile results with gcc -o foo -g -p -O0, Solaris 9 v240

 %Time Seconds Cumsecs  #Calls   msec/call  Name
  36.8    0.21    0.21 8424829      0.0000  foo2
  28.1    0.16    0.37       1    160.      main
  17.5    0.10    0.47 16850667     0.0000  _mcount
  17.5    0.10    0.57 8424829      0.0000  foo1
   0.0    0.00    0.57       4      0.      atexit
   0.0    0.00    0.57       1      0.      _fpsetsticky
   0.0    0.00    0.57       1      0.      _exithandle
   0.0    0.00    0.57       1      0.      _profil
   0.0    0.00    0.57    1000      0.000   rand
   0.0    0.00    0.57       1      0.      exit

code:

int
foo1 (int a, int b, int low)        
{
   if (a < b) 
      low = a;   
   else 
      low = b;         
   return low;
}

int                       
foo2 (int a, int b, int low)
{
   low = (a < b) ? a : b;
   return low;
}


int main()
{
    int low=0;
    int a=0;
    int b=0;
    int i=500;
    while (i--)
    {
       for(a=rand(), b=rand(); a; a--)
       {
           low=foo1(a,b,low);
           low=foo2(a,b,low);
       }        
    }
    return 0;
}

Based on the data, in the above environment, the exact opposite of several beliefs stated here was found to be true. Note the 'in this environment': the if construct was faster than the ternary ?: construct.

jim mcnamara
  • 1
    However, compiling here, `gcc -O2 -S -o output.S input.c`, `foo1` and `foo2` compile to exactly the same 4 instructions. (Linux, gcc 4.3.4, amd64 (core 2 duo)) – Thanatos Jun 09 '10 at 23:12
  • That was the whole point and why "bleating" about profiling is meaningful. Thanks. – jim mcnamara Jun 09 '10 at 23:27
  • 1
    Timing with `-O0` is total nonsense, unless you're a compiler writer trying to improve the performance of debug builds. `-O0` isn't just a linear slow-down that slows down everything by some constant factor; see https://stackoverflow.com/questions/32000917/c-loop-optimization-help-for-final-assignment/32001196#32001196 – Peter Cordes Oct 22 '17 at 15:17
0

I wrote a ternary logic simulator not so long ago, and this question was relevant to me, as it directly affects my interpreter's execution speed; I needed to simulate tons and tons of ternary logic gates as fast as possible.

In a binary-coded-ternary system, one trit is packed into two bits. The most significant bit means negative, and the least significant means positive. The case "11" should not occur, but it must be handled properly and treated as 0.

Consider the inline function `int bct_decoder( unsigned bctData )`, which should return our formatted trit as a regular integer -1, 0 or 1. As I observed, there are 4 approaches; I called them "cond", "mod", "math" and "lut". Let's investigate them.

The first is based on jz|jnz and jl|jb conditional jumps, thus "cond". Its performance is not good at all, because it relies on the branch predictor. Even worse, it varies, because it is unknown a priori whether there will be one branch or two. Here is an example:

inline int bct_decoder_cond( unsigned bctData ) {
   unsigned lsB = bctData & 1;
   unsigned msB = bctData >> 1;
   return
       ( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
       ( lsB > msB ) ? 1 : -1;
}

This is the slowest version; it could involve 2 branches in the worst case, and this is where binary logic fails. On my 3770k it produces around 200 MIPS on average on random data. (Here and below, each test is the average of 1000 runs on a randomly filled 2 MB dataset.)

The next one relies on the modulo operator, and its speed is somewhere between the first and third, but it is definitely faster: 600 MIPS:

inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}

The next one is the branchless approach, which involves only math, thus "math"; it does not use jump instructions at all:

inline int bct_decoder_math( unsigned bctData ) {
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}

This does what it should and behaves really well. For comparison, the performance estimate is 1000 MIPS, 5x faster than the branched version. The branched version is probably slowed down by the lack of native 2-bit signed int support. But in my application it is quite a good version in itself.

If this is not enough, then we can go further with something special. Next is the lookup table approach:

inline int bct_decoder_lut( unsigned bctData ) {
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}

In my case one trit occupied only 2 bits, so the LUT was only 2b*4 = 8 bytes and was worth trying. It fits in cache and works blazingly fast at 1400-1600 MIPS; this is where my measurement accuracy starts going down. That is a 1.5x speedup over the fast math approach, because you just have a precalculated result and a single AND instruction. Sadly, caches are small, and (if your index length is greater than several bits) you simply cannot use it.

So I think I have answered your question on what branched/branchless code can look like, with detailed samples, a real-world application, and real performance measurements.

xakepp35
0

Updated answer taking the current (2018) state of compiler vectorization. Please see danben's answer for the general case where vectorization is not a concern.

TLDR summary: avoiding ifs can help with vectorization.

Because SIMD would be too complex to allow branching on some elements but not others, any code containing an if statement will fail to be vectorized unless the compiler knows a "superoptimization" technique that can rewrite it into a branchless set of operations. I don't know of any compilers that do this as an integrated part of the vectorization pass (Clang does some of this independently, but not specifically toward helping vectorization, AFAIK).

Using the OP's provided example:

int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));

Many compilers can vectorize this to be something approximately equivalent to:

__m128i low128i(__m128i a, __m128i b){
  __m128i diff, tmp;
  diff = _mm_sub_epi32(a,b);
  tmp = _mm_srai_epi32(diff, 31);
  tmp = _mm_and_si128(tmp,diff);
  return _mm_add_epi32(tmp,b);
}

This optimization requires the data to be laid out in a fashion that allows for it, but it could be extended to __m256i with AVX2 or __m512i with AVX-512 (and loops could even be unrolled further to take advantage of additional registers), or to other SIMD instructions on other architectures. Another plus is that these are all low-latency, high-throughput instructions (latencies of ~1 cycle and reciprocal throughputs in the range of 0.33 to 0.5, so really fast relative to non-vectorized code).

I see no reason why compilers couldn't optimize an if statement to a vectorized conditional move (except that the corresponding x86 operations only work on memory locations and have low throughput, and other architectures like ARM may lack them entirely), but it could be done with something like:

void lowhi128i(__m128i *a, __m128i *b){ // does both low and high
  __m128i _a=*a, _b=*b;
  __m128i lomask = _mm_cmpgt_epi32(_a,_b);
  __m128i himask = _mm_cmpgt_epi32(_b,_a);
  _mm_maskmoveu_si128(_b,lomask,(char*)a);
  _mm_maskmoveu_si128(_a,himask,(char*)b);
}

However this would have a much higher latency due to memory reads and writes and lower throughput (higher/worse reciprocal throughput) than the example above.

technosaurus
  • gcc and clang can do some simpler conversions of `if` into branchless. One major obstacle is that if the abstract machine doesn't write a memory location, it's not ok for the compiler-generated asm to read/rewrite it with the same value. So `_mm_maskmoveu_si128` can be correct where the other version isn't, but it's *slow* (NT store, so it evicts from cache, as well as being just plain slow). See [Is it possible to use SIMD instruction for replace?](https://stackoverflow.com/questions/48254965/is-it-possible-to-use-simd-instruction-for-replace/48285594#48285594): the AVX version is fast. – Peter Cordes Mar 13 '18 at 02:01
  • And BTW, SIMD CMOV between registers is called a blend, and is somewhat fast. Like `blendvps`. Or with AVX512, conditional move is built-in to everything with mask registers. – Peter Cordes Mar 13 '18 at 02:02
-1

Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.

My simple thought, though, is that the if would be quicker because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.

Jesse Jashinsky
-1

If it is for Gnu C++, try this

int min = i <? j;

I have not profiled it but I think it is definitely the one to beat.

Syd
  • 1
    I don't know what Gnu C++ is, but I don't like its syntax. – Dennis Zickefoose Jun 09 '10 at 23:18
  • 1
Gnu C++ is of course the C++ compiler from GCC (the Gnu Compiler Collection). IIRC they've deprecated this form. Just use `std::min(i,j)`. It's unlikely GCC's `std::min` is slower than this. – MSalters Jun 10 '10 at 09:11