20

In a program I wrote, 20% of the time is spent finding the minimum of 3 numbers in an inner loop, in this routine:

static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) m = c;
    return m;
}

Is there any way to speed this up? I am ok with assembly code too for x86/x86_64.

Edit: In reply to some of the comments:
* Compiler being used is gcc 4.3.3
* As far as assembly is concerned, I am only a beginner there. I asked for assembly here, to learn how to do this. :)
* I have a quad-core Intel 64 running, so MMX/SSE etc. are supported.
* It's hard to post the loop here, but I can tell you it's a heavily optimized implementation of the Levenshtein algorithm.
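For reference, the hot spot in a typical two-row Levenshtein DP looks like this (an illustrative sketch, not my actual loop; `min3` here plays the role of `min` above):

```c
#include <string.h>

static inline unsigned int
min3(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) m = c;
    return m;
}

/* Classic two-row dynamic-programming Levenshtein distance;
 * min3() is evaluated once per cell of the m-by-n DP table. */
unsigned int levenshtein(const char *s, const char *t)
{
    size_t m = strlen(s), n = strlen(t);
    unsigned int prev[n + 1], curr[n + 1];      /* C99 VLAs */

    for (size_t j = 0; j <= n; j++)
        prev[j] = j;
    for (size_t i = 1; i <= m; i++) {
        curr[0] = i;
        for (size_t j = 1; j <= n; j++) {
            unsigned int cost = (s[i - 1] != t[j - 1]);
            curr[j] = min3(prev[j] + 1,         /* deletion     */
                           curr[j - 1] + 1,     /* insertion    */
                           prev[j - 1] + cost); /* substitution */
        }
        memcpy(prev, curr, (n + 1) * sizeof prev[0]);
    }
    return prev[n];
}
```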

This is what the compiler is giving me for the non-inlined version of min:

.globl min
    .type   min, @function
min:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %edx
    movl    12(%ebp), %eax
    movl    16(%ebp), %ecx
    cmpl    %edx, %eax
    jbe .L2
    movl    %edx, %eax
.L2:
    cmpl    %ecx, %eax
    jbe .L3
    movl    %ecx, %eax
.L3:
    popl    %ebp
    ret
    .size   min, .-min
    .ident  "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
    .section    .note.GNU-stack,"",@progbits

The inlined version is within the -O2 optimized code; even my markers (mrk = 0xfefefefe, placed before and after the call to min()) are getting optimized away by gcc, so I couldn't get hold of it.

Update: I tested the changes suggested by Nils and ephemient; however, there's no perceptible performance boost from using the assembly versions of min(). I do get a 12.5% boost by compiling the program with -march=i686, though, which I guess is because the whole program benefits from the faster instructions gcc generates with this option. Thanks for your help, guys.

P.S. - I used the ruby profiler to measure performance (my C program is a shared library loaded by a ruby program), so I could only get the time spent in the top-level C function called by the ruby program, which ends up calling min() down the stack. Please see this question.

Sudhanshu

10 Answers

11

Assuming your compiler isn't out to lunch, this should compile down to two compares and two conditional moves. It isn't possible to do much better than that.

If you post the assembly that your compiler is actually generating, we can see if there's anything unnecessary that's slowing it down.

The number one thing to check is that the routine is actually getting inlined. The compiler isn't obligated to do so, and if it's generating a function call, that will be hugely expensive for such a simple operation.

If the call really is getting inlined, then loop unrolling may be beneficial, as DigitalRoss said, or vectorization may be possible.

Edit: If you want to vectorize the code, and are using a recent x86 processor, you will want to use the SSE4.1 pminud instruction (intrinsic: _mm_min_epu32), which takes two vectors of four unsigned ints each, and produces a vector of four unsigned ints. Each element of the result is the minimum of the corresponding elements in the two inputs.
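A minimal sketch of what that looks like with intrinsics (`min3_x4` is an illustrative name; this assumes your data can be laid out as arrays of four unsigned ints, and needs SSE4.1 — either compile with `-msse4.1` or use the target attribute as below):

```c
#include <immintrin.h>   /* _mm_min_epu32 == pminud (SSE4.1) */

/* Illustrative sketch: four independent 3-way unsigned mins per call.
 * The target attribute lets this compile without enabling SSE4.1 for
 * the whole translation unit (GCC/Clang extension). */
__attribute__((target("sse4.1")))
static void min3_x4(const unsigned int a[4], const unsigned int b[4],
                    const unsigned int c[4], unsigned int out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);
    /* per-lane unsigned minimum of the three vectors */
    __m128i vm = _mm_min_epu32(_mm_min_epu32(va, vb), vc);
    _mm_storeu_si128((__m128i *)out, vm);
}
```

Note that this only pays off if you can actually process four results at a time; round-tripping a single scalar through an XMM register usually costs more than the two scalar compares it replaces.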

I also note that your compiler used branches instead of conditional moves; you should probably try a version that uses conditional moves first and see if that gets you any speedup before you go off to the races on a vector implementation.

Stephen Canon
    +1 My guess is that any gains will come from the outer context, versus this function. – Michael Easter Jan 11 '10 at 03:30
  • The outer context is heavily optimized. It does computations over a database of 2.88 million strings. Before optimizations it used to give results in 4 seconds. After a week of heavy optimizations, this is down to 150 ms. Latest profile run turns up "min" on top with 20% time spent there. – Sudhanshu Jan 11 '10 at 03:56
  • 2
    My only comment would be looking into what's calling min all the time, and see if you can save calls to min itself. – Bill Lynch Jan 11 '10 at 04:06
  • Loop unrolling is one of the optimizations already there, along with several others. The routine is getting inlined, I can't find the "min" symbol in disassembled code. I am intrigued about the vectorization bit - maybe should go and read up on that. Thanks. – Sudhanshu Jan 11 '10 at 04:09
11

Make sure you are using an appropriate -march setting, first off. GCC defaults to not using any instructions that were not supported on the original i386 - allowing it to use newer instruction sets can make a BIG difference at times! On -march=core2 -O2 I get:

min:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %edx
    movl    12(%ebp), %ecx
    movl    16(%ebp), %eax
    cmpl    %edx, %ecx
    leave
    cmovbe  %ecx, %edx
    cmpl    %eax, %edx
    cmovbe  %edx, %eax
    ret

The use of cmov here may help you avoid branch delays - and you get it without any inline asm just by passing in -march. When inlined into a larger function this is likely to be even more efficient, possibly just four assembly operations. If you need something faster than this, see if you can get the SSE vector operations to work in the context of your overall algorithm.

bdonlan
  • +1 for the -march suggestion. I get a boost of 12.5% by just using that. :) – Sudhanshu Jan 12 '10 at 07:18
  • Obviously you want this to inline in real life, not pass args on the stack to a stand-alone function. But if not, you'd want to use `-fomit-frame-pointer`. (Which is on by default even for 32-bit code, in more recent GCC versions.) – Peter Cordes Oct 28 '20 at 21:21
  • On Skylake, note that `cmovbe` is unfortunately still 2 uops, because it needs both ZF and CF. CMOVcc that reads only CF, or only flags from the SPAZO group, is only a single uop, so `cmovb` would be better. (It doesn't matter whether you move or not on equal). See [this Q&A](https://stackoverflow.com/questions/49867597/what-is-a-partial-flag-stall). – Peter Cordes Oct 28 '20 at 21:23
6

My take on an x86 assembler implementation, GCC syntax. Should be trivial to translate to another inline assembler syntax:

int inline least (int a, int b, int c)
{
  int result;
  __asm__ ("mov     %1, %0\n\t"
           "cmp     %0, %2\n\t" 
           "cmovle  %2, %0\n\t"
           "cmp     %0, %3\n\t"
           "cmovle  %3, %0\n\t" 
          : "=r"(result) : 
            "r"(a), "r"(b), "r"(c)
          );
  return result;
}

New and improved version:

int inline least (int a, int b, int c)
{
  __asm__ (
           "cmp     %0, %1\n\t" 
           "cmovle  %1, %0\n\t"
           "cmp     %0, %2\n\t"
           "cmovle  %2, %0\n\t" 
          : "+r"(a) : 
            "%r"(b), "r"(c)
          );
  return a;
}

NOTE: It may or may not be faster than C code.

This depends on a lot of factors. Usually cmov wins if the branches are not predictable (on some x86 architectures); OTOH, inline assembler is always a problem for the optimizer, so the optimization penalty for the surrounding code may outweigh any gains.

Btw Sudhanshu, it would be interesting to hear how this code performs with your test data.

Nils Pipenbrinck
  • Does this work for unsigned integer comparisons as well? Sorry, if this sounds naive. – Sudhanshu Jan 11 '10 at 04:37
  • Whoops, I didn't see this before writing my own. Yes, you can do this unsigned; just change `cmovle` to `cmovbe`. – ephemient Jan 11 '10 at 04:40
  • 2
    As mentioned in my reply below, GCC does this optimization automatically once you pass in an appropriate `-march` flag - it's just that it's not in the instruction set of the original 80386, and GCC errs on the side of (extreme) caution :) – bdonlan Jan 11 '10 at 05:39
  • Nils, ephemient, bdonlan - all of these suggestions look good. Lemme get back to you with the results by tomorrow. Thanks for the help. – Sudhanshu Jan 11 '10 at 06:37
  • GCC does not do this optimization anymore. The optimization is still in GCC but it is disabled; the branching version is used instead. Reason: the compiler has a hard time guessing whether a branch is predictable or not, and to make sure branch prediction gets used it does not use cmovcc. – Nils Pipenbrinck Jan 11 '10 at 15:32
  • No, bdonlan is right. I didn't use `-march=` when I was testing earlier, but on my AMD Phenom, `-march=native` does use CMOV. (On the other hand, `-march=core2` doesn't.) – ephemient Jan 11 '10 at 16:31
  • You're missing an early-clobber of the `"=&r"` output. If the compiler picks the same register as one of the inputs, you're screwed. Also, don't start an asm template with `mov`; tell the compiler what the initial value should be, using an assignment and a `"+r"(result)`, or a matching constraint for `a`. – Peter Cordes Oct 28 '20 at 21:26
  • (Oh, your 2nd version has the optimization I was suggesting, avoiding the hard-coded `mov`.) – Peter Cordes Oct 28 '20 at 21:37
6

This drop-in replacement clocks in about 1.5% faster on my AMD Phenom:

static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    asm("cmp   %1,%0\n"
        "cmova %1,%0\n"
        "cmp   %2,%0\n"
        "cmova %2,%0\n"
        : "+r" (a) : "r" (b), "r" (c));
    return a;
}

Results may vary; some x86 processors don't handle CMOV very well.

ephemient
  • Nice.. better than my example. You can add a %-modifier for the b for some extra flexibility in the register allocation. – Nils Pipenbrinck Jan 11 '10 at 04:47
  • 3
    GCC will do this automatically with a proper `-march` setting, which will also help in other parts of the code. – bdonlan Jan 11 '10 at 05:31
  • Technically `"+r"` should be `"+&r"` because it's written before all the pure-inputs are read. GCC might currently choose not to have `a` and `b` share the same reg even if it knows they're the same, though. Also, on later Intel CPUs, `cmovae` is more efficient (reading only CF, not CF and ZF, so it's [only 1 uop on Skylake](https://stackoverflow.com/questions/49867597/what-is-a-partial-flag-stall) / https://uops.info.) – Peter Cordes Oct 28 '20 at 21:35
5

The SSE2 instruction extensions contain an integer min instruction that can choose 8 minimums at a time. See _mm_mulhi_epu16 in http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1046.htm

Mark Ransom
  • 6
    `_mm_mulhi_epu16` is an intrinsic for a vector 16 bit multiply high instruction -- not useful for computing a minimum of 32 bit integers. The intrinsic you actually want is `_mm_min_epu32`. – Stephen Canon Jan 11 '10 at 04:55
  • @StephenCanon That's not true, since `_mm_min_epu32` compares two packed `__m128i` values. What the OP needs is a horizontal minimum, which afaik doesn't exist in SSE. – Jakub Arnold Apr 06 '16 at 17:15
  • @JakubArnold: you need `_mm_min_epu32` twice, with each input in the low element of a separate vector. That can do 4 separate 3-way mins in parallel if you use the upper elements, but probably not worth the `movd` to/from XMM regs to use it for scalars if you need the result in integer regs. Otherwise worth considering; `movd` loads/stores are fine. – Peter Cordes Oct 28 '20 at 21:29
  • Or you need SSE4.1 `_mm_minpos_epu16` to do a horizontal unsigned min of one vector, but that's for 16-bit elements. `_mm_mulhi_epu16` doesn't seem useful at all, though; that's a high-half 16-bit multiply. (`pmulhuw`) – Peter Cordes Oct 28 '20 at 21:31
1

First, look at the disassembly. That'll tell you a lot. For example, as written, there are 2 if-statements (which means there are 2 possible branch mispredictions), but my guess is that a decent modern C compiler will have some clever optimization that can do it without branching. I'd be curious to find out.
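One branch-free formulation a compiler (or you) could use is the classic unsigned mask trick, sketched here for illustration (`min2`/`min3` are made-up names):

```c
/* Branch-free unsigned min: (a < b) evaluates to 0 or 1 in C, so
 * -(a < b) is an all-zeros or all-ones mask selecting between a and b. */
static inline unsigned int min2(unsigned int a, unsigned int b)
{
    unsigned int mask = -(unsigned int)(a < b);
    return b ^ ((a ^ b) & mask);   /* a if a < b, else b */
}

static inline unsigned int min3(unsigned int a, unsigned int b, unsigned int c)
{
    return min2(min2(a, b), c);
}
```

Whether this beats two compare-and-cmov pairs is an empirical question; it avoids branches entirely, but so does cmov.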

Second, if your libc has special built-in min/max functions, use them. GNU libc has fmin/fmax for floating-point, for example, and they claim that "On some processors these functions can use special machine instructions to perform these operations faster than the equivalent C code". Maybe there's something similar for uints.

Finally, if you're doing this to a bunch of numbers in parallel, there are probably vector instructions to do this, which could provide significant speedup. But I've even seen non-vector code be faster when using vector units. Something like "load one uint into a vector register, call vector min function, get result out" looks dumb but might actually be faster.

Ken
  • Thanks for your pointers Ken - I'll definitely check out the vector instructions, that I think Mark and Stephen are referring to as well. – Sudhanshu Jan 11 '10 at 04:30
0

If you are only doing one comparison per loop iteration, you might want to unroll the loop manually.

First, see if you can get the compiler to unroll the loop for you; if you can't, do it yourself. This will at least reduce the loop-control overhead...

DigitalRoss
0

You could try something like this to save on declaration and unnecessary comparisons:

static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{ 
    if (a < b)
    {
        if (a < c) 
             return a; 
        else 
             return c;
    }

    if (b < c)
        return b;
    else return c;
}
Paul Sasik
  • I doubt this'll be much better - the initial assignment is going to be turned into a rename in the compiler anyway, and now there's three branches taking up space in the branch predictor, not two. – bdonlan Jan 11 '10 at 03:31
  • 1
    This is two comparisons either way. The difference now is that you're branching instead of using conditional moves - I would guess this is likely to be slower. Even ignoring that you're tubing the pipeline. – Anon. Jan 11 '10 at 03:32
  • 1
    I think this computes the max of 3 inputs, not the min. At least for a = 5, b = 2, c = 3 – Michael Easter Jan 11 '10 at 03:32
  • 1
    Be careful here. Now there are extra branches and the resulting code is larger, both of which have their own downsides. (Also, this is max but it's clear what you meant.) – jason Jan 11 '10 at 03:32
  • @Anon: The compiler's not guaranteed to use conditional moves in the former case - but I agree, it's certainly going to have difficulty doing it here. – bdonlan Jan 11 '10 at 03:33
  • @Anon: the compiler is free to branch or use conditional moves in either case. (I suspect most compilers will emit conditional moves for both) – Stephen Canon Jan 11 '10 at 03:34
  • The original code always ends up with at least two comparisons and two assignments (possibly three assignments). My version should result in at most two comparisons. Now... I realize it's actually more code. This snippet is an alternative that MIGHT execute faster, or not. It's only a few lines of code that might be worth testing... – Paul Sasik Jan 11 '10 at 03:48
  • You always need two comparisons, since you have to examine at least two pairs of variables. If you look closely, both versions actually always do exactly two comparisons as well. And modern compilers will rename variables to eliminate the cost of local assignments in cases like this - I highly doubt it'll be any better... – bdonlan Jan 11 '10 at 03:53
  • 3
    Assignments are cheap. Seriously. Unless you have to hit memory, they're far cheaper than a missed branch. – Anon. Jan 11 '10 at 03:57
0

These are all good answers. At the risk of being accused of not answering the question, I would also look at the other 80% of the time. Stackshots are my favorite way to find code worth optimizing, especially if it is function calls that you find out you don't absolutely need.

Mike Dunlavey
-1

Yes, post assembly, but my naive optimization is:

static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) return c;
    return m;
}
Hamish Grubijan
  • 4
    Transformations of this nature can be done by just about any compiler (and it's non-trivial to say which form would be more efficient!) – bdonlan Jan 11 '10 at 03:32