
In the article Linear vs. Binary Search, there is a fast implementation of binary search which uses the CMOV instruction. I would like to implement this in VC++ as the application I'm working on relies on the performance of Binary Search.

The implementation has some GCC inline assembler which is declared as follows:

static int binary_cmov (const int *arr, int n, int key) {
    int min = 0, max = n;
    while (min < max) {
            int middle = (min + max) >> 1;

            asm ("cmpl %3, %2\n\tcmovg %4, %0\n\tcmovle %5, %1"
                 : "+r" (min),
                   "+r" (max)
                 : "r" (key), "g" (arr [middle]),
                   "g" (middle + 1), "g" (middle));

            // Equivalent to 
            // if (key > arr [middle])
            //    min = middle + 1;
            // else
            //    max = middle;
    }
    return min;
}

I'd like to convert this GCC assembler to Microsoft Visual Studio compatible assembler, but as a GCC noob I wouldn't know where to start.

Can anyone help, or at least explain the GCC Inline Assembler? TIA!

Dr. Andrew Burnett-Thompson
  • Short answer: Don't do it. Keep it in C and let the compiler take care of it. – Jester Jun 12 '16 at 20:34
  • Whenever I see someone drop down to inline asm, my first though is "that guy is doing it wrong". Sure, there are *rare* cases where it makes sense, but in 99% of cases one should just write readable C or C++ and let the compiler deal with the details (It'll usually do a better job). – Jesper Juhl Jun 12 '16 at 20:38
  • Hi Jesper, good point, however for a performance critical algorithm there is a good case for optimization. If you read the linked article above, Binary Search with CMOV outperforms without CMOV by a factor of ~2. I have viewed the assembly output by Visual Studio for the simple while loop and it does not generate the CMOV instruction but instead generates a branch. It is this unpredictable branch that results in the factor of 2 performance improvement when using CMOV. – Dr. Andrew Burnett-Thompson Jun 12 '16 at 20:43
  • Have you actually profiled your code to verify that your binary search function is in fact a critical bottleneck? Do you know that the branches aren't being predicted correctly? Are you searching a "small to medium sized sorted arrays of integers" just like in article you linked? If you can't say yes to all that then there's no point in trying to port this code. – Ross Ridge Jun 12 '16 at 21:33
  • Have you tried it with msvc 2015 u3? It should generate cmovs for that code in your case (via new optimizer) – Yakk - Adam Nevraumont Jun 12 '16 at 21:36
  • @Dr.ABT Possibly you need to make sure that the compiler is allowed to generate pentium instructions. – fuz Jun 12 '16 at 21:45
  • A faster sequence is possibly `cmpl min,max; adc $0,middle`. Not sure if it is correct though, likely you need to do `cmpl max,min` instead. – fuz Jun 12 '16 at 21:47
  • @Yakk - very interesting. I am using VS2013. You mention msvc 2015 update 3? Only update 2 is out. Is that what you meant? – Dr. Andrew Burnett-Thompson Jun 12 '16 at 23:08
  • Update 3 is currently a Release Candidate (as of June 7th), but is available for download. https://www.visualstudio.com/downloads/visual-studio-prerelease-downloads – Michael Petch Jun 12 '16 at 23:10
  • @RossRidge yes, I have profiled, it is a critical bottleneck. Yes, it's sorted values. Yes. I have profiled it. I have already gained a factor of 100x speedup by moving from C# generics to C++ templates and P/Invoke. An additional 2 fold would be nice ;) – Dr. Andrew Burnett-Thompson Jun 12 '16 at 23:19
  • Thanks @MichaelPetch I'm downloading it now :) – Dr. Andrew Burnett-Thompson Jun 12 '16 at 23:27
  • Btw, do try `std::sort` as well. It is both the kind of thing library writers optimize, and the thing compiler writers profile. – Yakk - Adam Nevraumont Jun 12 '16 at 23:37
  • Obviously the values are sorted. Are you searching **small to medium sized arrays of integers** where small to medium means less than the maximum of 65536 elements tested in the article? If you're not searching integers then the assembly code **won't work**. If you're not searching contiguous arrays of integers that fit entirely within the L1 or L2 cache then memory is likely a bigger bottleneck. – Ross Ridge Jun 13 '16 at 00:13
  • Hi Ross, typical array size is 1024. The idea is to test out this alternative algorithm and benchmark it, which is a valid approach. I have done my homework! :) – Dr. Andrew Burnett-Thompson Jun 13 '16 at 07:54
  • To make the question *title* less confusing, please understand that [Visual C++ x64 Compiler has no support for inline assembly](https://msdn.microsoft.com/en-us/library/wbk4z78b.aspx). Thus, this question is about maximizing performance in Visual C++ using C++ code and/or intrinsics (as explained in the link). Inline assembly is already out of question. – rwong Aug 19 '16 at 16:54

4 Answers

Refactoring the code in order to better express intent to the compiler:

int binary_cmov (const int *arr, int n, int key) 
{
    int min = 0, max = n;
    while (min < max) 
    {
      int middle = (min + max) >> 1;
      min = key > arr[middle] ? middle + 1 : min;
      max = key > arr[middle] ? max : middle;
    }
    return min;
}

With gcc 5.3 -O3, this yields:

binary_cmov(int const*, int, int):
        xorl    %eax, %eax
        testl   %esi, %esi
        jle     .L4
.L3:
        leal    (%rax,%rsi), %ecx
        sarl    %ecx
        movslq  %ecx, %r8
        leal    1(%rcx), %r9d
        movl    (%rdi,%r8,4), %r8d
        cmpl    %edx, %r8d
        cmovl   %r9d, %eax
        cmovge  %ecx, %esi
        cmpl    %eax, %esi
        jg      .L3
        rep ret
.L4:
        rep ret

Moral of the story: don't embed assembler. All you do is make the code non-portable.

Going further...

Why not express intent even more explicitly?

#include <utility>

template<class...Ts>
  auto sum(Ts...ts)
{
  std::common_type_t<Ts...> result = 0;
  using expand = int[];
  void(expand{ 0, ((result += ts), 0)... });
  return result;
}

template<class...Ts>
  auto average(Ts...ts)
{
  return sum(ts...) / sizeof...(Ts);
}

int binary_cmov (const int *arr, int n, int key) 
{
    int min = 0, max = n;
    while (min < max) 
    {
      int middle = average(min, max);
      auto greater = key > arr[middle];
      min = greater ? middle + 1 : min;
      max = greater ? max : middle;
    }
    return min;
}

Compiler output:

binary_cmov(int const*, int, int):
        xorl    %eax, %eax
        testl   %esi, %esi
        jle     .L4
.L3:
        leal    (%rax,%rsi), %ecx
        movslq  %ecx, %rcx
        shrq    %rcx
        movslq  %ecx, %r8
        leal    1(%rcx), %r9d
        movl    (%rdi,%r8,4), %r8d
        cmpl    %edx, %r8d
        cmovl   %r9d, %eax
        cmovge  %ecx, %esi
        cmpl    %eax, %esi
        jg      .L3
        rep ret
.L4:
        rep ret
Richard Hodges
  • Just an observation. The OP linked to an article that was written in 2010, and it may well have been possible that the _GCC_ compiler back then didn't do as well with such optimizations. I've seen plenty of inline assembler code originally written to overcome problems with poorer code generation of earlier compilers. Often that code is never reviewed later on as compilers developed better optimization techniques. – Michael Petch Jun 12 '16 at 22:29
  • @MichaelPetch fair, but I would argue that this is a further reason to avoid hand-optimisation. Better to report the poor code generation to the compiler team and let them solve it. Solving it once in the compiler solves it for every program written - past, present and future. – Richard Hodges Jun 12 '16 at 22:37
  • I'm not disagreeing, as I said it was an observation (I was the upvoter). However, if in the past there was a requirement for there to be better speed and the code generation did output poorer results most businesses would probably want the faster code with hand optimized assembly. Most businesses won't wait for the compilers to catch up when there are deadlines. Businesses work on a different time frame than compiler developers. – Michael Petch Jun 12 '16 at 22:39
  • This is very interesting, as I did try this (using ternary operator) and no CMOV instruction was generated. However ... @Yakk points out in comments above that I need to try VS2015. I am using VS2013. – Dr. Andrew Burnett-Thompson Jun 12 '16 at 23:04
  • @MichaelPetch "Businesses work on a different time frame than compiler developers" Indeed they do :) – Dr. Andrew Burnett-Thompson Jun 12 '16 at 23:06
  • @MichaelPetch GCC 4.4 from 2009 generates CMOV instructions for this code. – Ross Ridge Jun 13 '16 at 00:20
  • @MarcGlisse you're right. It seems to be a regression. – Richard Hodges Jun 13 '16 at 19:09

Making a small change to Richard Hodges' code also allows Visual Studio 2013 to use the CMOV instruction:

int binary_cmov (const int *arr, int n, int key) 
{
    int min = 0, max = n;
    while (min < max) 
    {
      int middle = (min + max) >> 1;
      int middle1 = middle + 1;
      min = key > arr[middle] ? middle1 : min;
      max = key > arr[middle] ? max : middle;
    }
    return min;
}

Compiling with `cl /Ox /Fa /c t286.c` generates:

; Listing generated by Microsoft (R) Optimizing Compiler Version 18.00.40629.0 

binary_cmov PROC
    mov QWORD PTR [rsp+8], rbx
; Line 3
    xor eax, eax
    mov ebx, r8d
    mov r11d, edx
; Line 4
    test    edx, edx
    jle SHORT $LN9@binary_cmo
$LL2@binary_cmo:
; Line 6
    lea r10d, DWORD PTR [r11+rax]
    sar r10d, 1
; Line 8
    movsxd  r8, r10d
    lea r9d, DWORD PTR [r10+1]
    cmp ebx, DWORD PTR [rcx+r8*4]
; Line 9
    cmovg   r10d, r11d
    cmovg   eax, r9d
    mov r11d, r10d
    cmp eax, r10d
    jl  SHORT $LL2@binary_cmo
$LN9@binary_cmo:
; Line 12
    mov rbx, QWORD PTR [rsp+8]
    ret 0
binary_cmov ENDP

This also works with the Visual Studio 2015 and 2010 compilers. With Visual Studio the trick seems to be to use ternary operators with simple variables as the second and third operands. If you replace `middle1` with `middle + 1` in the first ternary operator above then Visual Studio only generates one CMOV instruction for this function; the first ternary operator generates a branch instead.

As mentioned by Yakk in a comment, the forthcoming Visual Studio 2015 Update 3 compiler contains a major update to the optimizer that should change this behaviour, making it more likely to generate CMOV instructions where appropriate.

Ross Ridge
  • Interesting that you don't get a `cmov` more easily. I guess one branch is tempting for the optimizer, compared to two `cmov`s. With many compilers, using an assignment that always happens (e.g. ternary instead of `if`) is the main trick to getting a `cmov` emitted. – Peter Cordes Jun 13 '16 at 02:41
  • @PeterCordes With GCC the equivalent two if statements will also end up generating CMOV instructions. Microsoft's compiler appears to need ternary operators, though it looks like Visual Studio 2015 Update 3 is going to change that with its new SSA-based optimizer. – Ross Ridge Jun 13 '16 at 04:28

People have already pointed out (repeatedly) the fact that you should avoid inline asm if possible. I agree with all of them.

That said, you also asked "at least explain the GCC Inline Assembler." That part may be moot, but FWIW:

Starting with the assembler template:

"cmpl %3, %2\n\tcmovg %4, %0\n\tcmovle %5, %1"

There are a couple things going on here that might look weird to a VS programmer.

  1. cmpl - By default, gcc's asm uses AT&T syntax instead of Intel's. This results in a number of things being subtly different. For example, the l here indicates that the comparison is to be applied to longs (where long in this context means 4 bytes). Another difference is that the operand order is reversed, so Intel syntax might use `mov eax, 1`, but AT&T would use `movl $1, %eax` (the dollar sign indicates a constant, % marks a register). There are sites that describe all the differences.
  2. \n\t - gcc is going to output this code to the assembler. To put each of these 3 instructions on its own line, you add '\n' for a newline and '\t' for a tab to line things up nicely.
  3. %0-%5 - You should think of the assembler template (the string containing the assembler) rather like the format string that you send to printf. Certain parts are printed literally, and other parts get replaced (think: printf("2 + 2 = %d\n", 2+2);). In the same way, gcc will replace parts of the template with the following parameters. %0 is the first parameter and %5 is the 6th (numbered in the order they appear).

That brings us to the rest of the command. Items that come after the first colon are output parameters:

             : "+r" (min),
               "+r" (max)

The '+' here indicates that the variables are both read and written. If they were output only, you would use '='. If you were going to read but not modify, you'd put it as an input parameter (i.e. after the next colon). The 'r' indicates that the values must be moved into registers before executing the asm. It leaves the decision about which register to the compiler. The programmer doesn't need to know; he can just use %0 for min and %1 for max and the token will be replaced as appropriate.

Which brings us to the input parameters. Pretty much what you'd expect, except that 'g' is a general constraint type. In theory it allows the values to be stored in a register, memory, or an immediate value. Mostly it just means register.

             : "r" (key), "g" (arr [middle]),
               "g" (middle + 1), "g" (middle));

There are pages of docs about gcc's inline asm (see https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html) that describe all this in painful detail. There are also a few pages that talk about the constraints other than 'r' and 'g' (https://gcc.gnu.org/onlinedocs/gcc/Constraints.html).
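
To see how these pieces fit together, here is a minimal sketch of my own (purely illustrative; this is not the code from the question or the article) using the same "+r"/"r" constraint style to get a branchless max:

// Illustrative sketch only: a branchless max(a, b) using cmov.
// %0 is 'result' ("+r": read and written), %1 is 'b' ("r": input register).
static int max_cmov(int a, int b)
{
    int result = a;
    asm ("cmpl %1, %0\n\t"   // AT&T operand order: sets flags from result - b
         "cmovl %1, %0"      // if result < b, copy b into result
         : "+r" (result)
         : "r" (b));
    return result;
}

When gcc expands the template, %0 and %1 are replaced with whichever registers it picked for result and b, exactly like the %0-%5 substitutions in the template above.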

As you might imagine, it is incredibly easy to get this wrong. And if you do, you probably won't get a nice, comprehensible compiler error. Instead what will happen is that it will appear to work, then something else a dozen lines later will fail for no obvious reason.

Which is why friends don't tell friends to use inline assembler.

Hey, you asked...

David Wohlferd
  • I'm part way through writing up an answer explaining exactly how that specific GNU C implementation is sub-optimal. (e.g. an extra sign-extension instruction in 64bit mode from using 32bit variables and 64bit pointers, and `key` can't turn into an immediate constant; it has to waste a register). Also of course `"g"` can result in a build failure if `middle` is a compile-time constant after inlining and unrolling: `cmov` can't have an immediate source. Anyway, if you were planning on adding any of that, don't bother, since I'm planning to at some point soon. – Peter Cordes Jun 13 '16 at 05:55
  • @PeterCordes - Wasn't planning on it. I did consider discussing symbolic names and the "cc" clobber, but I figure by now this poor guy already has (way) more information on inlining than he ever believed existed. It seems like "don't do it" is becoming a popular answer for the inline asm tag. – David Wohlferd Jun 13 '16 at 07:11
  • A `"cc"` clobber is implicit for x86 and x86-64; no need to add confusion by bringing it up. I think when I post my answer, the main point is going to be "see how hard this was to get right? It took me a long time, and I'm fairly good at it". :P – Peter Cordes Jun 13 '16 at 07:18
  • While "cc" is indeed implicit on i386, it is not documented as such (despite my efforts). As such, it is an implementation detail, not supported behavior and so (theoretically) subject to change. Unlikely (*really* unlikely), but possible. And yeah, the 'how hard it is' was sort of my point too. And the real killer is you can never be absolutely certain you've got it right. Some small and (seemingly) unrelated change elsewhere might trigger problems years later. – David Wohlferd Jun 13 '16 at 07:55
  • @DavidWohlferd thank you! :) I am realizing it's incredibly easy to get this wrong. If I can achieve the same thing with a ternary operator, then I'm happy. – Dr. Andrew Burnett-Thompson Jun 13 '16 at 07:56
  • @DavidWohlferd: That would be a breaking change for most existing x86 inline asm. I *think* it's viewed by the compiler devs as part of the API, not as "being forgiving of bugs". i.e. omitting it can't be considered a bug, so breaking existing code (the way the optimizer breaks C code with undefined behaviour) would not be something they could justify. It's technically possible, sure, but not practically possible. Interesting point that it doesn't seem to be officially documented anywhere; I'm pretty sure I haven't seen it in the official docs. – Peter Cordes Jun 13 '16 at 08:38
  • BTW, I was [looking at the x86-64.org mailing list archives](http://stackoverflow.com/a/35619528/224132) out of curiosity, and found [the thread](http://www.x86-64.org/pipermail/discuss/2000-November/001209.html) where they decided that the amd64 backend would follow i386's choice of implicit `"cc"` clobber. – Peter Cordes Jun 13 '16 at 08:41
  • I agree that this behavior is extremely unlikely to change. But I'd still rather write my code following the documented requirements than violate them and hope it never becomes a problem. – David Wohlferd Jun 13 '16 at 09:27

With VS2015 Update 3, I could convince the compiler to use `cmova` and `cmovbe` by changing the second comparison from > to <=:

int binary_cmov(const int *arr, int n, int key) 
{
    int min = 0, max = n;
    while (min < max) 
    // cmp         r9,r8  
    // jb          binary_cmov+1B0h 
    {
        int middle = (min + max) >> 1;
        // lea         rdx,[r8+r9]  
        // shr         rdx,1  
        int middle1 = middle + 1;
        // lea         rax,[rdx+4]  
        min = key > arr[middle] ? middle1 : min;
        // cmp         r10,dword ptr [rdi+rdx*4]  
        // cmova       r9,rax  
        max = key <= arr[middle] ? middle : max;
        // cmovbe      r8,rdx  
    }
    return min;
}

Measuring performance on an i7-5600U with a real application (lookup of breakpoint addresses), linear search outperforms binary search up to arrays of about 512 entries. I think speed is limited by filling the cache, and above 512 entries binary search mainly performs better because it avoids fetching part of the table from memory.

The code for the linear pointer search with sentinel is:

// The array must be terminated with a value that is larger than
// any value searched for, e.g. INT_MAX (the sentinel).
// The index of the first equal-or-greater entry is returned.
int find_in_terminated_array(const int *arr, int key)
{
    const int *p = arr;
    while (key > *p)
        ++p;
    return (int)(p - arr);
}
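
For completeness, a hypothetical usage sketch (the data array and the key of 10 are made up for illustration); the caller is responsible for appending the INT_MAX sentinel before searching:

#include <limits.h>
#include <stdio.h>

// find_in_terminated_array as defined above

int main(void)
{
    // Sorted values, with INT_MAX appended as the sentinel.
    int data[] = { 2, 3, 5, 7, 11, 13, 17, 19, INT_MAX };

    // Index of the first entry >= 10 (here 4, the position of 11).
    int idx = find_in_terminated_array(data, 10);
    printf("first entry >= 10 is at index %d\n", idx);
    return 0;
}
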
Kees