Generating CMOV instructions using Microsoft compilers

Question

In an effort to eke out some cmov instructions on an intel core 2 running windows 7 pro I wrote the code below. All it does is take a string from the console as input, apply some shift operations to generate a random seed, and then pass that seed on to srand, for the generation of a small array of pseudorandom numbers. The pseudorandom numbers are then evaluated for whether they satisfy the predicate function ( more arbitrary bitshuffling ), and output a '*' or a '_'. The purpose of the experiment is to generate cmov instructions, but as you can see in the disassembly below, there are none.

Any tips on how to change the code or the cflags so that they'll be generated?

#include <iostream>
#include <algorithm>
#include <string>
#include <cstdlib>

bool blackBoxPredicate( const unsigned int& ubref ) {
   return ((ubref << 6) ^ (ubref >> 2) ^ (~ubref << 2)) % 15 == 0;
}

int main() {
   const unsigned int NUM_RINTS = 32;
   unsigned int randomSeed = 1;
   unsigned int popCount = 0;
   unsigned int * rintArray = new unsigned int[NUM_RINTS];
   std::string userString;

   std::cout << "input a string to use as a random seed: ";
   std::cin >> userString;

   std::for_each( 
      userString.begin(), 
      userString.end(), 
      [&randomSeed] (char c) {
         randomSeed = (randomSeed * c) ^ (randomSeed << (c % 7));
   });

   std::cout << "seed computed: " << randomSeed << std::endl;

   srand(randomSeed);

   for( int i = 0; i < NUM_RINTS; ++i ) {
      rintArray[i] = static_cast<unsigned int> (rand());
      bool pr = blackBoxPredicate(rintArray[i]);
      popCount = (pr) ? (popCount+1) : (popCount);

      std::cout << ((pr) ? ('*') : ('_')) << " ";
   }

   std::cout << std::endl;

   delete rintArray;
   return 0;
}

And used this makefile to build it:

OUT=cmov_test.exe
ASM_OUT=cmov_test.asm
OBJ_OUT=cmov_test.obj
SRC=cmov_test.cpp
THIS=makefile

CXXFLAGS=/nologo /EHsc /arch:SSE2 /Ox /W3

$(OUT): $(SRC) $(THIS)
   cl $(SRC) $(CXXFLAGS) /FAscu /Fo$(OBJ_OUT) /Fa$(ASM_OUT) /Fe$(OUT)

clean:
   erase $(OUT) $(ASM_OUT) $(OBJ_OUT)

And yet when I went to see whether any had been generated, I saw that microsoft's compilers had generated the following assembly for that last for loop:

; 34   :       popCount = (pr) ? (popCount+1) : (popCount);
; 35   :       
; 36   :       std::cout << ((pr) ? ('*') : ('_')) << " ";

  00145 68 00 00 00 00   push    OFFSET $SG30347
  0014a 85 d2        test    edx, edx
  0014c 0f 94 c0     sete    al
  0014f f6 d8        neg     al
  00151 1a c0        sbb     al, al
  00153 24 cb        and     al, -53            ; ffffffcbH
  00155 04 5f        add     al, 95         ; 0000005fH
  00157 0f b6 d0     movzx   edx, al
  0015a 52       push    edx
  0015b 68 00 00 00 00   push    OFFSET ?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::cout
  00160 e8 00 00 00 00   call    ??$?6U?$char_traits@D@std@@@std@@YAAAV?$basic_ostream@DU?$char_traits@D@std@@@0@AAV10@D@Z ; std::operator<<<std::char_traits<char> >
  00165 83 c4 08     add     esp, 8
  00168 50       push    eax
  00169 e8 00 00 00 00   call    ??$?6U?$char_traits@D@std@@@std@@YAAAV?$basic_ostream@DU?$char_traits@D@std@@@0@AAV10@PBD@Z ; std::operator<<<std::char_traits<char> >
  0016e 46       inc     esi
  0016f 83 c4 08     add     esp, 8
  00172 83 fe 20     cmp     esi, 32            ; 00000020H
  00175 72 a9        jb  SHORT $LL3@main

For your reference, here are my cpu id strings and compiler version.

PROCESSOR_ARCHITECTURE=x86
PROCESSOR_IDENTIFIER=x86 Family 6 Model 58 Stepping 9, GenuineIntel
PROCESSOR_LEVEL=6
PROCESSOR_REVISION=3a09

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86

If you want particular instructions, don't try to get the compiler to infer them as what it will do is subject to change with version, optimization settings, flags, etc. Instead, use whatever inline assembly capability applies to that compiler, or link a genuine assembly language source file into the result. — Chris Stratton, Dec 01 '12 at 16:30
under what conditions do optimizing c++ compilers normally generate cmov instructions? this is more of an experiment than for production use; I'd like to know how to write c++ that is easy for compilers to optimize for increasing branch prediction performance. — Max DeLiso, Dec 01 '12 at 16:54
It used to be that `cmov` was slower than `cmp`+`jmp` if the branch was highly predictable, so compilers would be right to not use it often. Also, `cmov` created dependencies that cause it to run slower in a tight loop. I'm not sure if this is still the case. Maybe using PGO would encourage the compiler to do so by helping find mis-predicted branches? — Cory Nelson, Dec 01 '12 at 17:23

Cody Gray - on strike · Accepted Answer · 2021-01-23T05:16:32.513

It is extremely difficult, if not downright impossible, to get Microsoft's 32-bit C/C++ compiler to emit CMOVcc instructions.

What you have to remember is that conditional moves were first introduced with the Pentium Pro processor, and while Microsoft had a compiler switch that would tune the generated code for this 6th generation processor (the long-deprecated /G6), they never emitted code that would run exclusively on this processor. The code still needed to run on 5th generation processors (i.e., Pentium and AMD K6), so it couldn't use CMOVcc instructions because those would have generated illegal instruction exceptions. Unlike Intel's compiler, global dynamic dispatching was not (and is still not) implemented.

Also, it is worth noting that no switch was ever introduced to target exclusively 6th generation processors and later. There's no /arch:CMOV or whatever they might call it. Supported values for the /arch switch go straight from IA32 (the lowest common denominator, for which CMOV would be potentially illegal) to SSE. However, the documentation does confirm that, as one might expect, enabling SSE or SSE2 code generation implicitly enables the use of conditional-move instructions and anything else that was introduced before SSE:

In addition to using the SSE and SSE2 instructions, the compiler also uses other instructions that are present on the processor revisions that support SSE and SSE2. An example is the CMOV instruction that first appeared on the Pentium Pro revision of the Intel processors.

Therefore, in order to have any hope of getting the compiler to emit CMOV instructions, you must set /arch:SSE or higher. Nowadays, of course, this is no big deal. You can simply set /arch:SSE or /arch:SSE2 and be safe, since all modern processors support these instruction sets.

But that's only half of the battle. Even when you have the right compiler switches enabled, it is extremely difficult to get MSVC to emit CMOV instructions. Here are two important observations:

MSVC 10 (Visual Studio 2010) and earlier virtually never generate CMOV instructions. I've never seen them in the output, no matter how many variations of source code I've tried. I say "virtually" because there might be some crazy edge case that I missed, but I very much doubt it. None of the optimization flags have any effect on this.

However, MSVC 11 (Visual Studio 2012) introduced significant improvements to the code generator, at least in this aspect. This and later versions of the compiler now seem to be at least aware of the existence of the CMOVcc instructions, and may emit them under the right conditions (i.e., /arch:SSE or later, and use of the conditional operator, as described below).
I've found that the most effective way to cajole the compiler into emitting a CMOV instruction is to use the conditional operator instead of a long-form if-else statement. Although these two constructs should be completely equivalent as far as the code generator is concerned, they are not.

In other words, while you might see the following translated to a branchless CMOVLE instruction:
```
int value = (a < b) ? a : b;
```
you will always get branching code for the following sequence:
```
int value;
if (a < b)    value = a;
else          value = b;
```
At the very least, even if your use of the conditional operator doesn't cause a CMOV instruction (such as on MSVC 10 or earlier), you might still be lucky enough to get branchless code by some other means—e.g., SETcc or clever use of SBB and NEG/NOT/INC/DEC. This is what the disassembly you've shown in the question uses, and although it is not quite as optimal as CMOVcc, it's certainly comparable and the difference is not worth worrying about. (The only other branching instruction is part of the loop.)

If you truly want branchless code (which you often do when hand-optimizing), and you're not having any luck getting the compiler to generate the code you want, you'll need to get more clever in how you write the source code. I've had good luck with writing code that computes the result branchlessly using bitwise or arithmetic operators.

For example, you might wish that the following function generated optimal code:

int Minimum(int a, int b)
{
    return (a < b) ? a : b;
}

You followed rule #2 and used a conditional operator, but if you're using an older version of the compiler, you'll get branching code anyway. Outsmart the compiler using the classic trick:

int Minimum_Optimized(int a, int b)
{
    return (b + ((a - b) & -(a < b)));
}

The resulting object code is not perfectly optimal (it contains a CMP instruction that is redundant since SUB already sets the flags), but it is branchless and will therefore still be significantly faster than the original attempt on random inputs that cause branch prediction to fail.

As another example, imagine that you want to determine whether a 64-bit integer is negative in a 32-bit application. You write the following self-evident code:

bool IsNegative(int64_t value)
{
    return (value < 0);
}

and will find yourself sorely disappointed in the results. GCC and Clang optimize this sensibly, but MSVC spits out a nasty conditional branch. The (non-portable) trick is realizing that the sign bit is in the upper 32 bits, so you can isolate and test that explicitly using bitwise manipulation:

bool IsNegative_Optimized(int64_t value)
{
    return (static_cast<int32_t>((value & 0xFFFFFFFF00000000ULL) >> 32) < 0);
}

In addition, one of the commentators suggests using inline assembly. While this is possible (Microsoft's 32-bit compiler supports inline assembly), it is often a poor choice. Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.

Note that none of these contortions are necessary if you target x86-64. Microsoft's 64-bit C/C++ compiler is significantly more aggressive about using CMOVcc instructions whenever possible, even the older versions. As this blog post explains, the x64 compiler bundled with Visual Studio 2010 contains a number of code-quality improvements, including better identification and use of CMOV instructions.

No special compiler flags or other considerations are necessary here, since all processors that support 64-bit mode support conditional moves. I suppose this is why they were able to get it right for the 64-bit compiler. I also suspect that some of these changes made to the x86-64 compiler in VS 2010 were ported to the x86-32 compiler in VS 2012, explaining why it is at least aware of the existence of CMOV, but it still doesn't use it as aggressively as the 64-bit compiler.

The bottom line is, when targeting x86-64, write the code in the way that makes the most sense. The optimizer actually knows how to do its job!

`(b + ((a - b) & -(a < b)))` also potentially has undefined behaviour if `a-b` overflows. Some modern compilers might then make the assumption that `a-b` can't overflow when optimizing later code. If `a` or `b` are constant (after inlining and constant-propgation), this turns into a simple range-limit that could result in an `if (a > 123456)` being optimized away as always false or always true. In practice, I don't think it will usually be a problem, but modern C++ is unfortunately very far away from being a portable assembly language for targets with sane 2's complement signed integers. — Peter Cordes, Dec 14 '16 at 19:54
Re: inline asm: I [wrote a whole answer](http://stackoverflow.com/questions/3323445/what-is-the-difference-between-asm-and-asm/35959859#35959859) about how bad MSVC-style inline asm is for wrapping single instructions, compared to GNU C syntax (but even then, https://gcc.gnu.org/wiki/DontUseInlineAsm). You might want to link it. BTW, I've never been clear on how it can work for MSVC inline asm to just leave a value in EAX and omit the `return` in a non-void function. Does this disable inlining of the function? Or does MSVC support this idiom when inlining to avoid a store/reload for results? — Peter Cordes, Dec 14 '16 at 20:01
Sure, it potentially has undefined behavior, @Peter. None of this is portable. I've verified that it works correctly on MSVC. It is not one of those Gnu-style compilers that is indistinguishable from an adversary. :-) It might not be the best example, but it is code that I actually use when building for 32-bit on MSVC 2010. I frequently wish there was an actual portable assembly language, but neither C nor C++ are that language. MSVC's inline asm was primarily designed for "escapes", like executing DOS interrupts, back in the day. It wasn't designed for speed, and wasn't extended like GAS. — Cody Gray - on strike, Dec 16 '16 at 09:47
Leaving a value in EAX (or EDX:EAX) at the end of an inline asm block is perfectly legal and valid in MSVC. I think older versions of the compiler (we're talking pre-MSVC 6.0, a very long time ago now) used to issue a warning when you did this, something about execution falls off the end of the function without a return value, but the code generated was still correct in that case. The fact that the warning's been removed suggests to me that this is an explicitly supported idiom. The result never gets stored/reloaded if you write the code correctly. And no, it doesn't disable inlining. — Cody Gray - on strike, Dec 16 '16 at 09:49
Surprisingly, inlining works very well with functions that contain inline asm blocks, *unless* you try to force the function to use the `__fastcall` convention. Then it doesn't get inlined. (`__cdecl` and `__stdcall` both inline fine). This is pretty disappointing, because `__fastcall` would allow the first two parameters to be passed in registers, which would allow you to avoid paying at least some of the load/store penalties for reading parameters. Alas, it is not to be. I've played with tons of different ways of writing the code, and there's just no way around shuffling for the inputs. — Cody Gray - on strike, Dec 16 '16 at 09:51
I think Rust has the potential to be useful as a portable assembly language, but last I heard it didn't really have SIMD support :( Not down to the choice of branching vs. branchless, but for integer stuff if you want popcnt you just write [`var.count_ones()`](https://doc.rust-lang.org/std/primitive.i32.html#method.count_ones), which is a built-in operation for primitive types. Not like C where the portable option is to write a stupid loop and hope the compiler recognizes the idiom. There's also saturating, and wrapping vs. checked add/sub/mul/whatever operations, rotates, etc. — Peter Cordes, Dec 16 '16 at 14:38
What a piece of an answer. Must've taken several hours of investigation to uncover all these findings. Thanks a lot for sharing this, Cody. — alecov, Feb 14 '17 at 15:46
Do you know of any way to get the x64 compiler to *not* emit CMOV in a tight loop? — user541686, Oct 17 '21 at 21:34
@user541686 Possibly... Subtly changing the code should do it. Do you have a repro case? Maybe you should ask a new question about it. Ping me about it here if you do so. — Cody Gray - on strike, Oct 19 '21 at 04:35
I did at some point... it's been some time since I came across this. I do remember massaging it and finding it practically impossible in some particular cases (other cases were easier). I'll ping you if I manage to dig it up. — user541686, Oct 19 '21 at 09:31

Generating CMOV instructions using Microsoft compilers

1 Answers1

Linked