3

As I understand it, a cmp instruction will set some of the bits in your flags register. You can then use instructions like jle, jnp, etc. to branch based on those.

What I am wondering is how you can recover an integral value from a comparison.

Example: The following is valid c syntax

y = x[a >= 13];

So a is compared with 13 to get a true or false which is interpreted as 1 or 0 respectively. However, that 1 or 0 must be fed into the array access as an integer. What will the compiler do?

Some things that I can think of are:

Do the comparison and then branch to either x[0] or x[1]

Do the comparison and then branch to do either tmp = 0 or tmp = 1 then do x[tmp]

Maybe do some fancy logic on the flags (not sure if there are instructions to access the flags directly)

I've tried to look at what gcc spits out for this code example but it's impossible to pick out the logic from within all of the extra junk it throws in.

I am working on a compiler so any suggestions would be appreciated.

Chuck
  • 431
  • 4
  • 10

1 Answers1

6

There are basically three ways that this can be done. I'll go over them one at a time.

One way to do it is basically what you describe in the question: do the comparison, and then branch to code that separately implements both possibilities. For example:

    cmp  [a], 13                     ; compare 'a' to 13, setting flags like subtraction
    jge  GreaterThanOrEqual          ; jump if 'a' >= 13, otherwise fall through

    mov  eax, [x * 0 * sizeof(x[0])] ; EAX = x[0]
    jmp  Next                        ; EAX now loaded with value, so do unconditional jump

GreaterThanOrEqual:
    mov  eax, [x * 1 * sizeof(x[0])] ; EAX = x[1]
                                     ; EAX now loaded with value; fall through

Next:
    mov  [y], eax                    ; store value of EAX in 'y'

Normally, a compiler would try and keep more values in registers, but this should give you an idea of the basic logic. It does the comparison, and either branches to an instruction that reads/loads x[1] or falls through to an instruction that reads/loads x[0]. Then, it moves onto an instruction that stores that value into y.

You should be able to see that this is relatively inefficient because of all the branching that's required. As such, optimizing compilers won't generate code like this, especially in the simple case where you have a basic ternary expression:

(a >= 13) ? 1 : 0

or even:

(a >= 13) ? 125 : -8

There are bit-twiddling tricks that can be used to do this comparison and get the corresponding integer without ever having to do a branch.

This brings us to the second way of doing it, which is to use the SETcc instruction. The cc part stands for "condition code", and all of the condition codes are the same as those for the conditional-jump instructions. (In fact, you could write all of the conditional-jump instructions as Jcc.) For example, jge means "jump if greater-than-or-equal"; similarly, setge means "set if greater-than-or-equal". Simple.

The trick about SETcc is that it sets a BYTE-sized register, which basically means AL, CL, DL, or BL (there are more options; you could set the high byte of one of those registers, and/or in 64-bit long mode there are more options, but these are the basic choices for operands).

Here is an example of the code implementing this strategy:

xor   edx, edx        ; clear EDX
cmp   [a], 13         ; compare 'a' to 13, setting flags like subtraction
setge dl              ; set DL to 1 if greater-than-or-equal, or 0 otherwise
mov   eax, [x * edx * sizeof(x[0])]
mov   [y], eax

Cool, right? Branch eliminated. The required 0 or 1 is loaded directly into DL, and then this is used as part of the load (MOV instruction).

The only thing slightly confusing here is that you need to know that DL is the low byte of the full 32-bit EDX register. This is why we needed to pre-clear the full EDX, since setge dl only affected the low byte, but we want the full EDX to be either 0 or 1. It turns out that pre-zeroing the full register is the most optimal way of doing this on all processors, but there are other ways, like using MOVZX after the SETcc instruction. The linked answer goes into great detail about that, so I won't belabor it here. The key point is just that SETcc only sets the low byte of a register, but subsequent instructions require the entire 32-bit register to have the value, so you need to eliminate garbage in the upper bytes.

Anyway, this is the code that compilers will generate 99% of the time when you write something like y = x[a >= 13]. The SETcc instruction gives you a way to set a byte according to the status of one or more flags, just like you could branch on the flags. This would basically be what you were thinking of an instruction that allows accessing the flags directly.

This implements the logic for

(a >= 13) ? 1 : 0

but what if you wanted to do

(a >= 13) ? 125 : -8

like I mentioned earlier? Well, you still use the SETcc instruction, but you do a little bit of fancy bit-twiddling afterwards to "fix up" the resulting 0 or 1 to the values you actually want. For example:

xor   edx, edx
cmp   [a], 13
setge dl
dec   edx
and   dl, 123
add   edx, 125
; do whatever with EDX

This works for just about any binary choice (two possible values, depending on a condition), and optimizing compilers are smart enough to work this out. Still branchless code; very cool.

There is a third way that this could be implemented, but it is conceptually very similar to the second way that we've just been talking about. It uses a conditional move instruction, which is just another way of doing a branchless set based on the status of flags. The conditional move instruction is CMOVcc, where cc again refers to "condition code", exactly as it has in the previous examples. The CMOVcc instruction was introduced with the Pentium Pro circa 1995, and has been in all processors since (well, not the Pentium MMX, but Pentium II and later), so essentially everything that you would ever see today.

The code is very similar, except that it's—as the name suggests—a conditional move, so a bit more preliminary setup is required. Specifically, you need to get the candidate values loaded into registers, so that you can select the right one:

xor    edx, edx    ; EDX = 0
mov    eax, 1      ; EAX = 1
cmp    [a], 13     ; compare 'a' to 13 and set flags
cmovge edx, eax    ; EDX = (a >= 13 ? EAX : EDX)
mov    eax, [x * edx * sizeof(x[0])]
mov    [y], eax

Notice that the move of EAX into EDX is conditional—it happens only if the flags indicate the condition ge (greater-than-or-equal-to). So, it works out to the basic C ternary operation, as noted in the comment to the right of the instruction. If the flags indicate ge, then EAX is moved into EDX. Otherwise, nothing is moved, and EDX keeps its original value.

Note that, although certain compilers (Intel's compiler specifically, known as ICC) will prefer CMOV instructions over SET instructions, this has no advantage over the previous implementation that we saw earlier with SETGE. In fact, it is really sub-optimal.

When CMOV really comes into its own is to allow you to eliminate that bit-twiddling code required to get values other than the good old 0 or 1. For example:

mov    edx, -8     ; EDX = -8
mov    eax, 125    ; EAX = 125
cmp    [a], 13     ; compare 'a' to 13 and set flags
cmovge edx, eax    ; EDX = (a >= 13 ? EAX : EDX)
; do whatever with EDX

This is now fewer instructions, because the correct value was moved directly into the EDX register, as opposed to setting either 0 or 1 and then having to manipulate it into the values we want. Compilers will therefore use CMOV instructions (when targeting processors that support them, as already mentioned) to implement more complex logic like

(a >= 13) ? 125 : -8

even though they could do them using one of the other approaches. You also need conditional moves when the operands on either side of the condition are not compile-time constants (i.e., they are values in registers, known only at run-time).

Does that help? :-)

I've tried to look at what gcc spits out for this code example but it's impossible to pick out the logic from within all of the extra junk it throws in.

Yeah. I have a couple of hints for you:

  1. Pare down your code to a very simple function that only does what you want to study. You'll want to take an input as a parameter (so that the optimizer can't trivially fold a constant), and you'll want to return the output from the function. For example:

    int Foo(int a)
    {
        return a >= 13;
    }
    

    Returning a bool would have also worked here. If you'd used a conditional operator to return something other than 0 or 1, you'd need to return an int, of course.

    Either way, now you can see exactly what assembly instructions the compiler is generating to implement this, without any other noise. Make sure that you've enabled optimizations; looking at debugging code is not instructive and very noisy.

  2. Make sure that you are asking GCC to generate assembly listings using the Intel/MASM format, which is much easier to read (in my opinion, at least) than its default format, the GAS/AT&T syntax. All of my assembly code examples above are written using Intel syntax. The required incantation would be:

    gcc -S -masm=intel MyFile.c
    

    where -S generates an assembly listing for the input source code file, and -masm=intel switches the assembly-listing syntax format to Intel style.

  3. Use a nice tool like the Godbolt Compiler Explorer, which automates all of this to substantially decrease the turn-around time. As another bonus, it color-codes the assembly instructions to match with the lines of C code in the original source.

    Here is an example of what you'd use to study this. The original source is on the far left. The middle pane shows GCC 7.1's assembly output for a modern processor, which supports CMOV instructions. The far right pane shows GCC 7.1's assembly output for a very old processor that does not support CMOV instructions. Cool, right? And you can easily manipulate the compiler switches and watch how the output changes. For example, if you do -m64 (64-bit) instead of -m32 (32-bit), then you'll see that the parameter is passed in a register (EDI), rather than being passed on the stack and having to be loaded into a register as the very first instruction in the function.

Cody Gray - on strike
  • 239,200
  • 50
  • 490
  • 574
  • Thank you, this was helpful. I've never seen compare codes used in this way, only ever for jump instructions. Is there a comprehensive list of instructions which can be appended with compare codes? I've found the following references for [jump](https://c9x.me/x86/html/file_module_x86_id_146.html), [cmove](http://x86.renejeschke.de/html/file_module_x86_id_34.html), and [set](http://c9x.me/x86/html/file_module_x86_id_288.html). – Chuck Jul 19 '17 at 15:28
  • 1
    Those are the big three that use condition codes, @Chuck. There are other instructions like [LOOP](http://x86.renejeschke.de/html/file_module_x86_id_161.html) and the [REP prefix](http://x86.renejeschke.de/html/file_module_x86_id_279.html) that use a *subset* of the condition codes, but only a very small subset. And `LOOP` isn't even used anymore—it stopped being fast around the 286. With `Jcc`, `SETcc`, and `CMOVcc`, you can do everything that needs doing. – Cody Gray - on strike Jul 19 '17 at 16:55
  • I don't think it is totally clear that the `cmovcc` variants are strictly worse than the `setcc` instructions: the `setcc` variants have no less than two writes to `dl` followed immediately by reads from `edx` which is a partial register stall on on platforms up to Haswell (I think?). The `cmovcc` avoids this by always operating on 32-bit registers. So there are probably a range of machines where it is better. It also seems like it is fewer instructions (both of your excerpts have 6, but the `cmovcc` example seems to have extra stuff like the write to `[y]` at the end?). – BeeOnRope Jul 19 '17 at 20:40
  • On recent archs, the partial register stalls have (mostly?) disappeared, first only implying an extra uop, then more or less entirely without even needing the new uop, but `cmov` has also dropped down to 1 cycle/1 uop, so they are roughly tried when the supporting instructions have a similar close (but since `cmov` is more powerful you'll often need fewer pre/post instructions as you point out). – BeeOnRope Jul 19 '17 at 20:42
  • There's no partial register stall here. The `dl` register is pre-zeroed with a `XOR` instruction. All covered quite well in [Peter's epic answer that I linked](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and). There are two versions of the `SETcc` and `CMOVcc` code samples. The first one in each category shows selecting a 0 or 1; the second one in each category shows selecting between two other integer values. `CMOVcc` is probably more optimal in the latter case, because it avoids all the bit-twiddling. – Cody Gray - on strike Jul 19 '17 at 20:45