13

In the GCC (version 4.8.2) manual, the following is stated:

-ftree-loop-if-convert-stores:
Attempt to also if-convert conditional jumps containing memory writes. This transformation can be unsafe for multi-threaded programs as it transforms conditional memory writes into unconditional memory writes. For example,

   for (i = 0; i < N; i++)
      if (cond)
        A[i] = expr;

is transformed to

   for (i = 0; i < N; i++)
       A[i] = cond ? expr : A[i];

potentially producing data races.

I wonder, however, if there is a performance gain by using the operator? versus the if statement.

  • In the first piece of code, A[i] is set to expr only if the condition is met. If it is not met, then the code inside the statement is skipped.
  • In the second one, A[i] seems to be written regardless of the condition; the condition only affects the value it is set to.

By using operator?, we are also doing a check; however, we are adding some overhead in the case that the condition is not met. Have I missed something?

djsp
  • 2,174
  • 2
  • 19
  • 40
  • 3
    I presume the second form is faster as it avoids any chance of [branch prediction](http://en.wikipedia.org/wiki/Branch_misprediction) fails. See [this previous question](http://stackoverflow.com/q/11227809/311966) for details. – simonc Dec 23 '13 at 12:00
  • 3
    This is a bit of a grey area, since it mixes the vendor's model of multithreading, code generation, and C. You need to understand all three to make sense of this. – Kerrek SB Dec 23 '13 at 12:01
  • 1
    Just write code that is readable. Do not try to second guest the optimizer. That is a lot better than you or I at achieving that task – Ed Heal Dec 23 '13 at 12:03
  • 1
    Quite worrying: **testing** the two constructions on the same input array, I get `1000000 loops: #1 = 4.428852 sec, #2 = 5.061400 sec`. (Without using the flag, of course.) This needs inspecting the generated assembler... – Jongware Dec 23 '13 at 13:05
  • (On my test) Alas: the first gets converted to `testb $1, (%rdx); je LBB1_5; movb %dil, (%rsi)` and the second to the same code but with the additional write as well, with an extra jump in between. This tells nothing new. – Jongware Dec 23 '13 at 13:21

1 Answers1

6

What is says is that conditional jumps are converted to conditional move instructions, the cmove family of instructions. They improve speed because they do not stall the processor pipeline like jumps do.

With a jump instructions, you don't know in advanced which instructions to load, so a prediction is used and a branch is loaded in the pipeline. If the prediction was correct, all is well, the next instructions are already executing on the pipeline. However, after the jump is evaluated, if the prediction was wrong, all the following instructions already in the pipeline are useless, so the pipeline must be freed, and the correct instructions are loaded. Modern processors contain 16-30 stages of pipe, and a branch mispredictions degrade performance severely. Conditional moves bypass this because they do not insert branches in the program flow.

But does cmove always write?

From Intel x86 Instruction Set Reference:

The CMOVcc instructions check the state of one or more of the status flags in the EFLAGS register [..] and perform a move operation if the flags are in a specified state (or condition). [..] If the condition is not satisfied, a move is not performed and execution continues with the instruction following the CMOVcc instruction.

Edit

Upon further investigating gcc manual, I got confused, because as far as I know the compiler doesn't optimize transforming C code into another C code, but uses internal data structures like Control Flow Graphs so I really don't know what they mean with their example. I suppose they mean the C equivalent of the new flow generated. I am not sure anymore if this optimization is about generating cmoves.

Edit 2

Since cmove operates with registers and not memory, this

if (cond)
  A[i] = expr

cannot generate cmove.

However this

 A[i] = cond ? expr : A[i];

can.

Suppose we have in bx the expr value.

load A[i] into ax
cmp // cond
cmove ax, bx
store ax into &A[i]

So in order to use cmove you have to read A[i] value and write it back if cond if false, which is not equivalent with the if statement, but with the ternary operator.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
bolov
  • 72,283
  • 15
  • 145
  • 224
  • But does `cmove` always *write*? The caveat in this option seems to be that `A[i]` always gets written. – Jongware Dec 23 '13 at 12:20
  • Bear in mind that unless `A[i]` is `volatile`, it's moot in the false case whether or not the move is performed. For non-volatile, the observable behavior in the absence of data races or other UB is the same either way. So when the code generator sees the transformed code with the `?`, it can decide whether or not to use a cmove instruction on that basis. – Steve Jessop Dec 23 '13 at 12:29
  • Relevant to the performance of cmove: http://stackoverflow.com/questions/14131096/why-is-a-conditional-move-not-vulnerable-for-branch-prediction-failure – Steve Jessop Dec 23 '13 at 12:33
  • 1
    What I don't understand, not being familiar with this optimization, is whether the code generator and/or vectorization pass can't just use a cmove anyway when they see `if (cond) A[i] = expr;`. This leads me to wonder whether this optimization flag really does have anything specifically to do with cmove as you say, or whether it's something else that helps "to remove control-flow from the innermost loops in order to improve the ability of the vectorization pass to handle these loops" (which is the stated intent of `-ftree-loop-if-convert`). – Steve Jessop Dec 23 '13 at 12:38
  • @SteveJessop I think you are right, this optimization doesn't seem to be related to `cmove` – bolov Dec 23 '13 at 12:43
  • A compiler might find it easier to efficiently encode the ternary operation into assembly. Gcc targets different cpus -- the entire `cmove` discussion is obsolete for ARM chips. – Jongware Dec 23 '13 at 12:45
  • `cmovcc` can't write to memory anyway. – harold Dec 23 '13 at 12:47
  • @harold Not directly, no, but a register can be set with the value from the memory, and a cmov can be used to set that register or not to another value, and then a that register is loaded into memory. – bolov Dec 23 '13 at 12:51
  • @bolov but then whatever `cmov` does is irrelevant. It may or may not always write to the register, it doesn't matter, you're unconditionally writing that register to memory with a plain old write. – harold Dec 23 '13 at 12:54
  • 1
    @harold: bolov's latest edit shows it *does* matter. With the `if` construction, the memory may not be written, and using `cmov` it *always* needs to be. – Jongware Dec 23 '13 at 13:03
  • @Jongware that's not due to the semantics of `cmov` itself, and you could do exactly the same thing with branches. I interpret his edit as showing precisely that it *doesn't* matter – harold Dec 23 '13 at 13:05
  • 1
    @harold: but the *caveat* is that this flag will transform an optional write into an "always" write. That is entirely consistent with bolov's suggested assembler code. – Jongware Dec 23 '13 at 13:07
  • @Jongware the flag matters, the semantics of `cmov` don't. My point is, the `cmov` is not the central matter here, it's about how the memory write is done. – harold Dec 23 '13 at 13:08
  • Even without cmov, you could emulate it with addition and multiplication. `a*c +b*!c` -- probably more cycles than cmov, but may be less than branch prediction fails. cmov makes that optimization (unconditional write) faster by doing away with arbitrary addion and multiplication. – Yakk - Adam Nevraumont Dec 23 '13 at 13:23