
I have a simple tagged union of values. The values can be either int64_ts or doubles. I am performing addition on these unions, with the caveat that if both arguments hold int64_t values then the result should also hold an int64_t value.

Here is the code:

#include <stdint.h>
union Value {
  int64_t a;
  double b;
};

enum Type { DOUBLE, LONG };

// Value + type.
struct TaggedValue {
  Type type;
  Value value;
};

void add(const TaggedValue& arg1, const TaggedValue& arg2, TaggedValue* out) {
  const Type type1 = arg1.type;
  const Type type2 = arg2.type;
  // If both args are longs then write a long to the output.
  if (type1 == LONG && type2 == LONG) {
    out->value.a = arg1.value.a + arg2.value.a;
    out->type = LONG;
  } else {
    // Convert argument to a double and add it.
    double op1 = type1 == LONG ? (double)arg1.value.a : arg1.value.b; // Why isn't CMOV used?
    double op2 = type2 == LONG ? (double)arg2.value.a : arg2.value.b; // Why isn't CMOV used? 
    out->value.b = op1 + op2;
    out->type = DOUBLE;
  }
}

The output of gcc at -O2 is here: http://goo.gl/uTve18. It is also reproduced below in case the link doesn't work.

add(TaggedValue const&, TaggedValue const&, TaggedValue*):
    cmp DWORD PTR [rdi], 1
    sete    al
    cmp DWORD PTR [rsi], 1
    sete    cl
    je  .L17
    test    al, al
    jne .L18
.L4:
    test    cl, cl
    movsd   xmm1, QWORD PTR [rdi+8]
    jne .L19
.L6:
    movsd   xmm0, QWORD PTR [rsi+8]
    mov DWORD PTR [rdx], 0
    addsd   xmm0, xmm1
    movsd   QWORD PTR [rdx+8], xmm0
    ret
.L17:
    test    al, al
    je  .L4
    mov rax, QWORD PTR [rdi+8]
    add rax, QWORD PTR [rsi+8]
    mov DWORD PTR [rdx], 1
    mov QWORD PTR [rdx+8], rax
    ret
.L18:
    cvtsi2sd    xmm1, QWORD PTR [rdi+8]
    jmp .L6
.L19:
    cvtsi2sd    xmm0, QWORD PTR [rsi+8]
    addsd   xmm0, xmm1
    mov DWORD PTR [rdx], 0
    movsd   QWORD PTR [rdx+8], xmm0
    ret

It produced code with a lot of branches. I know that the input data is pretty random, i.e. it has a random mix of int64_ts and doubles. I'd like to have at least the conversion to a double done with the equivalent of a CMOV instruction. Is there any way I can coax gcc into producing that code? I'd ideally like to run a benchmark on real data to see how the code with a lot of branches compares against one with fewer branches but more expensive CMOV instructions. It might turn out that the code GCC generates by default works better, but I'd like to confirm that. I could write the assembly inline myself, but I'd prefer not to.
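Roughly, the kind of benchmark harness I have in mind is the sketch below (just an outline: make_random_values, the seed, the 50/50 mix and the array size are placeholders, and it reuses the TaggedValue and add definitions from above):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// Fill an array with a random mix of LONGs and DOUBLEs.
static std::vector<TaggedValue> make_random_values(size_t n) {
  std::vector<TaggedValue> v(n);
  std::mt19937_64 rng(42);
  for (auto& tv : v) {
    if (rng() & 1) {
      tv.type = LONG;
      tv.value.a = (int64_t)(rng() % 1000);
    } else {
      tv.type = DOUBLE;
      tv.value.b = (double)(rng() % 1000) * 0.5;
    }
  }
  return v;
}

int main() {
  const size_t n = 1 << 20;
  std::vector<TaggedValue> xs = make_random_values(n);
  std::vector<TaggedValue> ys = make_random_values(n);
  std::vector<TaggedValue> out(n);

  auto t0 = std::chrono::steady_clock::now();
  for (size_t i = 0; i < n; ++i)
    add(xs[i], ys[i], &out[i]);  // swap in a CMOV-based variant here to compare
  auto t1 = std::chrono::steady_clock::now();

  // Consume the results so the loop is not optimized away.
  double check = 0;
  for (const auto& tv : out)
    check += tv.type == LONG ? (double)tv.value.a : tv.value.b;

  std::printf("%.3f ms (check=%f)\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count(),
              check);
}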

The interactive compiler link is a good way to check the assembly. Any suggestions?

Rajiv
  • The CMOV instruction could use more cycles on your platform? It could have something to do with pipelining? Or something else completely. Just because there exist instructions in the instruction set that seem more convenient doesn't actually mean they are better in all cases. – Some programmer dude May 19 '15 at 18:18
  • @JoachimPileborg no doubt that CMOV could be more expensive. I wanted to know if there is a way to coax GCC to output a version with CMOV instructions just so that I could compare. – Rajiv May 19 '15 at 18:20
  • @Rajiv: How do you propose to use `CMOV` for this? Before demanding that the compiler be able to optimize in this specific way, why don't you check that it is even feasible? (Hint: It's not. You would need to unconditionally convert to `double`, then `CMOV` the double-type from the union. x86 has only integer `CMOV`, and `FCMOV` in the obsolete x87 FPU, not in `SSE/AVX`. Transferring the values to integer registers, `CMOV`ing and moving back to `xmm`-registers would be insane) – EOF May 19 '15 at 23:53
  • For the benefit of future generations, this is not only _not_ insane, it actually makes a lot of sense. If as the OP stated the data is unpredictable, then on average expect half to mispredict at 20 cycles/mispredict. This adds an average of 10 cycles to the critical path. Each CMOV adds 1 cycle of latency. Because the FPU is mostly idle, the FPU ops are nearly free. The critical path is around 8 cycles shorter if he calculates all 4 possible results, copies the doubles back to integer registers, and uses CMOV to sort it out. It also consumes fewer BTB resources. (A sketch of this approach appears after these comments.) – icecreamsword Apr 09 '23 at 20:27
  • @icecreamsword: Indeed, branchless could make sense, but integer `cmov` might not be optimal. SIMD compare and bitwise AND or blend might work better, like `pcmpeqq` to get a 0 / -1 mask according to `type1 == LONG`, and use that with `blendvpd` to select the original bit-pattern vs. a `cvtsi2sd` result. (Perhaps do both inputs in one vector, but then you'd have to shuffle to set up for adding them.) And of course you'd do `paddq` and `addsd` in parallel, and blend the result according to both Types being LONG or not. – Peter Cordes May 05 '23 at 18:02
  • Sure, I agree. I am not trying to suggest what the optimal branch-free solution is, only to point out that most branch-free solutions will outperform branching solutions _if_ the OP is correct that the BPU mispredict rate is going to be high. Linus's rant some years ago has a lot of people convinced that branches are always better than CMOV. This was not universally true even then, and the premise of a 2-cycle CMOV on Core 2 no longer holds on modern Intel cores. The better approach will always depend on the input data set. – icecreamsword May 07 '23 at 01:38
  • CMOV only works with "normal" registers, not xmm regs needed for fp ops. There might be some way of generating branchless code using VMASKMOV rather than CMOV. – Chris Dodd May 08 '23 at 07:09
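For illustration, a branchless variant along the lines discussed in the comments above might look like the sketch below. It reuses the TaggedValue definitions from the question; add_branchless and the bit-cast helpers are made-up names, the memcpy calls are only there to move bit patterns between integer and floating-point values without aliasing problems, and whether GCC really lowers the integer selects to CMOV still has to be checked in the generated assembly.

#include <cstdint>
#include <cstring>

// Reinterpret a 64-bit pattern as a double and vice versa
// (typically just a register move at -O2).
static inline uint64_t bits_of(double d) {
  uint64_t u;
  std::memcpy(&u, &d, sizeof u);
  return u;
}

static inline double double_of(uint64_t u) {
  double d;
  std::memcpy(&d, &u, sizeof d);
  return d;
}

void add_branchless(const TaggedValue& arg1, const TaggedValue& arg2,
                    TaggedValue* out) {
  const bool long1 = arg1.type == LONG;
  const bool long2 = arg2.type == LONG;

  // Raw 64-bit patterns of both operands, whatever their type.
  uint64_t raw1, raw2;
  std::memcpy(&raw1, &arg1.value, sizeof raw1);
  std::memcpy(&raw2, &arg2.value, sizeof raw2);

  // Unconditionally compute the "converted to double" pattern as well.
  const uint64_t conv1 = bits_of((double)(int64_t)raw1);
  const uint64_t conv2 = bits_of((double)(int64_t)raw2);

  // Integer selects: these are the CMOV candidates.
  const double op1 = double_of(long1 ? conv1 : raw1);
  const double op2 = double_of(long2 ? conv2 : raw2);

  // Compute both possible results and select at the end.
  const int64_t isum = (int64_t)(raw1 + raw2);  // unsigned add, no overflow UB
  const double  dsum = op1 + op2;

  const bool both_long = long1 && long2;
  const uint64_t result = both_long ? (uint64_t)isum : bits_of(dsum);
  out->type = both_long ? LONG : DOUBLE;
  std::memcpy(&out->value, &result, sizeof result);
}

The dsum path is always computed, but since the FPU would otherwise sit idle here, that extra work is cheap compared to a mispredicted branch, which is the trade-off icecreamsword describes above.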

1 Answer

  1. Try `-mllvm -x86-cmov-converter=false` (note that `-mllvm` passes the option to Clang/LLVM rather than GCC; an example invocation is shown after this list).

  2. Try the `__asm` workaround from https://gitflic.ru/project/erthink/bsearch-try
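For the first suggestion, the option goes through the Clang driver, so the compile line would presumably look something like the following (assuming the option name is still spelled this way in the LLVM version in use):

clang++ -O2 -S -mllvm -x86-cmov-converter=false add.cpp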
