
sse newb here...

I'm testing two implementations of a routine that has nested logic: a naive implementation and one where I've tried to be clever and remove some of the branching. I'm using 'gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3' on x86 Merom with gcc options '-ffast-math -fomit-frame-pointer -msseregparm -mfpmath=sse -msse2'. Code follows:

#define math_sign(a) ( (a) < .0f ? -1.f : +1.f )

inline float math_interp_clamp(float a, float slope, float target)
{
#if 0
    // 5 instr, 1 branch
    float b = a + slope;
    return slope > 0.f ? (b > target ? target : b) : (b < target ? target : b);
#else
    // 19 instr
    float b = a + slope;
    return ( b - target ) *  math_sign( slope ) > 0.f ? target : b;
#endif
}

With the #if set to 1 (the branchy version) I get:

math_interp_clamp:
.LFB505:
    .cfi_startproc
    comiss  .LC7, %xmm1
    addss   %xmm1, %xmm0
    jbe .L44
    minss   %xmm0, %xmm2
    movaps  %xmm2, %xmm0
    ret
.L44:
    maxss   %xmm0, %xmm2
    movaps  %xmm2, %xmm0
    ret
    .cfi_endproc

With the #if left at 0 (the "clever" branchless version) I get:

math_interp_clamp:
.LFB505:
    .cfi_startproc
    xorps   %xmm5, %xmm5
    addss   %xmm1, %xmm0
    movss   .LC3, %xmm4
    cmpltss %xmm5, %xmm1
    movss   .LC2, %xmm6
    movaps  %xmm0, %xmm3
    andps   %xmm1, %xmm4
    andnps  %xmm6, %xmm1
    subss   %xmm2, %xmm3
    orps    %xmm4, %xmm1
    mulss   %xmm1, %xmm3
    movaps  %xmm5, %xmm1
    cmpltss %xmm3, %xmm1
    movaps  %xmm2, %xmm3
    movaps  %xmm1, %xmm2
    andps   %xmm1, %xmm3
    andnps  %xmm0, %xmm2
    orps    %xmm3, %xmm2
    movaps  %xmm2, %xmm0
    ret
    .cfi_endproc

I have not actually timed the generated code, but on the basis of cycle-count I can't imagine those 19 instructions being faster than a mere branch... How ruthless should I be in avoiding branches, or am I using gcc wrong?
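
For reference, this is the sort of minimal harness I had in mind (untested sketch - RUNS, the seed and the alternating-slope pattern are made up, it assumes math_interp_clamp is defined above in the same file, and older glibc needs -lrt for clock_gettime):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RUNS (1 << 22)

static float slopes[RUNS];

int main(void)
{
    struct timespec t0, t1;
    float acc = 0.f;
    double secs;
    int i;

    // pre-generate slopes so rand() stays out of the timed loop;
    // a random sign should give the branch predictor no pattern to learn
    srand(1234);
    for (i = 0; i < RUNS; ++i)
        slopes[i] = (rand() & 1) ? +0.25f : -0.25f;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < RUNS; ++i)
        acc = math_interp_clamp(acc, slopes[i], 1.f);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("acc=%f  %.3f s\n", acc, secs);  // use acc so the loop isn't optimised away
    return 0;
}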

Links to a good timing-howto or sse-tutorial graciously accepted.

  • This is certainly *not* [ladder logic](http://en.wikipedia.org/wiki/Ladder_logic). – Jonathon Reinhart Apr 22 '14 at 05:16
  • Well, ok - it's not for a PLC. There are 2 tiers of logic in there though. –  Apr 22 '14 at 05:20
  • I would benchmark this, using some "realistic" input data. With today's heavily-pipelined processors, a branch misprediction is *very* expensive. Your branchless code may still outperform the code with a branch. See [this question](http://stackoverflow.com/questions/11227809), for example. – Jonathon Reinhart Apr 22 '14 at 05:29
  • Ah. Yes. I read now that my pipeline is 14 stages - that could upset things a bit, I suppose. Presently my 'inner loop' is only something like 100 instructions long. –  Apr 22 '14 at 05:35
  • Branching is not expensive if the branch predictor has good odds of guessing right. This cannot possibly be guessed at from your code snippet, it depends on the actual values of *slope*. Profiling is required to find out if you are ahead. We can't do this for you. – Hans Passant Apr 22 '14 at 06:41
  • Given that I've got a function here that I'm calling from different places using different state I can't imagine the branch predictor has any hope of being right. –  Apr 22 '14 at 07:21
  • @orthopteroid You'd be surprised! But still, a benchmark is the only way to answer this. – Jonathon Reinhart Apr 22 '14 at 23:14

1 Answer

gcc 4.9.2 compiles the branchless version to 16 instructions. Or 19 like yours with -m32 -msse but not -m32 -msse2. There's a decent amount of parallelism in those instructions, but unfortunately FP logicals (like orps) can only run on a single port in current Intel designs. (Compared to all 3 vector ALU ports being able to accept por uops.)

gcc compiles your "naive" version branchlessly with -O3 -ffast-math. (Without -ffast-math, it still uses a branch. I haven't checked the bit ops to see if the results are identical even with NaNs.) There may be a way to write your naive version so that gcc can compile it branchlessly even without -ffast-math. Maybe by using !(b < target) instead of b > target, so the NaN handling is the same? IDK. (Sketch after the asm below.)

math_interp_clamp:  # gcc 4.9.2 -O3 -ffast-math
.LFB0:
    addss   %xmm1, %xmm0
    movaps  %xmm0, %xmm3
    maxss   %xmm2, %xmm3
    minss   %xmm0, %xmm2
    pxor    %xmm0, %xmm0
    cmpltss %xmm1, %xmm0
    andps   %xmm0, %xmm2
    andnps  %xmm3, %xmm0
    orps    %xmm2, %xmm0
    ret
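
Spelling that out, the rewrite I have in mind would be something like this (hypothetical and untested - I haven't checked whether it actually gets gcc to drop the branch without -ffast-math, or whether the NaN behaviour really matches what minss/maxss would give):

inline float math_interp_clamp2(float a, float slope, float target)
{
    // same selects as the naive version, but written with negated
    // opposite comparisons so an unordered (NaN) compare picks target
    float b = a + slope;
    return slope > 0.f ? (!(b < target) ? target : b)
                       : (!(b > target) ? target : b);
}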

This is likely a win, unless the branch is very predictable (i.e. values almost never need clamping). Your best bet is -fprofile-generate / -fprofile-use, and letting the compiler decide.
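
i.e. something like this (a sketch of the PGO workflow; the file and binary names are placeholders):

gcc -O3 -ffast-math -fprofile-generate bench.c -o bench
./bench        # run on realistic inputs; writes *.gcda profile data
gcc -O3 -ffast-math -fprofile-use bench.c -o bench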

Out of curiosity, I looked at the branchless version with x87 (-m32 -O3). Even though x87 has fcmovbe, it still compiles to 19 instructions, because x87 needs extra instructions to pop the stack (and function args don't start in registers).

There is no cmov based on flags for xmm registers. Packed cmov is done with andps / pand or blendv based on a mask register.
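
Written with intrinsics, that select idiom looks like this (a sketch; select_ps is my name for it, not a standard intrinsic):

#include <xmmintrin.h>

// branchless select: a where the mask bits are set, b where clear.
// mask is all-ones / all-zeros per element, e.g. from _mm_cmplt_ps.
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a),
                     _mm_andnot_ps(mask, b));  // andnot computes (~mask) & b
}

With SSE4.1, _mm_blendv_ps(b, a, mask) (from smmintrin.h) does the same selection in a single instruction, keyed off the sign bit of each mask element.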

Peter Cordes
  • Thanks very much for sharing. Perhaps an -O3 in the right place would've helped. I didn't have it globally - it seemed to conflict with the multithreaded use of libsdl, so I opted for an O3-equivalent pragma in specific sections. FYI the code ended up [here](https://github.com/orthopteroid/wii-synth/blob/master/demo.c). –  Aug 03 '15 at 23:41
  • @orthopteroid: `-O3` conflicted with multithreading? Maybe you need memory barriers in your code, so it doesn't break with some future smarter compiler. I think `-O3` by itself shouldn't do anything that affects correctness (for code that's correct in the first place). – Peter Cordes Aug 04 '15 at 01:32