
Just tell me which one is faster: sub or mul?

My target platform is x86, with the FPU and SSE.

example:

'LerpColorSolution1' uses multiply.

'LerpColorSolution2' uses subtract.

Which is faster?

void LerpColorSolution1(const float* a, const float* b, float alpha, float* out)
{
    out[0] = a[0] + (b[0] - a[0]) * alpha;
    out[1] = a[1] + (b[1] - a[1]) * alpha;
    out[2] = a[2] + (b[2] - a[2]) * alpha;
    out[3] = a[3] + (b[3] - a[3]) * alpha;
}

void LerpColorSolution2(const float* a, const float* b, float alpha, float* out)
{
    float f = 1.0f - alpha;
    out[0] = a[0]*f + b[0] * alpha;
    out[1] = a[1]*f + b[1] * alpha;
    out[2] = a[2]*f + b[2] * alpha;
    out[3] = a[3]*f + b[3] * alpha;
}

Thanks to all ;)

UfnCod3r
  • What do you mean by "faster"? – Kerrek SB Sep 07 '13 at 18:38
  • So have you tried to search online? It's pretty well documented which is faster. I'll leave the answer as an exercise to the reader. :) – Mysticial Sep 07 '13 at 18:39
  • Potentially useful: http://instlatx64.atw.hu/ – harold Sep 07 '13 at 18:46
  • "faster" is not a meaningful term at the instruction level of a modern CPU; you need to be more specific. Are you interested in throughput or latency? What is the mix of instructions that will be executing in the context surrounding the operation? – Stephen Canon Sep 07 '13 at 18:46
  • Now that you have actual code, it becomes much easier: try both. – harold Sep 07 '13 at 18:49
  • If you really want to go hardcore, you might want to try a combination of both that balances the # of add/subs with multiplies. Then throw in some FMAs. :) – Mysticial Sep 07 '13 at 18:51
  • @Mysticial, have you managed to try your [FLOPS tests](http://stackoverflow.com/questions/8389648/how-to-achieve-4-flops-per-cycle) with FMA/Haswell yet? – Z boson Sep 07 '13 at 19:11
  • A non-trivial question, for IEEE floating-point. A "raw" multiply, on modern hardware, is usually 4-16 times slower than a "raw" add, but floating-point multiply requires essentially no denormalizing and at most one step of renormalizing, while an add can require multi-step denormalizing and renormalizing steps. It's quite possible for the worst-case add to be several times the multiply (which is pretty much always the same speed regardless). – Hot Licks Sep 07 '13 at 19:16
  • @Mysticial - Where is it documented? – Hot Licks Sep 07 '13 at 19:19
  • @HotLicks: there are no "worst cases" for (non-denormal) floating-point multiplies or adds on recent hardware; the performance is not data dependent. In addition, they generally have roughly comparable latency and throughput. – Stephen Canon Sep 07 '13 at 19:24
  • @StephenCanon even denormals, on recent hardware (SB and newer). – harold Sep 07 '13 at 19:29
  • @redrum Not yet. I still don't have any FMA machine. :( My FLOPS test has FMA4 for Bulldozer, but that was because I did it while I was still in school and had access to such a machine. – Mysticial Sep 07 '13 at 19:32
  • @harold: there are still some denormal stalls on SNB, only some of them were eliminated. – Stephen Canon Sep 07 '13 at 19:33
  • @HotLicks I was being half serious and half sarcastic. But Agner Fog's tables show that (AFAICT) multiplication is never worse than a subtract in both latency and throughput on all the processors he tested. – Mysticial Sep 07 '13 at 19:37

1 Answer


Just for fun: assuming that you (or your compiler) vectorize both of your approaches (because of course you would if you're chasing performance), and you're targeting a recent x86 processor...

A direct translation of "LerpColorSolution1" into AVX instructions is as follows:

VSUBPS  dst,   b,     a        // b[] - a[]
VSHUFPS alpha, alpha, alpha, 0 // splat alpha
VMULPS  dst,   alpha, dst      // alpha*(b[] - a[])
VADDPS  dst,   a,     dst      // a[] + alpha*(b[] - a[])

The long latency chain for this sequence is sub-mul-add, which has a total latency of 3+5+3 = 11 cycles on most recent Intel processors. Throughput (assuming that you do nothing but these operations) is limited by port 1 utilization, with a theoretical peak of one LERP every two cycles. (I'm intentionally overlooking load/store traffic and focusing solely on the mathematical operation being performed here).
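For reference, a hypothetical intrinsics version of the first approach (sketched with SSE rather than AVX so it builds on any x86-64 toolchain without extra flags; the sub-mul-add dependency chain is the same, and the helper name is my own):

```c
#include <immintrin.h>

/* Sketch of LerpColorSolution1 with SSE intrinsics (hypothetical helper name).
   The sub -> mul -> add chain mirrors the VSUBPS/VMULPS/VADDPS sequence. */
void LerpColorSolution1_SSE(const float* a, const float* b, float alpha, float* out)
{
    __m128 va     = _mm_loadu_ps(a);
    __m128 vb     = _mm_loadu_ps(b);
    __m128 valpha = _mm_set1_ps(alpha);                     /* splat alpha      */
    __m128 d      = _mm_sub_ps(vb, va);                     /* b[] - a[]        */
    __m128 r      = _mm_add_ps(va, _mm_mul_ps(valpha, d)); /* a[] + alpha*(b[]-a[]) */
    _mm_storeu_ps(out, r);
}
```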

If we look at your "LerpColorSolution2":

VSHUFPS alpha, alpha, alpha, 0 // splat alpha
VSUBPS  dst,   one,   alpha    // 1.0f - alpha, assumes "1.0f" kept in reg.
VMULPS  tmp,   alpha, b        // alpha*b[]
VMULPS  dst,   dst,   a        // (1-alpha)*a[]
VADDPS  dst,   dst,   tmp      // (1-alpha)*a[] + alpha*b[]

Now the long latency chain is shuffle-sub-mul-add, which has a total latency of 1+3+5+3 = 12 cycles; throughput is now limited by ports 0 and 1, but still has a peak of one LERP every two cycles. You need to retire one additional µop for each LERP operation, which may make throughput slightly slower depending on the surrounding context.
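An equivalent intrinsics sketch of the second approach (again SSE, hypothetical helper name); note the extra subtract and second multiply relative to the first version:

```c
#include <immintrin.h>

/* Sketch of LerpColorSolution2 with SSE intrinsics (hypothetical helper name):
   (1-alpha)*a[] + alpha*b[], one extra subtract and multiply vs. solution 1. */
void LerpColorSolution2_SSE(const float* a, const float* b, float alpha, float* out)
{
    __m128 valpha = _mm_set1_ps(alpha);                    /* splat alpha */
    __m128 vf     = _mm_sub_ps(_mm_set1_ps(1.0f), valpha); /* 1.0f - alpha */
    __m128 r      = _mm_add_ps(_mm_mul_ps(_mm_loadu_ps(a), vf),
                               _mm_mul_ps(_mm_loadu_ps(b), valpha));
    _mm_storeu_ps(out, r);
}
```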

So your first solution is slightly better (which isn't surprising; even without this level of analysis, the rough guideline "fewer operations is better" is a good rule of thumb).


Haswell tilts things significantly in favor of the first solution; using FMA it requires only one µop on each of ports 0, 1, and 5, allowing for a theoretical throughput of one LERP per cycle. While FMA also improves solution 2, it still requires four µops, including three that need to execute on port 0 or 1. This limits solution 2 to a theoretical peak of one LERP every 1.5 cycles -- 50% slower than solution 1.

Stephen Canon
  • In order to get the compiler to make these optimizations, it probably needs additional knowledge about alignment and (lack of) aliasing. Having the function be `static` (so it can be inlined in loops) and using the `restrict` keyword correctly are probably needed to get decent results from the compiler. – R.. GitHub STOP HELPING ICE Sep 07 '13 at 19:43
  • What @R.. says is absolutely right; alignment may not be required, but you should at least add `restrict` to the output pointer if you want the compiler to vectorize for you. If you are going to vectorize yourself, it would be better to simply make the functions take `__m128` arguments instead of pointers. – Stephen Canon Sep 07 '13 at 19:46
  • Well if the function is `static` and `out` either came from a `restrict` source in the caller, or was newly allocated by `malloc`, then further use of the `restrict` keyword is not needed for the compiler to assume no aliasing. But adding `restrict` to `out` would be best to make sure the compiler optimizes it in as many cases as possible. – R.. GitHub STOP HELPING ICE Sep 07 '13 at 20:00
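Putting the comments' advice together, a `restrict`-qualified variant (hypothetical, C99) might look like:

```c
/* Hypothetical restrict-qualified variant (C99): the restrict qualifiers
   promise the compiler that out does not alias a or b, which is what it
   needs to prove before it can safely vectorize the four stores. */
static void LerpColorRestrict(const float* restrict a, const float* restrict b,
                              float alpha, float* restrict out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + (b[i] - a[i]) * alpha;
}
```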