6

I'm trying to optimize the following piece of code, which is a bottleneck in my application. What it does: it takes the double values value1 and value2 and finds the maximum, including a correction factor. If the difference between the two values is greater than 5.0 (the LUT is scaled by a factor of 10), I can just take the max of the two. If the difference is smaller than 5.0, I apply the correction factor from the LUT.

Does anyone have an idea what could be a better style for this piece of code? I don't know where I'm losing time - is it the large number of ifs or the multiplication by 10?

double value1, value2;
// Lookup Table scaled by 10 for (ln(1+exp(-abs(x)))), which is almost 0 for x > 5 and symmetrical around 0. LUT[0] is x=0.0, LUT[40] is x=4.0.
const logValue LUT[50] = { ... };

if (value1 > value2)
{
    if (value1 - value2 >= 5.0)
    {
        return value1;
    }
    else
    {
        return value1 + LUT[(uint8)((value1 - value2) * 10)];
    }
}
else
{
    if (value2 - value1 >= 5.0)
    {
        return value2;
    }
    else
    {
        return value2 + LUT[(uint8)((value2 - value1) * 10)];
    }
}
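
For reference, a minimal sketch of how such a LUT could be filled (this assumes logValue is just a typedef for double; the fill loop is illustrative and not part of the original code):

#include <cmath>

typedef double logValue;            // assumption: logValue is a plain double

logValue LUT[50];

void initLUT()
{
    // LUT[i] holds ln(1 + exp(-x)) for x = i / 10.0, i.e. the table is
    // scaled by a factor of 10 as described above.
    for (int i = 0; i < 50; ++i)
        LUT[i] = std::log(1.0 + std::exp(-i / 10.0));
}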
Brian Tompsett - 汤莱恩
vls
  • Why compute `max()` when you already know the result? – sharptooth Sep 12 '11 at 13:38
  • If this is a *real* bottleneck, GPUs are very good at this (saturation addition and lookup tables). – MSalters Sep 12 '11 at 15:13
  • Rather than truncating the lookup table at 50, think of extending it to the maximum possible value. With an extended LUT, you can get rid of an if test and a branch. Memory is usually cheap, branching usually isn't. – David Hammen Sep 12 '11 at 15:26
  • @David: Memory isn't that cheap; it pushes stuff out of caches. I'd be more inclined to _shrink_ the LUT by using a non-linear key. I'd have to plot the LUT to find a cheap key transform, though. – MSalters Sep 12 '11 at 15:52
  • @MSalters: The lookup table is a performance hack. What is desired here is `log(exp(value1)+exp(value2))`, or `value1+log(1.0+exp(-(value1-value2)))`. This true function is too expensive to call, so an LUT of cached values is used instead. The precision requirements apparently aren't very high; there's no interpolation, the table has a granularity of 1/10, and the table stops at `value1-value2 == 5`. – David Hammen Sep 12 '11 at 16:03
  • @David: of course, that's the usual reason for a LUT. My point is that `LUT2[f(x)]` is often more effective than `LUT[x]` because you can choose `f(x)` such that it changes most rapidly where `LUT[x]` does. – MSalters Sep 13 '11 at 07:49
  • Thank you for all your answers. After profiling, the speed difference using the LUT is only a few percent in comparison to just using the max. Therefore, I've got a huge problem calling the max operation 100M+ times per second which I need to optimize further. Looks like I should open a separate thread for that - but this discussion was really helpful. – vls Sep 13 '11 at 09:30
  • I finally had a chance to put my answer to the test. Not only was it significantly faster than the original code (by nearly 2x), it was also faster than the accepted answer from MSalters. Of course results may vary on a different machine. – Mark Ransom Sep 13 '11 at 13:47
  • My answer has an interesting optimization for calculating the max: `max(a,b) == (a + b + abs(a-b))/2`. – Mark Ransom Sep 13 '11 at 19:32

7 Answers

2

It probably goes down both paths about equally often, which makes the branch hard to predict and causes a lot of pipelining problems for your processor.

Have you tried profiling?

I'd also suggest trying to use the standard library and see if that helps (for example, if it's able to use any processor-specific instructions):

double diff = std::fabs(value1 - value2);
double maxv = std::max(value1, value2);
return (diff >= 5.0) ? maxv : maxv + LUT[(uint8)((diff) * 10)];
Mark B
  • This looks simple, but computationally it is not. You have three branches; two just happen to be hidden. The fewer branches the better when it comes to performance-critical code. – David Hammen Sep 12 '11 at 13:59
  • +1 While it may be that it only looks simpler superficially, it _does_ convey purpose and intent much more concisely too. IFF the processor has smart instructions to deal with fabs and subsequent rounding, it would apply them (even if it required special intrinsics to be used inside std::fabs). However, all bets are off when you 'hide' the `fabs` from the compiler by doing manual branching. (I know that compilers are able to spot such optimizations aggressively; it's just objectively easier for them to do so when you tell the compiler _what_ you need, not _how_.) – sehe Sep 12 '11 at 14:09
  • Isn't `fabs` just a simple bit clearing, at least with IEEE754? Besides, for a decent compiler it would be a single branch anyway: calculate `value1 - value2`, sign flag is now set, branch on it, { swap value1 and value2 if sign set }. SSA should make this optimization fairly painless. – MSalters Sep 12 '11 at 15:09
  • Compilers are smart, but not that smart. At least gcc and clang aren't. With g++ -O3, this implementation is about 20% slower than an implementation that takes advantage of the relation between maximum of two numbers and the absolute difference between the numbers. The advantage of the more lengthy approach drops to about 5% with clang, but it is still present. – David Hammen Sep 12 '11 at 15:17
  • @David Hammen Could you provide a link to your test hook (and the special implementation you refer to)? – Mark B Sep 12 '11 at 16:04
  • @Mark B: See my answer for that alternate implementation. As far as drivers, I did the standard trick of calling the test code (which is compiled optimized) with a simple driver (which is compiled unoptimized). An optimizing compiler will see that the test driver is doing the same thing over and over, and even worse, is dropping the results on the bit floor. – David Hammen Sep 12 '11 at 16:23
2

I'd probably have written the code a bit differently to handle the value2 < value1 case:

if (value2 < value1) std::swap(value1, value2);
assert(value1 <= value2); // Assertion corrected
int diff = int((value2 - value1) * 10.0);
if (diff >= 50) diff = 49; // Integer comparison instead of floating point
return value2 + LUT[diff];
vls
MSalters
  • @Mark: My sentiments exactly. Kill the `assert`. Also, LUT[49] is non-zero, so this does not do what the question asked for. – David Hammen Sep 12 '11 at 16:40
  • @David Hammen, you can always add another element to the LUT and set it to zero, then change the values above to 51/50. – Mark Ransom Sep 12 '11 at 18:25
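
A minimal sketch of the variant Mark Ransom describes in the comment above (the table values and the function name approx_max are placeholders, not code from the answer):

#include <algorithm>

// Placeholder table: in real code the first 50 entries would hold the
// ln(1 + exp(-x)) values from the question; the extra entry at index 50
// stays 0.0, so clamping onto it simply returns the plain maximum.
const double LUT[51] = { 0.693147181 /* , ... */ };

double approx_max(double value1, double value2)
{
    if (value2 < value1) std::swap(value1, value2);
    int diff = int((value2 - value1) * 10.0);
    if (diff > 50) diff = 50;       // clamp onto the zero entry, no extra branch needed
    return value2 + LUT[diff];
}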
2

A couple of minutes of playing with Excel produces an approximation equation that looks like it might have the accuracy you need, so you can do away with the lookup table altogether. You still need one condition to make sure the parameters to the equation remain within the range that it was optimized for.

double diff = fabs(value1 - value2);
double dmax = (value1 + value2 + diff) * 0.5; // same as (min+max+(max-min))/2
if (diff > 5.0)
    return dmax;
return dmax + 4.473865638/(2.611112371+diff) + 0.088190879*diff - 1.015046114;

P.S. I'm not guaranteeing this is faster, only that it's a different enough approach to be worth benchmarking.

P.P.S. It's possible to change the constraints to come up with slightly different constants; there are a lot of variations. Here's another set I did where the difference between your table and the formula will always be less than 0.008, and each value will also be less than the one preceding it.

return dmax + 3.441318133/(2.296924445+diff) + 0.065529678*diff - 0.797081529;
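
For clarity, a self-contained version of this second variant could look like the following sketch (the function name approx_log_sum_exp is just for illustration):

#include <cmath>

double approx_log_sum_exp(double value1, double value2)
{
    double diff = std::fabs(value1 - value2);
    double dmax = (value1 + value2 + diff) * 0.5;   // == max(value1, value2)
    if (diff > 5.0)
        return dmax;
    return dmax + 3.441318133 / (2.296924445 + diff) + 0.065529678 * diff - 0.797081529;
}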

Edit: I tested this code (second formula) with 100 passes against a million random numbers between 0 and 10, along with the original code from the question, MSalters' currently accepted answer, and a brute force implementation `max(value1,value2)+log(1.0+exp(-abs(value1-value2)))`. I tried it on a dual-core AMD Athlon and an Intel quad-core i7 and the results were roughly consistent. Here's a typical run:

  • Original: 1.32 seconds.
  • MSalters: 1.13 seconds.
  • Mine: 0.67 seconds.
  • Brute force: 4.50 seconds.

Processors have gotten unbelievably fast over the years, so fast now that they can do a couple of floating point multiplies and divides faster than they can look up a value in memory. Not only is this approach faster on a modern x86, it's also more accurate; the approximation errors in the equation are much less than the step errors caused by truncating the input for the lookup.

Naturally results can still vary based on your processor and compiler; benchmarking is still required for your own particular target.

Mark Ransom
  • +1 I'm not guaranteeing this is faster, only that it's a different enough approach to be worth benchmarking. – Mooing Duck Sep 12 '11 at 19:12
  • A function approximation approach would be better if you could get rid of that branch. Unfortunately, not knowing the range of `abs(value1-value2)`, all we have is 0 to 5, and you're stuck with the branch. – David Hammen Sep 12 '11 at 23:22
  • @David Hammen, if the processor has branchless booleans you can multiply the formula by `(diff<5.0)` but I'm not sure it would be faster. The range of the approximation formula can also be made wider at the cost of some accuracy. – Mark Ransom Sep 12 '11 at 23:35
  • @Mark Ransom: Yeah, I played with around a few rational function approximations. (I've used rational approximations of `log(exp(x)+exp(y))` before.) It really would help to know the range of the approximation. Or multiplying by `(diff<5.0)`. But that too is pretty expensive, computation-wise. – David Hammen Sep 13 '11 at 00:29
  • This accuracy should be sufficient, too - I'll try tomorrow! – vls Sep 13 '11 at 22:25
  • Thanks, this works perfectly! Here is my implementation (I was able to use the SSE max before calling the function, as it was already in the register, so value1 >= value2): float diff = value1 - value2; if (diff > 5.0) return value1; return value1 + 3.441318133f / (2.296924445f + diff) + 0.065529678f * diff - 0.797081529f; – vls Sep 14 '11 at 08:38
1

I am going to assume that when the function is called, you'll most likely hit the case where you have to use the lookup table, rather than the >= 5.0 case. If so, it's better to guide the compiler towards this.

double maxval = value1;
double difference = (value1 - value2) * 10;
if (difference < 0)
{
    difference = -difference;
    maxval = value2;
}
if (difference < 50)
    return maxval + LUT[(int)difference];
else
    return maxval;

Try this and let me know if that improves your program's performance.

Shahbaz
  • Take a look at the macro `likely` too. I'm not sure if it's standard or only in g++, but if your compiler works with it, you could write `if (likely(difference < 50))` so that the compiler can optimize better. – Shahbaz Sep 12 '11 at 14:58
  • `likely` is a g++ thing. Many other compilers support Profile-Guided Optimization, where they'll gather the same information from actual program runs. – MSalters Sep 12 '11 at 15:18
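
For reference, the `likely` macro mentioned in these comments is usually defined on top of GCC's `__builtin_expect`; a minimal sketch (the non-GCC fallback is an assumption):

#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

// Usage with the code above:
// if (likely(difference < 50))
//     return maxval + LUT[(int)difference];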
0

The only reason this code would be the bottleneck in your application is because you are calling it many many times. Are you sure you need to? Perhaps the algorithm higher up in the code can be changed to use the comparison less?

Ned Batchelder
  • Unfortunately, there is no way to reduce the higher-level calls for this function because of logical restrictions. – vls Sep 12 '11 at 13:44
  • _If_ the logic requires calling the function multiple times, you can cache the result, but that may not apply to your situation. – Mooing Duck Sep 12 '11 at 19:15
  • @MSalters: yes it is, however, it's still a choke-point. He _may_ want to cache the results (outside of the function) for the objects having the function applied to them. – Mooing Duck Sep 13 '11 at 16:09
0

You are computing value1 - value2 quite a few times in your function. Just do it once.

That cast to a uint8_t may also be problematic. As far as performance is concerned, the best integral type to use for a conversion from double to integer is an int, and the best integral type to use as an array index is also an int.

double max_value = value1;
double diff = value1 - value2;
if (diff < 0.0) {
  max_value = value2;
  diff = -diff;
}

if (diff >= 5.0) {
  return max_value;
}
else {
  return max_value + LUT[(int)(diff * 10.0)];
}

Note that the above guarantees that the LUT index will be between 0 (inclusive) and 50 (exclusive). There's no need for a uint8_t here.

Edit
After some playing around with some variations, this is a fairly fast LUT-based approximation to log(exp(value1)+exp(value2)):

#include <stdint.h>

// intptr_t *happens* to be fastest on my machine. YMMV.
typedef intptr_t IndexType;

double log_sum_exp (double value1, double value2, double *LUT) {
  double diff = value1 - value2;
  if (diff < 0.0) {
    value1 = value2;
    diff = -diff;
  }   
  IndexType idx = diff * 10.0;
  if (idx < 50) {
    value1 += LUT[idx];
  }   
  return value1;
}   

The integral type IndexType is one of the keys to speeding things up. I tested with clang and g++, and both indicated that casting to an intptr_t (long on my computer) and using an intptr_t as an index into the LUT is faster than other integral types. It is considerably faster than some types; for example, unsigned long long and uint8_t are incredibly poor choices on my computer.

The type is not just a hint, at least with the compilers I used. Those compilers did exactly what the code told them to do with regard to the conversion from floating point type to integral type, regardless of the optimization level.

Another speed bump results from comparing the integral type to 50 as opposed to comparing the floating point type to 5.0.

One last speed bump: Not all compilers are created equal. On my computer (YMMV), g++ -O3 generates considerably slower code (25% slower with this problem!) than does clang -O3, which in turn generates code that is a bit slower than that generated by clang -O4.

I also played with a rational function approximation approach (similar to Mark Ransom's answer), but the above obviously does not use such an approach.

David Hammen
  • +0.5 for the thought (using int) but I strongly suspect that in practice the compiler will generate the same code; profile, profile, profile – sehe Sep 12 '11 at 14:12
  • any decent compiler will do CSE and cache the `value1 - value2` instead of recomputing it. – Necrolis Sep 12 '11 at 15:09
  • Common Subexpression Elimination (CSE) is a common optimization (pun intended) – MSalters Sep 12 '11 at 15:10
  • @sehe: Or look at the generated assembly code. Both g++ and clang do generate different code for different casts. Which is faster? Profile, profile, profile. – David Hammen Sep 12 '11 at 15:19
0

I've done some very quick tests but please profile the code yourself to verify the effect.

Changing the LUT[] to a static variable got me a 600% speed up (from 3.5s to 0.6s), which is close to the absolute minimum of the test I used (0.4s). See if that works and re-profile to determine if any further optimization is needed.
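
To illustrate the difference, a sketch (the function names and table contents are placeholders, not the actual test code):

double lookupNonStatic(double x)
{
    // Function-local const array: per the disassembly (VC++ 2010), this
    // table is rebuilt on every call (50 assignments each time), which is
    // what made the original version slow.
    const double LUT[50] = { /* ... */ };
    return LUT[(int)(x * 10.0)];    // assumes 0.0 <= x < 5.0
}

double lookupStatic(double x)
{
    // static const: the table is initialized once and reused on later calls.
    static const double LUT[50] = { /* ... */ };
    return LUT[(int)(x * 10.0)];    // assumes 0.0 <= x < 5.0
}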

For reference, I was simply timing the execution of this loop (100 million iterations of the inner loop) in VC++ 2010:

int Counter = 0;

for (double j = 0; j < 10; j += 0.001)
{
    for (double i = 0; i < 10; i += 0.001)
    {
        ++Counter;
        Value1 += TestFunc1(i, j);
    }
}
uesp
  • Why would that be do you think (cost to static speed up)? – Michael Dorgan Sep 12 '11 at 18:12
  • According to the disassembly output (in debug builds) the `LUT[]` is rebuilt on each function call when not a static. You basically get 50 assignments to an array which, of course, kills the performance. I might have assumed that when defined as a `const` the compiler would optimize it away in release builds but it does not appear that way. Note that moving `LUT[]` outside of the function has the same effect on improving performance. – uesp Sep 12 '11 at 19:10
  • It appears that this question: http://stackoverflow.com/questions/1334069/are-const-arrays-declared-within-a-function-stored-on-the-stack explains this behavior of a constant array in a function being stored on the stack. – uesp Sep 12 '11 at 19:12