
Recently I was working with an application that had code similar to this:

for (auto x = 0; x < width - 1 - left; ++x)
{
    // store / reset points
    temp = hPoint = 0;
    for (int channel = 0; channel < audioData.size(); channel++)
    {
        if (peakmode) /* for rms of window size */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp += audioData[channel][x * sizeFactor + z + offset];
            }
            hPoint += temp / sizeFactor;
        }
        else /* highest sample in window */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp = audioData[channel][x * sizeFactor + z + offset];
                if (std::fabs(temp) > std::fabs(hPoint))
                    hPoint = temp;
            }
        }
        // ... some other code
    }
    // ... some more code
}

This is inside a graphical render loop, called some 50-100 times / sec with buffers up to 192kHz in multiple channels. So it's a lot of data running through the innermost loops, and profiling showed this was a hotspot.

It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries. It looked something like this:

if ((const float &&)(*((int *)&temp) & ~0x80000000) > (const float &&)(*((int *)&hPoint) & ~0x80000000))
    hPoint = temp;

This gave a 12x reduction in render time, while still producing the same, valid output. Note that all the audio data is sanitized beforehand to contain no NaNs/infs/denormals, and only spans the range [-1, 1].

Are there any corner cases where this optimization will give wrong results - or, why is the standard library function not implemented like this? I presume it has to do with handling of non-normal values?

Edit: the layout of the floating-point model conforms to IEEE, and sizeof(float) == sizeof(int) == 4.

Shaggi
  • it depends on the floating-point number representation. In other words, it is a dirty trick – BЈовић May 06 '14 at 09:08
  • Dirty tricks are exactly what the implementation should be doing for you. They're the implementation - they implement for a specific platform and float representation and all the rest. – Puppy May 06 '14 at 09:13
  • would it help to store fabs(hpoint) when you calculate the update to hpoint? This would save sizeFactor-1 calls to fabs for each channel. – Richard Hodges May 06 '14 at 09:13
  • @RichardHodges: The compiler optimizer probably already hoisted that call out of the loop. – Puppy May 06 '14 at 09:13
  • :-) "probably" is not a word that one should encounter in software engineering. Let's check it. – Richard Hodges May 06 '14 at 09:15
  • If you haven't changed this code too much from the original: Why is the `if(peakmode)` in the loop? Why isn't the loop invariant `x * sizeFactor + offset` computed once in the outer loop? Why is there a division `/sizeFactor` each time `hPoint` is incremented? --- more of the "has the optimizer..." – laune May 06 '14 at 09:19
  • @DeadMG in this case, with all optimizations enabled, the resulting assembly still produced a call to library std::fabs.. RichardHodges true - didn't think of this. It will however require another local that stores the complete value (i need the value with correct sign afterwards) but good idea! – Shaggi May 06 '14 at 09:20
  • If all the audiodata has been normalized beforehand and is in the range `[-1,+1]`: Why aren't computations done using some fixed point (binary) arithmetic? – laune May 06 '14 at 09:22
  • stack-based variables are extremely cheap - they live in the cache for the duration of any reasonable function. You might find that your compiler has an intrinsic equivalent to std::fabs(). Which compiler are you using? – Richard Hodges May 06 '14 at 09:23
  • @laune The loop is a lot bigger than this. Peakmode is constant bool throughout the function, and i'm assuming branch prediction should reduce this problem significantly. The loop invariant is optimized by the compiler. Do you think integer math will be faster in this case (there isn't actually any floating point math here, anyway). – Shaggi May 06 '14 at 09:26
  • @RichardHodges Msvc++ (and llvm) but for this specific case it's visual c++ on 32 bit – Shaggi May 06 '14 at 09:27
  • If you want to get crazy about optimisation, there's this: http://msdn.microsoft.com/en-us/library/bb513994(v=vs.90).aspx I'll also post an answer with some suggestions for optimising the function in a portable way. – Richard Hodges May 06 '14 at 09:29
  • What a coincidence. Just yesterday I did some timings on various abs implementations on my box: http://manwe.flamingdangerzone.com:9999/abs.html – R. Martinho Fernandes May 06 '14 at 09:35
  • @R.MartinhoFernandes Interesting, ill try out fabsf then! – Shaggi May 06 '14 at 09:38
  • Fascinating benchmark results. Presumably the performance difference is explained by your compiler implementing `::fabsf` as an intrinsic. But it does raise the question of why the float overload of `std::abs` doesn't simply call `::fabs`. – Cody Gray - on strike May 06 '14 at 09:46
  • There, I added fabs. @Cody oops, I think I forgot to `-O3` yesterday. How embarrassing. Now all of fabs, fabsf, and abs behave the same (to see that, click the legend to hide/show each series of timings). Compiled with `g++ -O3 -std=c++11 -pthread`, GCC 4.8.2. – R. Martinho Fernandes May 06 '14 at 09:48
  • You're not the first to find `fabs` much slower than it should be. The old Action Quake source code has the comment "microsoft's fabs seems to be ungodly slow" and redefines it to a custom function that just masks off the highest bit, though at some point later this was commented out when portability became more important than performance. See: https://github.com/aq2-tng/aq2-tng/blob/master/source/q_shared.h#L289-L291 and https://github.com/aq2-tng/aq2-tng/blob/master/source/q_shared.c#L226-L231 – Raptor007 Jan 28 '21 at 02:24

6 Answers


Well, you stated that the floating-point model is set to IEEE conforming. Typically, with switches like -ffast-math, the compiler is allowed to ignore IEEE corner cases like NaN, INF and denormals. If the compiler also uses intrinsics, it can probably emit the same code.

BTW, if you're going to assume IEEE format, there's no need to cast back to float before the comparison. The IEEE format is nifty: for all positive finite values, `a < b` if and only if `reinterpret_cast<int_type>(a) < reinterpret_cast<int_type>(b)`.
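As a sketch of that last point: the whole comparison can stay in the integer domain. This assumes 32-bit IEEE-754 floats and finite inputs, and uses std::memcpy instead of pointer casts to avoid strict-aliasing trouble; abs_greater is a name I made up.

```cpp
#include <cstdint>
#include <cstring>

// Compare |a| > |b| without leaving the integer domain.
// Assumes 32-bit IEEE-754 floats and finite inputs; std::memcpy is
// used instead of pointer casts so no strict-aliasing rule is broken.
inline bool abs_greater(float a, float b)
{
    std::uint32_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);
    std::memcpy(&ib, &b, sizeof ib);
    ia &= 0x7FFFFFFFu; // clear the sign bit: bit pattern of |a|
    ib &= 0x7FFFFFFFu;
    // For non-negative finite floats, the ordering of the bit
    // patterns (as unsigned integers) matches the float ordering.
    return ia > ib;
}
```

With optimizations enabled, compilers typically reduce this to a handful of integer instructions, with no floating-point comparison at all.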

MSalters

It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries.

No, you can't, because this violates the strict aliasing rule.

Are there any corner cases where this optimization will give wrong results

Technically, this code results in undefined behavior, so it always gives wrong "results". Not in the sense that the result of the absolute value will always be unexpected or incorrect, but in the sense that you can't possibly reason about what a program does if it has undefined behavior.

or, why is the standard library function not implemented like this?

Your suspicion is justified: handling denormals and other exceptional values is tricky, and the stdlib function needs to take those into account. The other reason is, again, the undefined behavior.

One (non-)solution if you care about performance:

Instead of casts and pointers, you can use a union. Unfortunately, that only works in C, not in C++. It won't result in UB, but it's still not portable (although it will likely work on most, if not all, platforms with IEEE-754 floats).

#include <stdio.h>

int main(void)
{
    /* Type punning through a union is well-defined in C (but not in C++). */
    union {
        float f;
        unsigned u;
    } pun = { .f = -3.14f };

    pun.u &= ~0x80000000u; /* clear the sign bit */

    printf("abs(-pi) = %f\n", pun.f);
    return 0;
}

But, granted, this may or may not be faster than calling fabs(). Only one thing is certain: it won't always be correct.
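In C++, the same sign-bit clearing can be written without UB by copying the bytes through std::memcpy (or std::bit_cast since C++20); compilers generally optimize the copies away entirely. A minimal sketch, assuming 32-bit IEEE-754 floats (fast_fabs is a hypothetical name):

```cpp
#include <cstdint>
#include <cstring>

// Absolute value via bit manipulation, legal in C++.
// Assumes float is 32-bit IEEE-754; the memcpy calls are the
// standard-sanctioned way to reinterpret an object's bytes.
inline float fast_fabs(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0x7FFFFFFFu; // clear the sign bit
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```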

  • I was under the impression that using temporaries would circumvent the aliasing problem, since I at no point have multiple types of reference to the same raw data? I could understand the problem if I was iterating the data with an int* pointer? – Shaggi May 06 '14 at 09:51
  • The union is UB for **exactly** the same reason as the cast. You are accessing an object through an expression whose type is not the type of the object (the object has type float, accessed through an expression of type int). – MSalters May 06 '14 at 10:27
  • @MSalters: Of course it's UB since it depends on the internal representation of `float`. So I don't think it's really relevant what the standard says about this code; it's meant to target a particular implementation (IEEE floats and sane bit-twiddling). – Nate Eldredge May 06 '14 at 14:15
  • @NateEldredge: No, that internal representation is not why it's UB. It's UB because the standard says so. Optimizers are known to aggressively optimize based on that. E.g. "you didn't change a float between these two lines of code, so I don't need to reload a float into a register" - breaks if you tried to modify a float via an int expression. Real-world optimization. – MSalters May 06 '14 at 15:00
  • @MSalters: I apologize. I was thinking of C where the `union` trick is allowed. – Nate Eldredge May 06 '14 at 15:23
  • @MSalters Excuse me, I somehow failed to realize that this question was about C++. Amended my answer accordingly. – The Paramagnetic Croissant May 06 '14 at 18:08

You would expect fabs() to be implemented in hardware - there was an 8087 instruction for it in 1980, after all. You're not going to beat the hardware.

user207421
  • Even if not implemented in hardware, you would expect it to be implemented to be at least as fast as any tricks someone is using. – gnasher729 May 06 '14 at 14:19

How the standard library function implements it is... implementation dependent, so you may find different implementations of the standard library with different performance.

I imagine that you could have problems on platforms where int is not 32 bits; you'd be better off using int32_t (from <cstdint>).

Out of curiosity, was std::abs inlined before? Or is the speedup you observed mainly due to eliminating the function call?

Qassim

Some observations on how refactoring may improve performance:

  • as mentioned, x * sizeFactor + offset can be factored out of the inner loops

  • peakmode is actually a switch changing the function's behaviour - make two functions rather than testing the switch mid-loop. This has two benefits:

    1. easier to maintain
    2. fewer local variables and code paths to get in the way of optimisation.

  • The division of temp by sizeFactor can be deferred until outside the channel loop in the peakmode version.

  • abs(hPoint) can be pre-computed whenever hPoint is updated

  • if audioData is a vector of vectors you may get some performance benefit by taking a reference to audioData[channel] at the start of the body of the channel loop, reducing the array indexing within the z loop down to one dimension.

  • finally, apply whatever specific optimisations for the calculation of fabs you deem fit. Anything you do here will hurt portability so it's a last resort.
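Putting the first, third and fifth points together, the peak-mode pass for one window might be sketched like this (window_average is a hypothetical helper; names mirror the question's code, and deferring the division is mathematically equivalent, though float rounding can differ in the last bit):

```cpp
#include <cstddef>
#include <vector>

// Average of one window across all channels, with the invariant
// index hoisted out of the inner loop, a per-channel reference to
// avoid double indexing, and the division deferred to the end.
float window_average(const std::vector<std::vector<float>>& audioData,
                     int x, int sizeFactor, int offset)
{
    const int base = x * sizeFactor + offset; // loop invariant
    float hPoint = 0;
    for (std::size_t channel = 0; channel < audioData.size(); ++channel)
    {
        const auto& row = audioData[channel]; // one indexing level
        float temp = 0;
        for (int z = 0; z < sizeFactor; ++z)
            temp += row[base + z];
        hPoint += temp;
    }
    return hPoint / sizeFactor; // divide once instead of per channel
}
```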

Richard Hodges
  • Thanks for your suggestions! As for #2, the program has a couple of different 'modes', and if I were to give them all a different function this would result in 32 different (but very similar) functions, which actually made it extremely difficult to maintain. I tested it, and the peakmode switch had no significant impact (presumably due to branch prediction), so halving the amount of functions was welcome. #1, #3 and #4 are true, didn't see that - #5 is a bit more complicated as audioData's type is templated – Shaggi May 06 '14 at 10:07
  • @Shaggi #2 - fair enough. #5: auto& row = audioData[channel]; // now operate on row[] in the loop. – Richard Hodges May 06 '14 at 10:42
  • @Shaggi: If you have 32 similar modes, you might want to investigate if the variety can be captured with a template. It sounds like you'd have 5 boolean parameters. – MSalters May 07 '14 at 09:08
  • @MSalters Close - object with many different states! The thing is, the whole system can easily be modularized and written nicely, but it simply has too much of a performance penalty. Right now it is a compromise between maintainability and performance, though code beauty / readability is left out. Luckily it's only the small, performance-critical part of the application :) – Shaggi May 07 '14 at 11:39
  • @Shaggi: Templates haven't had performance overhead for at least 15 years. (I know, I measured them back in 1999) – MSalters May 07 '14 at 11:44
  • @MSalters I know. But I'm not sure how you would utilize a compile-time concept in dynamic, state-dependant code besides delegating relevant parts into lambda's/functors (which then might as well have been function calls in the first place). – Shaggi May 07 '14 at 12:30

In VS2008, using the following to track the absolute value of hPoint, with hIsNeg remembering whether it is positive or negative, is about twice as fast as using fabs():

int hIsNeg = 0;
...
// Inside loop, replacing
//     if (std::fabs(temp) > std::fabs(hPoint))
//         hPoint = temp;
if (temp < 0)
{
    if (-temp > hPoint)
    {
        hPoint = -temp;
        hIsNeg = 1;
    }
}
else
{
    if (temp > hPoint)
    {
        hPoint = temp;
        hIsNeg = 0;
    }
}
...
// After loop
if (hIsNeg)
    hPoint = -hPoint;
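Wrapped into a self-contained function for illustration (peak_sample is a name I invented; the loop body follows the fragment above):

```cpp
#include <vector>

// Find the sample with the largest magnitude, keeping only its
// absolute value during the loop and restoring the sign at the end.
float peak_sample(const std::vector<float>& window)
{
    float hPoint = 0;
    int hIsNeg = 0;
    for (float temp : window)
    {
        if (temp < 0)
        {
            if (-temp > hPoint) { hPoint = -temp; hIsNeg = 1; }
        }
        else
        {
            if (temp > hPoint) { hPoint = temp; hIsNeg = 0; }
        }
    }
    return hIsNeg ? -hPoint : hPoint;
}
```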
TripeHound