20

There are various tricks to replace std::sqrt (see Timing Square Root) and some for std::exp (see Using Faster Exponential Approximation), but I can find nothing to replace std::log.

It's part of loops in my program and it's called multiple times; while exp and sqrt were optimized, Intel VTune now suggests that I optimize std::log. After that, it seems that only my design choices will be limiting.

For now I use a 3rd-order Taylor approximation of ln(1+x) with x between -0.5 and +0.5 (90% of the cases, for a max error of 4%) and fall back to std::log otherwise. This gave me a 15% speed-up.
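For concreteness, the scheme described above might look like the following sketch (my reconstruction from the description, not the asker's actual code; the function name and exact threshold handling are assumptions):

```cpp
#include <cmath>

// Sketch of the approach described: 3rd-order Taylor series of ln(1+f)
// around f = 0, with a fallback to std::log outside [-0.5, 0.5].
inline float log_taylor3(float x)
{
    float f = x - 1.0f;
    if (f >= -0.5f && f <= 0.5f)
        // ln(1+f) ~= f - f^2/2 + f^3/3  (worst relative error ~4% at f = -0.5)
        return f * (1.0f + f * (-0.5f + f * (1.0f / 3.0f)));
    return std::log(x); // fall back to the accurate library function
}
```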

Malady
qwark
  • Two upvotes after eight minutes for a blatantly off-topic question – Lightness Races in Orbit Oct 02 '16 at 20:35
  • Ahh yes - the accuracy Vs performance question - but without stating what accuracy would be acceptable, or what was tried I don't think you'll get an 'answer' – UKMonkey Oct 02 '16 at 20:37
  • Float precision would be good enough. I tried to start from a log2 and convert back, but very fast log2 implementations just output an int, resulting in a very poor approximation. Also tried to use the fact that ln(x) is the derivative of t→x^t at t=0, but that's not a good lead for computation either. – qwark Oct 02 '16 at 20:47
  • 2
    On modern CPUs `std::sqrt` compiles to a single instruction. It's hard to believe that you can do anything faster than that with similar accuracy. – plasmacel Oct 02 '16 at 21:00
  • @plasmacel Please have a look at the link I've put; you'll see that a couple of instructions may be way faster than a single one. Enjoy the read. – qwark Oct 02 '16 at 21:05
  • 1
    @user3091460 If `float` precision is sufficient, why not call `logf()` from `cmath`? Or is the issue that you need the full input domain of `double`, but the result computed only to accuracy equivalent to `float` (about 6 decimal digits)? – njuffa Oct 02 '16 at 21:05
  • @njuffa I will try it and compare what the compiler emits. – qwark Oct 02 '16 at 21:06
  • 1
    @user3091460 Well the calculation of the error is not correct on that site. `sqrtss` is accurate to full precision, while `rsqrtss * x` followed by a single Newton-Raphson step still doesn't give full precision. – plasmacel Oct 02 '16 at 21:11
  • @user3091460 Don't forget trying relevant compiler flags as well, in addition to -O3 maybe turning off denormal support (= turning on FTZ [flush-to-zero] mode), turning on "fast math", enabling the use of a vector math library (e.g. with the Intel compiler). – njuffa Oct 02 '16 at 21:13
  • @user3091460 It's very related: http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x – plasmacel Oct 02 '16 at 21:17
  • 2
    What makes you think your implementation's `std::log` doesn't already use the most efficient algorithm available for your system? If you're willing to sacrifice accuracy for speed (I might say something about getting wrong answers quickly), you need to say so in your question. – Keith Thompson Oct 02 '16 at 21:53
  • 1
    For now I use a 3rd order Taylor approximation of ln(1+x) with x between -0.5 and +0.5 (90% of the cases for max error of 4%) and fall back to std::log otherwise. Gave me 15% speed-up. – qwark Oct 03 '16 at 08:30
  • @plasmacel it's fairly common for software implementations of math functions implemented in terms of SSE to outperform x87 FPU instructions by a good margin – Jean-Michaël Celerier Oct 08 '21 at 13:10
  • @Jean-MichaëlCelerier I didn't talk about x87 FPU instructions, only about x86 SSE. – plasmacel Oct 08 '21 at 15:07

8 Answers

21

Prior to embarking on the design and deployment of a customized implementation of a transcendental function for performance, it is highly advisable to pursue optimizations at the algorithmic level, as well as through the toolchain. Unfortunately, we do not have any information about the code to be optimized here, nor do we have information on the toolchain.

At the algorithmic level, check whether all calls to transcendental functions are truly necessary. Maybe there is a mathematical transformation that requires fewer function calls, or converts transcendental functions into algebraic operations. Are any of the transcendental function calls possibly redundant, e.g. because the computation is unnecessarily switching in and out of logarithmic space? If the accuracy requirements are modest, can the whole computation be performed in single precision, using float instead of double throughout? On most hardware platforms, avoiding double-precision computation can lead to a significant performance boost.
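As a toy illustration of such a transformation (hypothetical functions, not from the question), combining logarithms algebraically can cut three transcendental calls down to one:

```cpp
#include <cmath>

// Three transcendental calls:
double sum_of_logs(double a, double b, double c) {
    return std::log(a) + std::log(b) - std::log(c);
}

// ...reduced to one, assuming a, b, c > 0 and a*b/c neither overflows
// nor loses too much precision for the use case.
double log_of_ratio(double a, double b, double c) {
    return std::log(a * b / c);
}
```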

Compilers tend to offer a variety of switches that affect the performance of numerically intensive code. In addition to increasing the general optimization level to -O3, there is often a way to turn off denormal support, i.e. turn on flush-to-zero, or FTZ, mode. This has performance benefits on various hardware platforms. In addition, there is often a "fast math" flag whose use results in slightly reduced accuracy and eliminates overhead for handling special cases such as NaNs and infinities, plus the handling of errno. Some compilers also support auto-vectorization of code and ship with a SIMD math library, for example the Intel compiler.

A custom implementation of a logarithm function typically involves separating a binary floating-point argument x into an exponent e and a mantissa m, such that x = m * 2^e, therefore log(x) = log(2) * e + log(m). m is chosen such that it is close to unity, because this provides for efficient approximations, for example log(m) = log(1+f) = log1p(f) by minimax polynomial approximation.
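As a quick sanity check of this decomposition in portable C++ (illustrative only; the function name is mine):

```cpp
#include <cmath>

// Reconstructs log(x) from the decomposition x = m * 2^e, m in [0.5, 1):
// log(x) = log(2) * e + log(m).
double log_via_frexp(double x)
{
    int e;
    double m = std::frexp(x, &e); // split x into mantissa and exponent
    return std::log(2.0) * e + std::log(m);
}
```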

C++ provides the frexp() function to separate a floating-point operand into mantissa and exponent, but in practice one typically uses faster machine-specific methods that manipulate floating-point data at the bit level by re-interpreting them as same-size integers. The code below for the single-precision logarithm, logf(), demonstrates both variants. Functions __int_as_float() and __float_as_int() provide for the reinterpretation of an int32_t into an IEEE-754 binary32 floating-point number and vice-versa. This code heavily relies on the fused multiply-add operation FMA supported directly in the hardware on most current processors, CPU or GPU. On platforms where fmaf() maps to software emulation, this code will be unacceptably slow.

#include <cmath>
#include <cstdint>
#include <cstring>

float __int_as_float (int32_t a) { float r; memcpy (&r, &a, sizeof r); return r;}
int32_t __float_as_int (float a) { int32_t r; memcpy (&r, &a, sizeof r); return r;}

/* compute natural logarithm, maximum error 0.85089 ulps */
float my_logf (float a)
{
    float i, m, r, s, t;
    int e;

#if PORTABLE
    m = frexpf (a, &e);
    if (m < 0.666666667f) {
        m = m + m;
        e = e - 1;
    }
    i = (float)e;
#else // PORTABLE
    i = 0.0f;
    if (a < 1.175494351e-38f){ // 0x1.0p-126
        a = a * 8388608.0f; // 0x1.0p+23
        i = -23.0f;
    }
    e = (__float_as_int (a) - __float_as_int (0.666666667f)) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = fmaf ((float)e, 1.19209290e-7f, i); // 0x1.0p-23
#endif // PORTABLE
    /* m in [2/3, 4/3] */
    m = m - 1.0f;
    s = m * m;
    /* Compute log1p(m) for m in [-1/3, 1/3] */
    r =             -0.130310059f;  // -0x1.0ae000p-3
    t =              0.140869141f;  //  0x1.208000p-3
    r = fmaf (r, s, -0.121483512f); // -0x1.f198b2p-4
    t = fmaf (t, s,  0.139814854f); //  0x1.1e5740p-3
    r = fmaf (r, s, -0.166846126f); // -0x1.55b36cp-3
    t = fmaf (t, s,  0.200120345f); //  0x1.99d8b2p-3
    r = fmaf (r, s, -0.249996200f); // -0x1.fffe02p-3
    r = fmaf (t, m, r);
    r = fmaf (r, m,  0.333331972f); //  0x1.5554fap-2
    r = fmaf (r, m, -0.500000000f); // -0x1.000000p-1  
    r = fmaf (r, s, m);
    r = fmaf (i,  0.693147182f, r); //  0x1.62e430p-1 // log(2)
    if (!((a > 0.0f) && (a < INFINITY))) {
        r = a + a;  // silence NaNs if necessary
        if (a  < 0.0f) r = INFINITY - INFINITY; //  NaN
        if (a == 0.0f) r = -INFINITY;
    }
    return r;
}

As noted in the code comment, the implementation above provides faithfully-rounded single-precision results, and it deals with exceptional cases consistent with the IEEE-754 floating-point standard. The performance can be increased further by eliminating special-case support, eliminating the support for denormal arguments, and reducing the accuracy. This leads to the following exemplary variant:

/* natural log on [0x1.f7a5ecp-127, 0x1.fffffep127]. Maximum relative error 9.4529e-5 */
float my_faster_logf (float a)
{
    float m, r, s, t, i, f;
    int32_t e;

    e = (__float_as_int (a) - 0x3f2aaaab) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = (float)e * 1.19209290e-7f; // 0x1.0p-23
    /* m in [2/3, 4/3] */
    f = m - 1.0f;
    s = f * f;
    /* Compute log1p(f) for f in [-1/3, 1/3] */
    r = fmaf (0.230836749f, f, -0.279208571f); // 0x1.d8c0f0p-3, -0x1.1de8dap-2
    t = fmaf (0.331826031f, f, -0.498910338f); // 0x1.53ca34p-2, -0x1.fee25ap-2
    r = fmaf (r, s, t);
    r = fmaf (r, s, f);
    r = fmaf (i, 0.693147182f, r); // 0x1.62e430p-1 // log(2) 
    return r;
}
njuffa
  • 1
    Thanks for that; however, I can't find int_as_float and float_as_int using MSVC 15 on Win10. I found that they're part of CUDA, but I didn't download the full package. – qwark Oct 03 '16 at 08:27
  • 1
    @user3091460 These functions are *abstractions* of machine-specific functionality. As a first step, you can simply use `memcpy()`, e.g. `float __int_as_float(int32_t a) { float r; memcpy (&r, &a, sizeof(r)); return r;}` A good compiler will likely optimize this appropriately, but depending on the hardware you are targeting (which you haven't disclosed), there may be better ways, possibly involving intrinsics or inline assembly. – njuffa Oct 03 '16 at 08:59
  • @user3091460 and njuffa: the optimal asm for x86 will probably do any manipulation of floats as integers using SSE2 integer instructions, because the XMM registers are used for both scalar/vector floats and vector ints. So you should probably `_mm_set_ss(your_float)` and `_mm_castps_si128` that to get a `__m128i` that you can manipulate. (That may waste an instruction zeroing the high bits of the xmm register, [because of design limitations in intrinsics.](http://stackoverflow.com/q/39318496/224132)). A MOVD to get the float bits to/from an integer register might be good, too. – Peter Cordes Oct 03 '16 at 09:15
  • @PeterCordes Understood. I wasn't going to invest the considerable work needed for a turn-key SIMD-intrinsic solution, in particular given that it is still unclear what ISA extensions are available on the asker's hardware. Consider posting your own version using SIMD intrinsics, and I'd be happy to upvote. – njuffa Oct 03 '16 at 09:18
  • I'll just go as far as an efficient float_to_int that uses a union to type-pun, and compiles to a single `movd eax, xmm0` with clang and gcc for x86. https://godbolt.org/g/UCePpA. It's just as easy as you'd hope it would be, @user3091460 :) Manipulating the integer as an `uint32_t` might even be more efficient, since integer instructions are shorter, and on Haswell can run on port6 (which doesn't have any vector ALUs). But probably it would be better to stay in XMM registers, since you're not going very much in integer. – Peter Cordes Oct 03 '16 at 09:24
  • @PeterCordes Note that the version based on memcpy() also works fine, and avoids undefined behavior (AFAIK type punning through unions is not sanctioned by the C++ standard): https://godbolt.org/g/VXo7gj – njuffa Oct 03 '16 at 16:38
  • @njuffa: Oh right, this is a C++ question, [not C](http://stackoverflow.com/questions/11639947/is-type-punning-through-a-union-unspecified-in-c99-and-has-it-become-specified). Compilers are pretty good at optimizing away the memcpy, but I suspect that in more complex cases you might still get an actual store/reload. GNU C++ guarantees that union-based type punning works like it does in C. I think most other C++ compilers guarantee the same thing, but I guess if you get good results from the fully portable way on the target you care about, then go with that. – Peter Cordes Oct 03 '16 at 17:27
  • How did you generate the polynomials for these? – gct Jan 04 '19 at 03:00
  • @SeanMcAllister I generate minimax approximation using the [Remez algorithm](https://en.wikipedia.org/wiki/Remez_algorithm), often followed by refinement based on heuristic searches. – njuffa Jan 04 '19 at 18:36
  • What do you use to generate the approximation? Sollya? Mathematica wasn't great at it. Can you explain your trick for getting a float into a [-1/3,+1/3] range too? – gct Jan 05 '19 at 04:01
  • @SeanMcAllister I use my own software based on Remez. The Sollya tool should work well (I haven't used it myself but worked with a colleague who did). `0x3f2aaaab` is the binary representation of 2/3 in IEEE-754 single precision (I probably should have added a comment there), so you can modify this to reduce to other ranges by adjusting the constant. The rest is straightforward bit manipulation to separate exponent and mantissa (`0xff800000` covers sign and exponent fields). – njuffa Jan 05 '19 at 04:09
  • @SeanMcAllister So `0x3f2aaaab` is used to reduce to [2/3, 4/3]; it is the lower bound of the interval. Other codes you may encounter will reduce to [sqrt(2)/2, sqrt(2)] instead, in which case one would use `0x3f3504f3`. I have also experimented with reduction to [3/4, 3/2], in which case the constant is `0x3f400000` (constants with only few bits are more efficient on some architectures). – njuffa Jan 05 '19 at 04:31
  • @njuffa What could be the reason why I get my_logf()*0.4342944819f 2x faster than std::log10() but, std::log10() over x42 faster than my_faster_logf()*0.4342944819f (Ubuntu/latest GCC in use)? My CPU is an old i5-750. Also, if I would input only in range [0,1] to make linear to dB conversion, could your functions be even faster... but, what all changes should be done to achieve sufficient accuracy of 0.001dB in conversion? – Juha P Mar 26 '21 at 13:38
  • @JuhaP Sorry, I don't understand what you are asking about. This is a Q&A site, not a discussion forum. Consider asking a site-appropriate question if a search for potential duplicates on this site comes up empty. – njuffa Mar 26 '21 at 15:28
  • Thanks for the quick reply. OK, forget the latter question, my main question was why those your two logf() functions have such huge difference in performance (in your text the simplified function my_faster_logf() should increase performance but, for me, its hundred times slower than your my_logf() function is)? – Juha P Mar 26 '21 at 16:44
  • @JuhaP If you compile with the same compiler for the same platform using the same compilation switches, `my_faster_logf()` should be about twice the speed of `my_logf()`. If your platform does not support fused multiply-add (FMA) operations in hardware, calls to `fmaf()` will be very slow, since the functionality will need to be emulated. – njuffa Mar 26 '21 at 17:30
  • @njuffa That also was my thought but, in practice, it isn't the situation. I use exact same switches, even compare against each other (test I'm using results value of 421.346 for my_faster_logf() against 4.17925 for my_logf() (std::log10() gets result of ~9.7) so, my_faster_logf() does not work well (at least in performance testing situation). Here's link for the test routines I'm using: https://www.kvraudio.com/forum/viewtopic.php?p=7170633#p7170633 – Juha P Mar 26 '21 at 18:03
  • @JuhaP Core i5-750 (Lynnfield, ca. 2010) does **not** have FMA in hardware. So I would expect `fmaf()` to be very slow on this processor. The FMA operation was added with Haswell in 2013. Sorry, I am not going to debug for you. You may need to look at the generated code. I just confirmed, again (different compiler and different CPU than when I originally posted this code), that `my_faster_logf()` runs at about twice the speed of `my_logf()` [`PORTABLE = 0`] . – njuffa Mar 26 '21 at 18:15
  • OK, thanks. I'll try to debug the generated code. – Juha P Mar 26 '21 at 19:16
  • @JuhaP For x86-64 platforms, I see around 20 instructions generated for `my_faster_logf()` versus about 45 instructions for the fastpath of `my_logf()`. – njuffa Mar 26 '21 at 19:20
  • @njuffa OK, now that you took all my_logf() functionality out of the "if ((a > 0.0f) && (a <= 3.40282347e+38f)) { // 0x1.fffffep+127 } else { // silence NaNs }" structure, this approximation supports emulated mode. Now I get 1023.46 for the my_logf() approximation ... which I believe is OK result compared to 421.346 for my_faster_logf() ... and this also proves that fmaf() is emulated here. – Juha P Mar 26 '21 at 21:03
  • @JuhaP The code update today merely moved the special case handling to the end, and the generated assembly code differs *minimally* from that generated for the previous version of the code. On my machine here (Intel Xeon W 2133; Skylake-W) I see approximate throughput of one every 2.3ns for `my_faster_logf()` and one every 5.8ns for `my_logf()`, and MSVC compiler's built-in `logf()` at one every 4.8ns. For fast `logf` on [0,1), I would suggest unconditionally multiplying the input of `my_faster_logf()` by `16777216` (= 2**24) and subtracting `16.63553233f` from its result. – njuffa Mar 26 '21 at 21:29
  • 1
    Thank you so much for this!!! Your break down of x to m * 2^e was exactly the missing piece I needed to get good performance for an arbitrary precision decimal implementation I am writing. – StormCrow Oct 10 '21 at 18:42
4

Take a look at this discussion; the accepted answer refers to an implementation of a function for computing logarithms based on the Zeckendorf decomposition.

In the comments in the implementation file there is a discussion about complexity and some tricks to reach O(1).

Hope this helps!

Mo Abdul-Hameed
2
#include <math.h>
#include <iostream>

constexpr int LogPrecisionLevel = 14;
constexpr int LogTableSize = 1 << LogPrecisionLevel;

double log_table[LogTableSize];

void init_log_table() {
    for (int i = 0; i < LogTableSize; i++) {
        log_table[i] = log2(1 + (double)i / LogTableSize);
    }
}

double fast_log2(double x) { // x>0
    long long t = *(long long*)&x;
    int exp = (t >> 52) - 0x3ff;
    int mantissa = (t >> (52 - LogPrecisionLevel)) & (LogTableSize - 1);
    return exp + log_table[mantissa];
}

int main() {
    init_log_table();

    double d1 = log2(100); //6.6438561897747244
    double d2 = fast_log2(100); //6.6438561897747244
    double d3 = log2(0.01); //-6.6438561897747244
    double d4 = fast_log2(0.01); //-6.6438919626096089
}
Saar Ibuki
  • Your type-punning violates strict-aliasing. (Use `memcpy` instead of pointer-casting. Also you should probably use `unsigned long long` because you don't need an arithmetic shift. Doesn't matter for correctness on a 2's complement machine, but still.) This also requires that integer endian matches float endian, like on x86, so you should at least document that. – Peter Cordes Sep 04 '18 at 19:20
  • Some text to explain the table-lookup strategy and actual relative / absolute precision over some range of inputs, and limitations like what happens for 0 or negative inputs, would be a good idea. – Peter Cordes Sep 04 '18 at 19:21
  • Your table only needs to be `float`. That would cut your cache footprint in half. (But a 2^14 * 4-byte table is still 64kiB. You will get lots of cache misses in most use-cases, which is why most fast-log implementations use a polynomial approximation on modern CPUs, not a table lookup. Especially when you can use SIMD: [Efficient implementation of log2(\_\_m256d) in AVX2](https://stackoverflow.com/q/45770089)) – Peter Cordes Sep 04 '18 at 19:30
  • 1
    Peter, sorry for commenting on a very old answer, but is the strict-aliasing rule really being broken here? I guess you're referring to the very first line in the fast_log2 function. I'd assume there's really no aliasing here and that the value of x is being copied, reinterpreted as a long long (so very similar behavior to memcpy). Unless I'm missing something, t is not an alias, right? – Asher May 26 '21 at 22:16
  • @Asher - didn't see your reply since you didn't @ notify me. `*(long long*)&x;` dereferences a `long long*` that's pointing to a `double` object. That's strict-aliasing UB. See also *[Unions, aliasing and type-punning in practice: what works and what does not?](https://stackoverflow.com/q/54762186)* Some compilers happen to compile it the same way they would `memcpy(&t, &x, sizeof(t));` or `auto t = std::bit_cast(x);`, as least in simple cases, but it's not guaranteed unless you disable type-based alias analysis optimization with `gcc -fno-strict-aliasing` – Peter Cordes Nov 26 '22 at 23:07
2

Slight improvement in precision and performance:

#include <bit>   // C++20

//fast_log abs(rel) : avgError = 2.85911e-06(3.32628e-08), MSE = 4.67298e-06(5.31012e-08), maxError = 1.52588e-05(1.7611e-07)
const float s_log_C0 = -19.645704f;
const float s_log_C1 = 0.767002f;
const float s_log_C2 = 0.3717479f;
const float s_log_C3 = 5.2653985f;
const float s_log_C4 = -(1.0f + s_log_C0) * (1.0f + s_log_C1) / ((1.0f + s_log_C2) * (1.0f + s_log_C3)); //ensures that log(1) == 0
const float s_log_2 = 0.6931472f;

// assumes x > 0 and that it's not a subnormal.
// Results for 0 or negative x won't be -Infinity or NaN
inline float fast_log(float x)
{
    unsigned int ux = std::bit_cast<unsigned int>(x);
    int e = static_cast<int>(ux - 0x3f800000) >> 23; //e = exponent part can be negative
    ux |= 0x3f800000;
    ux &= 0x3fffffff; // 1 <= x < 2  after replacing the exponent field
    x = std::bit_cast<float>(ux);
    float a = (x + s_log_C0) * (x + s_log_C1);
    float b = (x + s_log_C2) * (x + s_log_C3);
    float c = (float(e) + s_log_C4);
    float d = a / b;
    return (c + d) * s_log_2;
}

Although it contains a division, which is usually considered a slow operation, many of the other operations can be executed in parallel by the processor.

Peter Cordes
jenkas
2

I propose the following version of the code for ln(x), written in Visual Studio (x86). Here b0,...,b3 are the adjusted coefficients of the polynomial, obtained as a Chebyshev approximation to the function

f(t) = lb((5+t)/(5-t))/t , -1<=t<=1 .

First, t/5 = (x-1)/(x+1) is calculated, where x is in [0.75; 1.5]. After evaluating the approximation f(t), the result is calculated by the formula

ln(x) = (t*f(t)+k)*ln(2) ,

where k is the number by which it was necessary to reduce the exponent of x to bring it into the range [0.75; 1.5]. Maximum relative error is 2.53708704e-07 and it is 3.49952803 ulps. The RMS relative error is 5.06926602e-08.

_declspec(naked) float _vectorcall ln(float x)
{
  static const float ct[6] =       // Constant table
  {
    1.0f,                          // 1
    0.576110899f,                  // b2*5^5
    0.961808264f,                  // b1*5^3
    2.88539004f,                   // b0*5
    0.442831367f,                  // b3*5^7
    0.693147181f                   // ln(2)
  };
  _asm
  {
    vmovd eax,xmm0                 // Read the binary representation of the x into eax 
    mov edx,-127                   // In edx: neg offset of the exponent of normalized numbers
    cmp eax,0x7F800000             // Compare x with the Inf value
    jnc ln_bad_x                   // Break calculations if x<=-0, x=+Inf or x=+NaN
    ror eax,23                     // Shift eax so that its exponent is in al
    movzx ecx,al                   // Get the exponent with an offset of +127 in ecx
    jecxz ln_denorm                // Jump if x=0 or x is denormalized
      ln_calc:                     // The entry point after normalization of the number
    setnc al                       // al=1 if the mantissa is less than 1.5, otherwise al=0
    adc ecx,edx                    // ecx=k - the integer part of the binary logarithm
    or eax,126                     // Change the exponent of x so that it becomes 0.75<=x<1.5
    ror eax,9                      // In eax: the value of x, in cf: its sign bit
    vmovd xmm0,eax                 // Write the reduced value of x into xmm0
    mov eax,offset ct              // In eax the address of the constant table
    vaddss xmm1,xmm0,[eax]         // xmm1 = x+1
    vsubss xmm0,xmm0,[eax]         // xmm0 = x-1
    vcvtsi2ss xmm3,xmm3,ecx        // xmm3 = k, the integer addition to the binary logarithm
    vdivss xmm0,xmm0,xmm1          // xmm0 = (x-1)/(x+1)=t/5
    vmovss xmm1,[eax+16]           // xmm1 = 5^7*b3 - initialize the sum
    vmulss xmm2,xmm0,xmm0          // xmm2 = t^2/25 - prepare the argument of the polynomial
    vfmadd213ss xmm1,xmm2,[eax+4]  // Step 1 calculating the polynomial by the Horner scheme
    vfmadd213ss xmm1,xmm2,[eax+8]  // Step 2 calculating the polynomial by the Horner scheme
    vfmadd213ss xmm1,xmm2,[eax+12] // Step 3 calculating the polynomial by the Horner scheme
    vfmadd213ss xmm0,xmm1,xmm3     // xmm0 = t*f(t)+k - ready binary logarithm
    vmulss xmm0,xmm0,[eax+20]      // Convert binary logarithm to natural
    ret                            // Return
      ln_denorm:                   // Processing denormalized values of x including x=0
    bsr ecx,eax                    // Search for the highest set bit; zf=1 if x=+0
    mov dl,98                      // 31 is added to the exponent, so we take edx=-158
    ror eax,cl                     // Form a mantissa of a normalized number x*2^31
    jnz ln_calc                    // Go to calculate ln(x) if x>0
    mov dl,128                     // Form the highest word of the value -Inf in dx
    vpinsrw xmm0,xmm0,edx,1        // Replace the highest word of lowest float in xmm0
    ret                            // Return the result –Inf for x=+0
      ln_bad_x:                    // The entry point for x<=-0, x=+Inf or x=+NaN
    jnl ln_exit                    // Return x for x=+Inf or x=+NaN
    vsqrtss xmm0,xmm0,xmm0         // The root of a negative number gives the result NaN
    vrcpss xmm0,xmm0,xmm0          // In the case of x=-0 generate the result -Inf
      ln_exit:                     // The result in xmm0 is ready
    ret                            // Return
  }
}
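For readers who don't read x86 assembly, here is a portable C++ sketch of the same scheme (my transcription of the algorithm described above, without the fine-tuned bit tricks and special-case handling; accuracy may differ slightly without FMA):

```cpp
#include <cmath>

// Portable sketch: reduce x to m * 2^k with m in [0.75, 1.5), set
// u = (m-1)/(m+1) = t/5, evaluate the polynomial for lb((5+t)/(5-t))
// by the Horner scheme, then scale by ln(2).
float ln_sketch(float x)
{
    int e;
    float m = std::frexp(x, &e);        // m in [0.5, 1), x = m * 2^e
    if (m < 0.75f) { m += m; --e; }     // now m in [0.75, 1.5), k = e
    float u  = (m - 1.0f) / (m + 1.0f); // t/5
    float u2 = u * u;
    float p  =   0.442831367f;          // b3*5^7
    p = p * u2 + 0.576110899f;          // b2*5^5
    p = p * u2 + 0.961808264f;          // b1*5^3
    p = p * u2 + 2.88539004f;           // b0*5
    return (u * p + (float)e) * 0.693147181f; // (t*f(t)+k)*ln(2)
}
```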
aenigma
  • For the fast path (neither branch taken) you could critical path reduce latency by using integer shifts on the XMM register, `vpsrld xmm2, xmm0, 23` in parallel with `vmovd eax, xmm0`. (Subtracting the high bit of the mantissa from an exponent would take more shifts, might be more costly for throughput.) Also, `test al,al` / `jz ln_denorm` would be 1 uop (macro-fused on big-core Intel/AMD), vs. 2 uops for `jecxz` on Intel separate from the 1 uop for `movzx`. Oh, but I guess you're preserving FLAGS from `ror`? Interesting. If you do the shift in SIMD, that might change. – Peter Cordes Feb 01 '23 at 05:58
  • Anyway, looks pretty good; no obvious missed optimizations that could be changed without changing other stuff. (Assuming a CPU without partial-register stalls; using AVX means no P6-family CPU can run this, and only Sandybridge renames AL separately from EAX, but can merge efficiently if necessary. Later CPUs have no penalty for `add al, imm8` followed by reading EAX as part of `ror eax, 9`.) – Peter Cordes Feb 01 '23 at 06:03
  • `salc` is 3 uops on Intel Sandy Bridge and later, so that's not great (https://agner.org/optimize/ includes counts for it; https://uops.info/ doesn't because it's undocumented(!)). `sbb al,al` would also set AL similarly, and I think preserve CF (but with a dependency through it). `setnc al` / `sub al, 130` should work to produce 127 or 126 according to -CF. 2 uops total on AMD or Intel, break-even on AMD vs. saving 2 on Intel. – Peter Cordes Feb 01 '23 at 06:11
  • Peter Cordes, thank you for the interesting comments. Indeed, I am not a big expert in fine optimizations. You're probably right, sbb al,al would be a little better than salc. – aenigma Feb 01 '23 at 06:25
  • All these look like great answers (posts with AVX, asm...) but it's outside my area of expertise. @PeterCordes do you think you'd have time to benchmark the advanced answers (AVX2, AVX...) with your added comments? It also seems all very hardware-specific. – qwark Feb 01 '23 at 08:36
  • @qwark: This is just scalar code; the interesting part is the micro-optimization of the bit-manipulation using a rotate and partial-register (`add al` instead of what a compiler would probably do if you wrote it in C, copying a whole 32-bit number and masking + recombining with `or`. But not too many extra instructions since it would probably have shifted instead of rotated.) If you mean other answers like kari's that operates on 8 floats at once (`__m256`), yeah that can be an 8x speedup vs. scalar if you want this for every element of an array. Some other pure C versions can auto-vectorize – Peter Cordes Feb 01 '23 at 16:30
  • 2
    @qwark: To obtain an objective assessment, it is advisable to conduct comparative testing of the algorithms you are interested in on the same machine (and in the same environment, if possible). If the algorithms provide significantly different accuracy, then they can also be compared using, for example, such an indicator as the ratio of operating time to the RMS relative error. – aenigma Feb 02 '23 at 02:32
  • 1
    A good test, in my opinion, is going through all floats from 0 to infinity and comparing the results with the double-precission values obtained by the library log function. – aenigma Feb 02 '23 at 02:46
  • 1
    It is correct to say, not the ratio, but the product of the working time and the error. The less, the better. – aenigma Feb 02 '23 at 03:05
  • 1
    @Peter Cordes: Indeed, to reduce the difference in performance on Intel and AMD, these three instructions (salc : adc ecx,edx : add al,127) can be replaced with the following: setnc al : adc ecx,edx : or al,126. – aenigma Mar 03 '23 at 04:24
1

I vectorized @njuffa's answer. Natural log, works with AVX2:

inline __m256 mm256_fmaf(__m256 a, __m256 b, __m256 c){
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

//https://stackoverflow.com/a/39822314/9007125
//https://stackoverflow.com/a/65537754/9007125
// vectorized version of the answer by njuffa
/* natural log on [0x1.f7a5ecp-127, 0x1.fffffep127]. Maximum relative error 9.4529e-5 */
inline __m256 fast_log_sse(__m256 a){

    __m256i aInt = _mm256_castps_si256(a); // reinterpret bits without strict-aliasing UB
    __m256i e =    _mm256_sub_epi32( aInt,  _mm256_set1_epi32(0x3f2aaaab));
            e =    _mm256_and_si256( e,  _mm256_set1_epi32(0xff800000) );
        
    __m256i subtr =  _mm256_sub_epi32(aInt, e);
    __m256 m =  _mm256_castsi256_ps(subtr);

    __m256 i =  _mm256_mul_ps( _mm256_cvtepi32_ps(e), _mm256_set1_ps(1.19209290e-7f));// 0x1.0p-23
    /* m in [2/3, 4/3] */
    __m256 f =  _mm256_sub_ps( m,  _mm256_set1_ps(1.0f) );
    __m256 s =  _mm256_mul_ps(f, f); 
    /* Compute log1p(f) for f in [-1/3, 1/3] */
    __m256 r =  mm256_fmaf( _mm256_set1_ps(0.230836749f),  f,  _mm256_set1_ps(-0.279208571f) );// 0x1.d8c0f0p-3, -0x1.1de8dap-2
    __m256 t =  mm256_fmaf( _mm256_set1_ps(0.331826031f),  f,  _mm256_set1_ps(-0.498910338f) );// 0x1.53ca34p-2, -0x1.fee25ap-2

           r =  mm256_fmaf(r, s, t);
           r =  mm256_fmaf(r, s, f);
           r =  mm256_fmaf(i, _mm256_set1_ps(0.693147182f),  r);  // 0x1.62e430p-1 // log(2)
    return r;
}
Kari
  • Note that your `mm256_fmaf` can compile as separate mul and add operations, with rounding of the intermediate product. It's *not* guaranteed to be an FMA. (And only some compilers, like GCC, will "contract" it into an FMA instruction for you, when the target supports FMA like most AVX2 machines do (not quite all: one VIA design). Probably best to just target AVX2+FMA3 and use `_mm256_fmadd_ps`, maybe with an optional fallback if you want, but not a misleadingly named and maybe slower `fma` function by default. – Peter Cordes Jan 03 '21 at 13:30
1

I needed as well a fast log approximation and by far the best one seems to be one based on Ankerls algorithm.

Explanation: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/

Code (copied from https://github.com/ekmett/approximate/blob/dc1ee7cef58a6b31661edde6ef4a532d6fc41b18/cbits/fast.c#L127):

double log_fast_ankerl(double a)
{
  static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, "Little endian is required!");
  union { double d; int x[2]; } u = { a };
  return (u.x[1] - 1072632447) * 6.610368362777016e-7;
}

Just one subtraction and one multiplication. It's surprisingly good and unbeatably fast.

Kevin Meier
  • 1
    You left out the extraction of the high half of a `double` as an `int` (e.g. x86 `psrlq xmm0, 32` if your compiler is smart enough to not bounce the value to integer registers and back), and `int` to `double` conversion (e.g. x86 `cvtdq2pd`). It's still pretty fast, especially if your compiler does a good job, but one should check the resulting asm to see if the compiler used a SIMD-integer subtract instead of `movq rax, xmm0` / `shr rax, 32` / `sub eax, 1072632447` / `cvtsi2sd xmm0, eax`. Unfortunately gcc and clang do bounce to scalar integer and back: https://godbolt.org/z/osYoj5bz1 – Peter Cordes Sep 10 '22 at 05:01
-1

It depends how accurate you need to be. Often log is called to get an idea of the magnitude of the number, which you can do essentially for free by examining the exponent field of a floating point number. That's also your first approximation. I'll put in a plug for my book "Basic Algorithms" which explains how to implement the standard library math functions from first principles.
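A minimal sketch of the "exponent field for free" idea (assuming IEEE-754 binary32 and positive, normal inputs; the function name is mine):

```cpp
#include <cstdint>
#include <cstring>

// Reads floor(log2(x)) for positive, normal float x straight from the
// exponent field of the binary32 representation -- no math calls at all.
int log2_floor(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);      // safe bit reinterpretation
    return (int)((bits >> 23) & 0xff) - 127;  // unbias the exponent
}
// Natural-log estimate: ln(x) ~= log2_floor(x) * 0.693147f,
// accurate only to within about ln(2), as the answer above suggests.
```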

Malcolm McLean
  • I'm looking for the natural logarithm for a real math application; I don't need double precision, float precision or even 1e-3, 1e-4 would be good – qwark Oct 02 '16 at 20:49
  • 2
    a link or book reference without quotation of the relevant parts is not an answer – BeyelerStudios Oct 02 '16 at 20:51