
Apparently the MSVC++ 2017 toolset v141 (x64 Release configuration) doesn't emit the FYL2X x86_64 instruction via a C/C++ intrinsic. Instead, calls to the C++ log() or log2() functions compile into real calls to a long library routine that appears to implement a software approximation of the logarithm (without using FYL2X). The measured performance is also strange: log() (natural logarithm) is 1.7667 times faster than log2() (base-2 logarithm), even though the base-2 logarithm should be easier for the processor, since the exponent (and the mantissa) is stored directly in the binary representation; presumably that is also why the FYL2X instruction computes the base-2 logarithm (multiplied by a parameter).

Here is the code used for measurements:

#include <chrono>
#include <cmath>
#include <cstdio>

const int64_t cnLogs = 100 * 1000 * 1000;

void BenchmarkLog2() {
  double sum = 0;
  auto start = std::chrono::high_resolution_clock::now();
  for(int64_t i=1; i<=cnLogs; i++) {
    sum += std::log2(double(i));
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  printf("Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

void BenchmarkLn() {
  double sum = 0;
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i++) {
    sum += std::log(double(i));
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  printf("Ln: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

int main() {
    BenchmarkLog2();
    BenchmarkLn();
    return 0;
}

The output for Ryzen 1800X is:

Log2: 95152910.728 Ops/sec calculated 2513272986.435
Ln: 168109607.464 Ops/sec calculated 1742068084.525

So to elucidate these phenomena (no usage of FYL2X and the strange performance difference), I would like to also measure the performance of FYL2X itself and, if it's faster, use it instead of <cmath>'s functions. Since MSVC++ doesn't allow inline assembly on x64, this requires a separate assembly-file function that uses FYL2X.

Could you provide the assembly code for such a function, using FYL2X or a better logarithm instruction (one that doesn't require a specific base), if any exists on newer x86_64 processors?

Serge Rogatch
  • What sort of accuracy/speed trade-off are you looking for? `FYL2X` isn't fast, though not as bad on Ryzen as it is on a modern Intel. – harold Aug 20 '17 at 20:41
  • @harold, I would like the accuracy/speed which the CPU provides natively, i.e. without software approximation. If there are multiple options in the hardware, I would like to consider them all. – Serge Rogatch Aug 20 '17 at 20:46
  • Log in a different base is usually done by multiplying the result by a constant ( `log_n(x) = log2(x) * log_n(2)`). Or just baking it in to the constants for the polynomial approximation for `log(mantissa)`. For example, see Agner Fog's vector class library implementation of a high-accuracy `log(__m256d)` and the versions for other bases. (The single-precision version is slightly simpler, not needing as large a polynomial to get near 1 ulp precision). – Peter Cordes Aug 20 '17 at 20:57
  • For a faster approximation for SSE2 log2(), see http://jrfonseca.blogspot.ca/2008/09/fast-sse2-pow-tables-or-polynomials.html. An AVX2 or AVX512 version of that with FMA for the polynomials is extremely efficient, even with a 5th order polynomial that makes it pretty accurate. Using `FYL2X` does not make sense on modern CPUs, unless you're optimizing for code-size, not speed. – Peter Cordes Aug 20 '17 at 21:03
  • See also http://gallium.inria.fr/blog/fast-vectorizable-math-approx/ for some stuff about how to evaluate relative and absolute error in a polynomial approximation, and doing a minimax fix of the coefficients instead of just using a Taylor series expansion. – Peter Cordes Aug 20 '17 at 21:03
  • IDK why the library you're using performs the way it does. Seems odd. – Peter Cordes Aug 20 '17 at 21:05
  • @PeterCordes I tried some stuff and found that a ratio of second-order polynomials could be fitted an order of magnitude better than a fifth-order polynomial, saving the latency of 3 FMAs at the cost of a division; maybe useful occasionally – harold Aug 20 '17 at 21:51
  • @harold: cool! In my use-case, each `log()` was independent, so OOO execution nicely hid the FMA latency, and it wasn't even worth using a higher-ILP shorter-latency polynomial evaluation method (["Estrin's scheme"](https://github.com/darealshinji/vectorclass/blob/master/vectormath_common.h#L196), which does some non-FMA multiplies before the FMAs). BTW, forgot to link Agner's log() / log2() in my previous comment: https://github.com/darealshinji/vectorclass/blob/master/vectormath_exp.h#L772. (Agner doesn't have an official repo, but that one gets updated regularly from upstream). – Peter Cordes Aug 20 '17 at 21:59
  • I think your discussion is more towards an answer to this question: https://stackoverflow.com/questions/45770089/efficient-implementation-of-log2-m256d-in-avx2 – Serge Rogatch Aug 20 '17 at 22:07
  • @harold: When implementing `fast_arcsinh`, I found that (on HSW and SKL) using a real `vsqrtps` was about the same speed as `vrsqrtps` + FMA (for `sqrt(x) = x * rsqrt(x)`) , and didn't bottleneck on divider throughput. Using `vrsqrtps` + a Newton iteration was slower than either. – Peter Cordes Aug 20 '17 at 22:08

1 Answer


Here is the assembly code using FYL2X:

_DATA SEGMENT

_DATA ENDS

_TEXT SEGMENT

PUBLIC SRLog2MulD

; Win64 calling convention: the first two double arguments arrive in XMM0 and XMM1.
; [rsp+8]..[rsp+39] is the caller-allocated shadow (home) space, which the callee
; may freely use as scratch.
; Computes toMul * log2(toLog): FYL2X calculates ST1 * log2(ST0).
; XMM0L=toLog
; XMM1L=toMul
SRLog2MulD PROC
  movq qword ptr [rsp+16], xmm1   ; spill toMul into the shadow space
  movq qword ptr [rsp+8], xmm0    ; spill toLog into the shadow space
  fld qword ptr [rsp+16]          ; ST0 = toMul
  fld qword ptr [rsp+8]           ; ST0 = toLog, ST1 = toMul
  fyl2x                           ; ST0 = ST1 * log2(ST0), stack popped once
  fstp qword ptr [rsp+8]          ; store the result, leaving the x87 stack empty
  movq xmm0, qword ptr [rsp+8]    ; return value goes in XMM0
  ret

SRLog2MulD ENDP

_TEXT ENDS

END

The calling convention follows https://learn.microsoft.com/en-us/cpp/build/overview-of-x64-calling-conventions , which states, for example:

The x87 register stack is unused. It may be used by the callee, but must be considered volatile across function calls.

The prototype in C++ is:

extern "C" double __fastcall SRLog2MulD(const double toLog, const double toMul);

The performance is about 1.8 times slower than std::log2() and more than 3 times slower than std::log():

Log2: 94803174.389 Ops/sec calculated 2513272986.435
FPU Log2: 52008300.525 Ops/sec calculated 2513272986.435
Ln: 169392473.892 Ops/sec calculated 1742068084.525

The benchmarking code is as follows:

void BenchmarkFpuLog2() {
  double sum = 0;
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i++) {
    sum += SRPlat::SRLog2MulD(double(i), 1);
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  printf("FPU Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}
Serge Rogatch
  • What CPU did you test on? I don't expect there are any modern CPUs where x87 `FYL2X` is faster than an accurate SSE2 library implementation, though. That amount of slowness is more than the extra store/reload overhead can account for (for moving data between xmm and x87 registers). – Peter Cordes Aug 20 '17 at 22:48
  • @PeterCordes, the CPU is AMD Ryzen 1800X at stock clocks. – Serge Rogatch Aug 21 '17 at 09:08