2

The Intel Intrinsics Guide says that the intrinsic _mm256_rsqrt_ps has a relative error of at most 1.5*2^-12. But, when I compare the result of _mm256_rsqrt_ps to a standard C++ calculation of the inverse of the square root (1.0 / sqrt(x)), I get a relative error much greater than 1.5*2^-12.

I used the following program to test this:

#include <immintrin.h>
#include <iostream>
#include <math.h>

void test(float x) {
  float resP = _mm256_cvtss_f32(_mm256_rsqrt_ps(_mm256_set1_ps(x)));
  float res = 1.0 / sqrt(x);
  float relErr = fabs(resP - res) / res;
  std::cout << "x = " << x << std::endl;
  std::cout << "resP = " << resP << std::endl;
  std::cout << "res = " << res << std::endl;
  std::cout << "relErr = " << relErr << std::endl;
}

int main() {
  test(1e30);
  test(1e-30);
  test(1e17);
  test(1e-17);
}

It outputs the following:

    x = 1e+30
    resP = 1.00007e-15
    res = 1e-15
    relErr = 6.80803e-05
    x = 1e-30
    resP = 9.99868e+14
    res = 1e+15
    relErr = 0.0001316
    x = 1e+17
    resP = 3.16186e-09
    res = 3.16228e-09
    relErr = 0.000132569
    x = 1e-17
    resP = 3.16277e+08
    res = 3.16228e+08
    relErr = 0.000154825

As you can see, the relative error is significantly greater than 1.5*2^-12.

The instruction _mm256_rcp_ps also seems to have a much greater relative error than the intrinsics guide says it has.

Am I doing something wrong? Am I misunderstanding the intrinsics guide? Or is the intrinsics guide wrong?

Adrian Mole
  • 49,934
  • 160
  • 51
  • 83
jonicho
  • 55
  • 1
  • 5

1 Answers1

9

Your relative error is in the bound.

1.5*2^-12 = 0.000366

It's just powers of 2 not powers of 10.

Also is does not claim to have this relative error in comparison to the single precision 1/sqrt(x) but to the exact result.

Unlikus
  • 1,419
  • 10
  • 24
  • You are right, I'm dumb. I'm so used to seeing numbers written in scientific notation with base 10, I didn't notice that that are powers of 2 – jonicho Sep 27 '22 at 13:13
  • @user8461272: One way to grok the precision is that it's about 12 correct mantissa bits, about half the number of bits in a float. So [one Newton Raphson](https://stackoverflow.com/questions/31555260/fast-vectorized-rsqrt-and-reciprocal-with-sse-avx-depending-on-precision) iteration gets very close to full precision (maybe +-1ulp?), and is faster if you can't overlap with other work, or on old CPUs. That's why they chose this precision. But without that Newton iteration, it's *very* rough. It's designed as just a good initial guess, or if you really don't need much precision at all. – Peter Cordes Sep 28 '22 at 03:33