113

This is just to satisfy my own curiosity.

Is there an implementation of this:

float InvSqrt (float x)
{
   float xhalf = 0.5f*x;
   int i = *(int*)&x;
   i = 0x5f3759df - (i>>1);
   x = *(float*)&i;
   x = x*(1.5f - xhalf*x*x);
   return x;
}

in Rust? If it exists, post the code.

I tried it and failed. I don't know how to reinterpret the float's bits as an integer. Here is my attempt:

fn main() {
    println!("Hello, world!");
    println!("sqrt2: {}", sqrt2(100f64));
}

fn sqrt1(x: f64) -> f64 {
    x.sqrt()
}

fn sqrt2(x: f64) -> f64 {
    let mut x = x;
    let xhalf = 0.5 * x;
    // BUG: `as` converts the numeric value (100.0 -> 100);
    // it does not reinterpret the float's bit pattern.
    let mut i = x as i64;
    println!("i: {}, ", i);

    i = 0x5f375a86_i64 - (i >> 1);

    // Same problem here: a value conversion, not a bit reinterpretation.
    x = i as f64;
    x = x * (1.5f64 - xhalf * x * x);
    1.0 / x
}

Reference:
1. Origin of Quake3's Fast InvSqrt() - Page 1
2. Understanding Quake’s Fast Inverse Square Root
3. FAST INVERSE SQUARE ROOT.pdf
4. source code: q_math.c#L552-L572

Peter Cordes
Flyq
  • the C# Version: [Is it possible to write Quake's fast InvSqrt() function in C#?](https://stackoverflow.com/questions/268853/is-it-possible-to-write-quakes-fast-invsqrt-function-in-c) – Flyq Nov 28 '19 at 04:46
  • 4
    As I understand it, this code is UB in C due to violating the [strict aliasing rule](https://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule). The standard-blessed way to perform this kind of type punning is with a `union`. – trent Nov 28 '19 at 12:03
  • 4
    @trentcl: I don't think `union` works either. `memcpy` definitely works, though it's verbose. – Matthieu M. Nov 28 '19 at 16:10
  • 14
    @MatthieuM. Type punning with unions is [perfectly valid C](https://stackoverflow.com/a/25672839/7366707), but not valid C++. – Salem Nov 28 '19 at 16:43
  • @Bergi There is a logical bug; hope it won't bother you: [wrong code](https://github.com/flyq/datastruct-algorithm/blob/f55898ada268e2525b7117ca2cb134a56494c7c2/others/fast_sqrt/sqirt/src/main.rs) – Flyq Nov 29 '19 at 08:54
  • 4
    I suppose this question is fine from a pure-curiosity perspective, but please understand that times have changed. On x86, the `rsqrtss` and `rsqrtps` instructions, introduced with the Pentium III in 1999, are faster and more accurate than this code. ARM NEON has `vrsqrte` which is similar. And whatever calculations Quake III used this for would probably be done on the GPU these days anyway. – benrg Nov 30 '19 at 02:35
  • The term you're looking for is "type pun" float to integer. – Peter Cordes Nov 30 '19 at 10:48

3 Answers

99

I don't know how to encode the float number using integer format.

There is a function for that: f32::to_bits, which returns a u32. There is also a function for the other direction: f32::from_bits, which takes a u32 as its argument. These functions are preferred over mem::transmute, as the latter is unsafe and tricky to use.

With that, here is the implementation of InvSqrt:

fn inv_sqrt(x: f32) -> f32 {
    let i = x.to_bits();
    let i = 0x5f3759df - (i >> 1);
    let y = f32::from_bits(i);

    y * (1.5 - 0.5 * x * y * y)
}

(Playground)


This function compiles to the following assembly on x86-64:

.LCPI0_0:
        .long   3204448256        ; f32 -0.5
.LCPI0_1:
        .long   1069547520        ; f32  1.5
example::inv_sqrt:
        movd    eax, xmm0
        shr     eax                   ; i >> 1
        mov     ecx, 1597463007       ; 0x5f3759df
        sub     ecx, eax              ; 0x5f3759df - ...
        movd    xmm1, ecx
        mulss   xmm0, dword ptr [rip + .LCPI0_0]    ; x *= -0.5
        mulss   xmm0, xmm1                          ; x *= y
        mulss   xmm0, xmm1                          ; x *= y
        addss   xmm0, dword ptr [rip + .LCPI0_1]    ; x += 1.5
        mulss   xmm0, xmm1                          ; x *= y
        ret

I have not found any reference assembly (if you have, please tell me!), but it seems fairly good to me. I am just not sure why the float was moved into eax just to do the shift and integer subtraction. Maybe SSE registers do not support those operations?

clang 9.0 with -O3 compiles the C code to essentially the same assembly, so that's a good sign.


It is worth pointing out that if you actually want to use this in practice: please don't. As benrg pointed out in the comments, modern x86 CPUs have a specialized instruction for this function which is faster and more accurate than this hack. Unfortunately, 1.0 / x.sqrt() does not seem to optimize to that instruction. So if you really need the speed, using the _mm_rsqrt_ps intrinsics is probably the way to go. This, however, again requires unsafe code. I won't go into much detail here, as only a minority of programmers will actually need it.
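For the curious, a minimal sketch of that intrinsics route might look like this. It uses the scalar variant _mm_rsqrt_ss for a single value (x86-64 only, where SSE is part of the baseline instruction set; the hardware approximation is good to roughly 12 bits, so a Newton step may still be wanted):

```rust
#[cfg(target_arch = "x86_64")]
fn inv_sqrt_hw(x: f32) -> f32 {
    use std::arch::x86_64::*;
    // SAFETY: SSE is guaranteed on the x86-64 baseline, so `rsqrtss`
    // is always available on this target.
    unsafe { _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x))) }
}
```

For a whole slice of values, the packed variant _mm_rsqrt_ps processes four floats at a time in the same fashion.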

Lukas Kalbertodt
  • 4
    According to the Intel Intrinsics Guide there is no integer shift operation that only shifts the lowest 32-bit of the 128-bit register analog to `addss` or `mulss`. But if the other 96 bits of xmm0 can be ignored then one could use the `psrld` instruction. Same goes for integer subtraction. – fsasm Nov 28 '19 at 14:12
  • I'll admit to knowing next to nothing about Rust, but isn't "unsafe" basically a core property of fast_inv_sqrt? With its total disrespect for datatypes and such. – Gloweye Nov 28 '19 at 14:34
  • 13
    @Gloweye It's a different type of "unsafe" we talk about though. A fast approximation which gets a bad value too far from the sweet spot, versus something playing fast and loose with undefined behavior. – Deduplicator Nov 28 '19 at 16:02
  • 8
    @Gloweye: Mathematically, the last part of that `fast_inv_sqrt` is just one Newton-Raphson iteration step to find a better approximation of `inv_sqrt`. There's nothing unsafe about that part. The trickery is in the first part, which finds a good approximation. That works because it's doing an integer divide by 2 on the exponent part of the float, and indeed `sqrt(pow(0.5,x))=pow(0.5,x/2)` – MSalters Nov 29 '19 at 14:10
  • 1
    @fsasm: That's correct; `movd` to EAX and back is a missed optimization by current compilers. (And yes, calling conventions pass/return scalar `float` in the low element of an XMM and allow high bits to be garbage. But note that if it *was* zero-extended, it can easily stay that way: right shifting doesn't introduce non-zero elements and neither does subtraction from `_mm_set_epi32(0,0,0,0x5f3759df)`, i.e. a `movd` load. You would need a `movdqa xmm1,xmm0` to copy the reg before `psrld`. Bypass latency from FP instruction forwarding to integer and vice versa is hidden by `mulss` latency. – Peter Cordes Nov 30 '19 at 10:31
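MSalters' point above can be checked directly: shifting the float's bit pattern right by one roughly halves the biased exponent (negating and halving the logarithm), and the magic constant repairs the exponent bias. A small sketch of just that first guess, before the Newton step:

```rust
// First guess of the bit hack, without the Newton refinement:
// `>> 1` halves the biased exponent (and mangles the mantissa a bit);
// subtracting from 0x5f3759df restores the bias and tunes the mantissa.
fn first_guess(x: f32) -> f32 {
    f32::from_bits(0x5f3759df - (x.to_bits() >> 1))
}
```

For x = 4.0 this already yields about 0.483, close to the true 1/sqrt(4) = 0.5, which the single Newton-Raphson step then sharpens further.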
45

This one is implemented with Rust's lesser-known union type:

union FI {
    f: f32,
    i: i32,
}

fn inv_sqrt(x: f32) -> f32 {
    let mut u = FI { f: x };
    // Reading a union field reinterprets the bits of the last value
    // written, which is why the block must be marked unsafe.
    unsafe {
        u.i = 0x5f3759df - (u.i >> 1);
        u.f * (1.5 - 0.5 * x * u.f * u.f)
    }
}

I did some micro benchmarks using the criterion crate on an x86-64 Linux box. Surprisingly, Rust's own sqrt().recip() is the fastest. But of course, any micro benchmark result should be taken with a grain of salt.

inv sqrt with transmute time:   [1.6605 ns 1.6638 ns 1.6679 ns]
inv sqrt with union     time:   [1.6543 ns 1.6583 ns 1.6633 ns]
inv sqrt with to and from bits
                        time:   [1.7659 ns 1.7677 ns 1.7697 ns]
inv sqrt with powf      time:   [7.1037 ns 7.1125 ns 7.1223 ns]
inv sqrt with sqrt then recip
                        time:   [1.5466 ns 1.5488 ns 1.5513 ns]
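For comparison, the sqrt().recip() variant that won the benchmark needs no magic constant and no unsafe at all, just safe standard-library calls:

```rust
// The benchmark winner: plain std methods, no unsafe, no bit tricks.
fn inv_sqrt(x: f32) -> f32 {
    x.sqrt().recip()
}
```

On modern x86-64 this compiles down to a hardware square root followed by a division, which explains why it keeps up with the bit hack.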
edwardw
  • 27
    I am not in the least surprised `sqrt().inv()` is fastest. Both sqrt and inv are single instructions these days, and go pretty fast. Doom was written in the days when it wasn't safe to assume there was hardware floating point at all, and transcendental functions like sqrt would *definitely* have been software. +1 for the benchmarks. – Martin Bonner supports Monica Nov 28 '19 at 17:14
  • 4
    What surprises me is that `transmute` is apparently different from `to_` and `from_bits` -- I'd expect those to be instruction-equivalent even before optimization. – trent Nov 28 '19 at 23:15
  • @MartinBonnersupportsMonica Just to note, that's been true for x86 for a long time, but on ARM we do need to check versions. The Cortex-M4 only has full hardware support for single-precision floats, where the Cortex-M7 does single- and double-precision. So a modern tablet should be fine, but little embedded things may not be. These kind of tricks are nice to remember for those of us who still need to count RAM in bytes. :) – Graham Nov 29 '19 at 12:40
  • @Graham An excellent point! (The benchmark will almost certainly have been done on a platform with hardware FP, but there are probably more installed processors out there that don't do FP than that do.) It's a while since I hard to care about space - and then I was counting in (16-bit) words. – Martin Bonner supports Monica Nov 29 '19 at 12:59
  • @Graham: this particular implementation is restricted to single-precision floats too, so even on an M4 it has limited value. – MSalters Nov 29 '19 at 14:12
  • @MSalters unless the range of double precision floats is required, this is an approximation algorithm so that I figured single precision probably makes more sense. – edwardw Nov 29 '19 at 14:44
  • @MartinBonner All x86 FPUs since the original 8087 have supported a hardware [`fsqrt`](https://www.felixcloutier.com/x86/fsqrt), as well as transcendental functions like `fsin`, etc. However they were microcoded and much slower than InvSqrt would have been. Today you should use `rsqrtss` or `rsqrtps` if you need a fast approximate inverse square root. – benrg Nov 30 '19 at 02:39
  • 2
    @MartinBonner (Also, not that it matters, but sqrt isn't a [transcendental function](https://en.wikipedia.org/wiki/Transcendental_function).) – benrg Nov 30 '19 at 02:40
  • 4
    @MartinBonner: Any hardware FPU that supports division will normally also support sqrt. IEEE "basic" operations (+ - * / sqrt) are required to produce a correctly-rounded result; that's why SSE provides all of those operations but not exp, sin, or whatever. In fact, divide and sqrt typically run on the same execution unit, designed a similar way. See [HW div/sqrt unit details](https://stackoverflow.com/questions/54642663/how-sqrt-of-gcc-works-after-compiled-which-method-of-root-is-used-newton-rap). Anyway, they're still not fast compared to multiply, especially in latency. – Peter Cordes Nov 30 '19 at 10:36
  • @benrg: `fsqrt` / `sqrtps` are not microcoded; they're IEEE Basic operations directly supported by HW as a single uop, like `rsqrtps`. See my previous comment. – Peter Cordes Nov 30 '19 at 10:38
  • @edwardw was your microbenchmark measuring latency or throughput? Modern div/sqrt units are somewhat pipelined but not fully. (e.g. Skylake ~12 cycle latency, one per 3 cycle throughput for independent `float` inputs, or somewhat worse for `double`). But multiple is more heavily pipelined, e.g. 4 cycle latency and 0.5 cycle reciprocal throughput (so 8 in flight simultaneously). Your times of 1.6ns are only ~6 clock cycles at 4GHz which is unreasonably short; you're probably measuring throughput of sqrt+div averaged over a repeat loop. Latency for HW sqrt + div is maybe better too. – Peter Cordes Nov 30 '19 at 10:43
  • 1
    Anyway, Skylake has significantly better pipelining for div/sqrt than previous uarches. See [Floating point division vs floating point multiplication](//stackoverflow.com/a/45899202) for some extracts from Agner Fog's table. If you aren't doing much other work in a loop so sqrt + div is a bottleneck, you might want to use HW fast reciprocal sqrt (instead of the quake hack) + a Newton iteration. Especially with FMA that's good for throughput, if not latency. [Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision](//stackoverflow.com/q/31555260) – Peter Cordes Nov 30 '19 at 10:44
  • 1
    @PeterCordes it is throughput indeed. I literally wrapped the function call in a closure and fed it to an iterator as dictated by `criterion` API, which in turn took care of warmup, measurement and statistical analysis. This is the Mandalorian^H^H^H^H `criterion` way. – edwardw Dec 01 '19 at 00:51
10

You may use std::mem::transmute to perform the needed conversion:

fn inv_sqrt(x: f32) -> f32 {
    let xhalf = 0.5f32 * x;
    let mut i: i32 = unsafe { std::mem::transmute(x) };
    i = 0x5f3759df - (i >> 1);
    let mut res: f32 = unsafe { std::mem::transmute(i) };
    res = res * (1.5f32 - xhalf * res * res);
    res
}

You can look at a live example here.

  • 5
    There's nothing wrong with unsafe, but there's a way to do this without explicit unsafe block, so I'd suggest to rewrite this answer using [`f32::to_bits`](https://doc.rust-lang.org/std/primitive.f32.html#method.to_bits) and [`f32::from_bits`](https://doc.rust-lang.org/std/primitive.f32.html#method.from_bits). It also carries the intent clearly unlike transmute, which most people probably look at as "magic". –  Nov 28 '19 at 07:37
  • 6
    @Sahsahae I just posted an answer using the two functions you mentioned :) And I agree, `unsafe` should be avoided here, as it's not necessary. – Lukas Kalbertodt Nov 28 '19 at 07:40